This document walks through, step by step and in detail, why Lasso (L1) regularization can produce exactly zero coefficients while Ridge (L2) regularization only shrinks coefficients and never makes them exactly zero. Every algebra step is shown with concrete numbers substituted in, along with the reasoning behind each step.
1. Problem setup (single-feature, standardized)
We keep the math intentionally simple so the algebra is transparent. Assume:
• A single predictor (feature).
• Feature is standardized so that XᵀX = 1 (this simplifies algebra).
• Denote the correlation of the feature with the target as z = Xᵀy.
We will solve for the single coefficient β (beta).
2. Ordinary ridge regression (L2) — derivation and numeric example
Objective (scalar feature):
L(β) = ‖y − Xβ‖² + λβ²
Expand the squared error term (brief reminder):
‖y − Xβ‖² = yᵀy − 2β Xᵀy + β² XᵀX
Using XᵀX = 1 and Xᵀy = z, the objective becomes:
L(β) = yᵀy − 2zβ + β² + λβ²
Drop the constant yᵀy (does not affect minimization):
L(β) = (1 + λ) β² − 2 z β
Differentiate w.r.t β and set derivative to zero:
dL/dβ = 2(1 + λ) β − 2 z = 0 ⇒ β = z / (1 + λ)
Interpretation (algebraic):
• The coefficient is scaled by 1/(1+λ). As λ increases, β shrinks towards 0.
• There is no case analysis and no absolute value: the objective is differentiable for every β, including β = 0.
Numeric example: choose z = 0.6, λ = 1
β_ridge = z / (1 + λ) = 0.6 / 2 = 0.3
So Ridge returns β = 0.3 (non-zero): shrunk relative to the unregularized least-squares value z = 0.6, but not set to zero.
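As a quick numerical sanity check of the closed form β = z/(1 + λ), the sketch below (plain Python with NumPy, using z = 0.6 and λ = 1 from the example; the helper name ridge_objective is just for illustration) minimizes the ridge objective over a fine grid of β values:

```python
import numpy as np

z, lam = 0.6, 1.0   # feature-target correlation and ridge penalty from the example

def ridge_objective(beta):
    # (1 + lam)*beta^2 - 2*z*beta: the ridge objective with the constant y^T y dropped
    return (1 + lam) * beta**2 - 2 * z * beta

betas = np.linspace(-1.0, 1.0, 200001)               # fine grid of candidate beta values
beta_hat = betas[np.argmin(ridge_objective(betas))]

print(beta_hat)           # ~0.3, matching z / (1 + lam)
print(z / (1 + 100.0))    # ~0.0059: a much larger lam shrinks beta further, but never to exactly 0
```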
3. Lasso regression (L1) — setup and why absolute value matters
Objective (scalar feature):
L(β) = ‖y − Xβ‖² + λ |β|
Expand the squared term and substitute XᵀX = 1, Xᵀy = z:
L(β) = β² − 2 z β + λ |β| (constant yᵀy dropped)
The absolute value |β| is not differentiable at β = 0. That forces a case-by-case analysis:
Region A: β > 0 (so |β| = β).
Region B: β < 0 (so |β| = −β).
Region C: β = 0 (special point — must be checked directly).
4. Full step-by-step solution with numbers (λ = 1, then λ = 2)
We will first plug in numerical z and λ values and show every algebra step so you can see when β becomes exactly zero.
Let z = 0.6 (same as Ridge example). We'll consider two penalties: λ = 1 and λ = 2.
Case: λ = 1 (moderate penalty)
Objective: L(β) = β² − 2(0.6)β + 1·|β| = β² − 1.2β + |β|
Region A: β > 0 ⇒ |β| = β. Then:
L(β) = β² − 1.2β + β = β² − 0.2 β
Derivative: dL/dβ = 2β − 0.2. Set to zero: 2β − 0.2 = 0 ⇒ β = 0.1
Check validity: β = 0.1 is indeed > 0, so this is a valid stationary point in Region A.
Region B: β < 0 ⇒ |β| = −β. Then:
L(β) = β² − 1.2β − β = β² − 2.2 β
Derivative: dL/dβ = 2β − 2.2 ⇒ set =0 ⇒ β = 1.1 (which is not < 0).
Thus Region B yields no valid solution.
Region C: β = 0 must also be checked. Compute objective values:
L(0) = 0 (every term of β² − 1.2β + |β| vanishes at β = 0).
L(0.1) = (0.1)² − 1.2(0.1) + 0.1 = 0.01 − 0.12 + 0.1 = −0.01
Since L(0.1) < L(0), the minimizing β is β = 0.1 (non-zero).
Conclusion for λ = 1: Lasso returns β ≈ 0.1 (non-zero but shrunk relative to z).
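The same grid-based check works for the non-smooth Lasso objective. A minimal sketch (NumPy again; z = 0.6, λ = 1 as above) recovers the Region A solution:

```python
import numpy as np

z, lam = 0.6, 1.0

def lasso_objective(beta):
    # beta^2 - 2*z*beta + lam*|beta|, with the constant y^T y dropped
    return beta**2 - 2 * z * beta + lam * np.abs(beta)

betas = np.linspace(-1.0, 1.0, 200001)
beta_hat = betas[np.argmin(lasso_objective(betas))]

print(beta_hat)               # ~0.1, the Region A stationary point
print(lasso_objective(0.1))   # about -0.01, lower than L(0) = 0
```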
Case: λ = 2 (stronger penalty) — we will see β = 0
Objective: L(β) = β² − 1.2β + 2|β|
Region A: β > 0 ⇒ |β| = β:
L(β) = β² − 1.2β + 2β = β² + 0.8 β
Derivative: dL/dβ = 2β + 0.8. Setting to zero gives β = −0.4 (invalid, since we assumed β > 0).
So no stationary point in Region A.
Region B: β < 0 ⇒ |β| = −β:
L(β) = β² − 1.2β − 2β = β² − 3.2β
Derivative: dL/dβ = 2β − 3.2 ⇒ set to zero gives β = 1.6 (invalid, since we assumed β < 0).
So no stationary point in Region B.
Because there are no valid stationary points in either open region, we must check β = 0 (Region C).
Evaluate L(0): L(0) = 0 (plugging in β = 0).
Since the objective is convex, its minimum exists; it cannot lie in Region A or B (no valid stationary points there), so it must occur at the kink: β = 0 is the minimizing solution.
Conclusion for λ = 2: Lasso returns β = 0 (exact zero).
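The region-by-region argument can also be written out directly as code. The sketch below (a hypothetical helper lasso_1d, assuming only the single-feature setup above) computes the stationary-point candidate in each open region, keeps it only when its sign matches the region, and falls back to β = 0 otherwise:

```python
def lasso_1d(z, lam):
    """Minimize beta^2 - 2*z*beta + lam*|beta| by the region-by-region case analysis."""
    objective = lambda b: b**2 - 2 * z * b + lam * abs(b)
    candidates = [0.0]          # Region C: the kink at beta = 0 is always a candidate
    beta_a = z - lam / 2        # Region A stationary point (valid only if > 0)
    if beta_a > 0:
        candidates.append(beta_a)
    beta_b = z + lam / 2        # Region B stationary point (valid only if < 0)
    if beta_b < 0:
        candidates.append(beta_b)
    return min(candidates, key=objective)

print(lasso_1d(0.6, 1.0))   # ~0.1 -> shrunk but non-zero
print(lasso_1d(0.6, 2.0))   # 0.0  -> exact zero: the penalty dominates
```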
5. Subgradient formalism (formal verification)
Because |β| is not differentiable at 0, we use subgradients to confirm optimality at β = 0.
Optimality condition for our objective (scalar case):
0 ∈ 2β − 2z + λ ∂|β|
Where ∂|β| is the subgradient set of absolute value:
• If β > 0, ∂|β| = {1}.
• If β < 0, ∂|β| = {−1}.
• If β = 0, ∂|β| = [−1, +1] (the entire interval).
Plugging β = 0 into the optimality condition gives:
0 ∈ − 2 z + λ [−1, 1] ⇒ 2 z ∈ λ [−1, 1] ⇒ |2 z| ≤ λ
With our numbers (z = 0.6, λ = 2): |2 z| = 1.2 ≤ 2, so β = 0 satisfies the subgradient condition and is optimal.
This is the precise algebraic statement of the earlier region exhaustion argument.
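Solving the subgradient condition case by case gives the soft-thresholding formula for this one-dimensional problem: β = sign(z) · max(|z| − λ/2, 0). A minimal sketch (pure Python; the helper name soft_threshold is just for illustration) applies it to both penalties:

```python
def soft_threshold(z, lam):
    # Closed-form minimizer of beta^2 - 2*z*beta + lam*|beta| (zero exactly when |2z| <= lam)
    sign = 1.0 if z >= 0 else -1.0
    return sign * max(abs(z) - lam / 2, 0.0)

print(soft_threshold(0.6, 1.0))   # ~0.1  (|2z| = 1.2 > lam = 1, so non-zero)
print(soft_threshold(0.6, 2.0))   # 0.0   (|2z| = 1.2 <= lam = 2, so exactly zero)
```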
6. Geometric intuition (short)
In the multi-dimensional case the intuition is particularly clear:
• Ridge: the L2 penalty contours are circles (hyperspheres in higher dimensions). The unpenalized loss contours are ellipses, and their point of tangency rarely lands exactly on a coordinate axis, so coefficients are shrunk but typically non-zero.
• Lasso: the L1 penalty contours are diamonds (cross-polytopes), whose corners lie on the coordinate axes (one or more coordinates are exactly zero there). The loss contour frequently first touches the diamond at a corner, producing exact zeros.
The 1D case we solved algebraically is the simplest demonstration of how the L1 kink enforces an axis-aligned minimum.
7. Comparison table (numeric recap)
Method | λ | β (result)
Ridge | 1 | 0.3
Lasso | 1 | 0.1
Lasso | 2 | 0.0
8. Detailed takeaways
• Lasso (L1) creates a non-differentiable kink at zero, leading to soft-thresholding: when the penalty dominates the feature correlation (here, when |2z| ≤ λ), the optimal coefficient is exactly zero. Ridge (L2) uses a smooth quadratic penalty, which only shrinks coefficients.
• Practical tip: Lasso is useful for feature selection (sparse coefficients). Ridge is useful when you want to keep all features with small, non-zero coefficients (no sparsity).
• Numeric tip: standardize predictors before applying Lasso so that the penalty is applied uniformly across features; a short scikit-learn sketch follows below.
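As a practical illustration of both tips, here is a minimal sketch using scikit-learn (assuming it is installed; note that scikit-learn's Lasso scales the penalty differently, minimizing (1/2n)‖y − Xβ‖² + α‖β‖₁, so its alpha is not numerically the same λ as above). Standardization is handled by StandardScaler inside a pipeline; on this synthetic data, Lasso zeroes out the irrelevant features while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # 5 features; only the first two matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.5)).fit(X, y)

print(lasso[-1].coef_)   # irrelevant features come out as exactly 0.0 (sparse)
print(ridge[-1].coef_)   # every coefficient is shrunk but non-zero
```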
