Lasso vs Ridge Regression: A Paper-and-Pen Explanation with Numbers
This document walks you through, step by step and in detail, why Lasso (L1) regularization can produce exactly zero coefficients while Ridge (L2) regularization only shrinks coefficients and never makes them exactly zero. Every algebra step is worked with concrete numbers, along with the reasoning behind each step.

1. Problem setup (single-feature, standardized)

We keep the math intentionally simple so the algebra stays transparent. Assume:

• A single predictor (feature).
• The feature is standardized so that XᵀX = 1 (this simplifies the algebra).
• Denote the correlation of the feature with the target as z = Xᵀy.

We will solve for the single coefficient β (beta).

2. Ordinary ridge regression (L2) — derivation and numeric example

Objective (scalar feature):

L(β) = (y − Xβ)² + λβ²

Expand the squared-error term (a brief reminder):

(y − Xβ)² = yᵀy − 2β Xᵀy + β² XᵀX

Using XᵀX = 1 and Xᵀy = z, the objective becomes:

L(β) = yᵀy − 2zβ + β² + λβ²

Drop the constant yᵀy (it does not affect the minimizer).
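As a quick numeric sanity check of the ridge objective above, the sketch below minimizes L(β) = −2zβ + (1 + λ)β² (the constant yᵀy dropped) by brute-force grid search and compares the result to the stationary point β = z / (1 + λ) that follows from setting dL/dβ = 0. The values z = 0.8 and λ = 0.5 are illustrative choices, not taken from the text.

```python
# Numeric check of the scalar ridge objective, assuming XᵀX = 1 and z = Xᵀy.
def ridge_objective(beta, z, lam):
    # L(beta) = -2*z*beta + (1 + lam)*beta**2, with the constant yᵀy dropped.
    return -2.0 * z * beta + (1.0 + lam) * beta ** 2

z, lam = 0.8, 0.5  # illustrative numbers

# Brute-force grid search over beta in [-2, 2] with step 1e-4.
grid = [i / 10000.0 for i in range(-20000, 20001)]
beta_hat = min(grid, key=lambda b: ridge_objective(b, z, lam))

# Setting dL/dbeta = -2z + 2(1 + lam)*beta = 0 gives beta = z / (1 + lam).
analytic = z / (1.0 + lam)
print(beta_hat, analytic)  # both ≈ 0.5333
```

Note that the analytic minimizer z / (1 + λ) is shrunk toward zero relative to the least-squares solution z, but for any finite λ it is never exactly zero when z ≠ 0; that contrast with Lasso is the point of the sections that follow.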