An Introduction to Regularization Methods

Introduction

Section 1: The Problem with “Best Fit”

Changing Landscape of Data

  • Traditional Research: Small \(N\) (participants), Small \(p\) (variables).
    • Focus: Theory-driven hypothesis testing.
  • The New Reality: “Wide” Data: Log-trace data, dense surveys, genomic markers.
    • High-Dimensional Prediction: \(p\) is large relative to \(N\).
  • The Consequence: Traditional methods like OLS struggle to distinguish signal from noise.

The Status Quo: Ordinary Least Squares (OLS)

  • The Gold Standard: OLS is designed for inference in large samples with few predictors.
  • The Objective: Unbiased estimates with minimum standard error.
  • The Mechanism: Minimizes the Residual Sum of Squares (RSS).
    • It tries to pass a line strictly through the “center” of the data points.

The OLS Trap in High Dimensions

  • Capitalizing on Chance: When predictors (\(p\)) are many, OLS finds spurious relationships by chance.
  • Overfitting: The model “memorizes” the training data rather than learning the underlying pattern.
  • The Symptom: Excellent fit on current data (\(R^2\) is high).
    • Poor performance on new data (Prediction fails).
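
A minimal simulation sketch of this trap (scikit-learn assumed available; the outcome here is pure noise, so any apparent fit is spurious):

```python
# Sketch: OLS fit to pure noise looks excellent in-sample but fails out-of-sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
N, p = 100, 80                        # "wide" data: p is close to N
X_train = rng.standard_normal((N, p))
y_train = rng.standard_normal(N)      # the outcome is pure noise, unrelated to X
X_test = rng.standard_normal((N, p))
y_test = rng.standard_normal(N)

ols = LinearRegression().fit(X_train, y_train)
print("Train R^2:", r2_score(y_train, ols.predict(X_train)))   # high, by chance alone
print("Test  R^2:", r2_score(y_test, ols.predict(X_test)))     # near zero or negative
```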

Visualizing the Problem

  • Left: With only a few predictors, OLS recovers the underlying trend well.
  • Right: As the number of predictors grows, OLS has enough flexibility to chase noise, and simple RSS minimization breaks down.

The Goal: Generalizability

  • Shift in Perspective: Moving from “Is this coefficient significant?” to “Does this model generalize?”.
  • The Trade-off: To improve prediction on future data, we must accept some error on current data.
  • The Solution: We need a mathematical way to constrain the model—to stop it from chasing noise.

Concept

Section 2: The Solution - Regularization

The Bias-Variance Trade-off

  • Bias: Error from erroneous assumptions (e.g., missing a relationship).
  • Variance: Error from sensitivity to small fluctuations in the training set.
  • The Conflict: OLS = Low Bias, High Variance.
    • Regularization = Slightly Higher Bias, Lower Variance.

Introducing “Bias” Intentionally

  • Why Add Bias? By restricting the model, we stabilize the estimates.
    • We prevent the model from reacting to random noise.
  • The Result: A model that is slightly “wrong” on average (biased) but consistently closer to the truth (low variance) across different samples.
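
A minimal simulation sketch of this trade (scikit-learn's Ridge, whose alpha argument plays the role of the penalty strength; the "true" coefficients are assumed for illustration):

```python
# Sketch: across repeated samples, ridge estimates are biased (pulled toward zero)
# but fluctuate less from sample to sample than OLS estimates.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
N, p = 50, 20
beta_true = np.zeros(p)
beta_true[:3] = 1.0                              # assumed truth: three real effects

ols_est, ridge_est = [], []
for _ in range(500):                             # 500 fresh samples from the same population
    X = rng.standard_normal((N, p))
    y = X @ beta_true + rng.standard_normal(N)
    ols_est.append(LinearRegression().fit(X, y).coef_[0])
    ridge_est.append(Ridge(alpha=10.0).fit(X, y).coef_[0])

print("Mean estimate of beta_1:  OLS %.2f   Ridge %.2f" % (np.mean(ols_est), np.mean(ridge_est)))
print("SD of estimate of beta_1: OLS %.2f   Ridge %.2f" % (np.std(ols_est), np.std(ridge_est)))
```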

The Math: Redefining the Cost Function

\[\text{Regularized Cost} = \text{[RSS] or [-LL]} + \text{Penalty}\]

  • RSS or -LL: The standard fit term from OLS or maximum likelihood (Fit the data).
  • Penalty: The constraint term (Keep coefficients small).
  • The Mechanism: The model must “pay” a price for every coefficient it estimates.

The Specifics: Linear Models with OLS

The least squares cost function:

\[\text{RSS} = \sum_{n=1}^{N}\left(Y_n - \left(\beta_0 + \sum_{i=1}^{p} \beta_i X_{ni}\right)\right)^2\]

where \(n\) indexes the \(N\) observations and \(i\) indexes the \(p\) predictors.
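
Written directly in code (a minimal numpy sketch; the function name is just illustrative):

```python
# Sketch: the RSS term, translated line by line from the formula above.
import numpy as np

def rss(y, X, beta0, beta):
    """Residual sum of squares for the linear model y ~ beta0 + X @ beta."""
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2)
```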

The Specifics: Generalized Linear Models with Maximum Likelihood

The ML log-likelihood function:

\[\text{LL} = \sum_{n=1}^{N} \log P(Y_n \mid X_n, \boldsymbol{\beta})\]

  • Here, \(P(Y_n \mid X_n, \boldsymbol{\beta})\) is the probability density (or mass) function of the outcome, given the predictors and coefficients
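
For a concrete case, here is the log-likelihood of logistic regression, where that density/mass function is the Bernoulli probability of the observed outcome (a minimal numpy sketch; names are illustrative):

```python
# Sketch: log-likelihood for one common GLM, logistic regression with binary y.
import numpy as np

def logistic_log_likelihood(y, X, beta0, beta):
    """Sum over observations of log P(Y_n | X_n, beta), with y coded 0/1."""
    eta = beta0 + X @ beta                 # linear predictor
    prob = 1.0 / (1.0 + np.exp(-eta))      # P(Y_n = 1)
    return np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
```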

The Cost Function

\[\text{Regularized Cost} = \text{[RSS] or [-LL]} + \text{Penalty}\]

  • Penalty: A function of the coefficients (\(\beta\)s) that increases as coefficients grow.
  • Types of Penalties:
    • Ridge Regression: \(\lambda \sum_{i=1}^{p} \beta_i^2\) (L2 norm)
    • LASSO: \(\lambda \sum_{i=1}^{p} |\beta_i|\) (L1 norm)
    • Elastic Net: \(\lambda \left(\alpha \sum_{i=1}^{p} |\beta_i| + (1-\alpha) \sum_{i=1}^{p} \beta_i^2\right)\)

NOTE: \(\lambda\) and \(\alpha\) are set by the user (and a range of \(\lambda\) values is typically tested)

  • The Goal: Minimize the Regularized Cost by choosing optimal \(\beta\)s.
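
The penalty terms themselves are simple functions of the coefficient vector; a minimal numpy sketch (the intercept \(\beta_0\) is conventionally left out of the penalty):

```python
# Sketch: the three penalties from the slide, for a given coefficient vector.
import numpy as np

def ridge_penalty(beta, lam):
    return lam * np.sum(beta ** 2)                    # L2: sum of squared coefficients

def lasso_penalty(beta, lam):
    return lam * np.sum(np.abs(beta))                 # L1: sum of absolute coefficients

def elastic_net_penalty(beta, lam, alpha):
    return lam * (alpha * np.sum(np.abs(beta)) + (1 - alpha) * np.sum(beta ** 2))

beta = np.array([0.0, 0.5, -2.0])
print(ridge_penalty(beta, 1.0), lasso_penalty(beta, 1.0), elastic_net_penalty(beta, 1.0, 0.5))
```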

The Tuning Parameter: Lambda (\(\lambda\))

  • Lambda (\(\lambda\)): Controls the strength of the penalty.
  • \(\lambda = 0\): No penalty. The result is identical to OLS.
  • High \(\lambda\): Strong penalty. Coefficients are shrunk heavily toward zero (High Bias).
  • The Goal: Find the “Goldilocks” \(\lambda\)—not too simple, not too complex.
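
A minimal sketch of what turning \(\lambda\) up does to the coefficients (scikit-learn's Ridge is used here; its alpha argument plays the role of \(\lambda\)):

```python
# Sketch: ridge coefficients shrink toward zero as the penalty strength grows.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.standard_normal(100)

for lam in [0.01, 1.0, 10.0, 100.0, 1000.0]:   # near-zero lambda is essentially OLS
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda = {lam:7.2f}  ", np.round(coefs, 2))
```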

Techniques

Section 3: Ridge Regression

Ridge Regression (\(L_2\) Penalty)

  • The Formula: Adds the sum of squared coefficients to the cost function. \[\dots + \lambda \sum \beta^2\]
  • Behavior: Shrinks coefficients toward zero but rarely sets them exactly to zero.
    • “Democratic” shrinkage: It shrinks every coefficient, spreading the reduction across all variables rather than eliminating any of them.

Geometric Intuition: The Circle

  • Constraint Region: \(\sum \beta^2\) creates a circular constraint.
  • The Result: The solution touches the circle but rarely hits the axis (zero) exactly.
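
The "shrink but not to zero" behavior can be made exact in a special case: with standardized, uncorrelated (orthonormal) predictors, the ridge estimate is just the OLS estimate scaled down,

\[\hat{\beta}_i^{\text{ridge}} = \frac{\hat{\beta}_i^{\text{OLS}}}{1 + \lambda},\]

so no finite \(\lambda\) drives a coefficient exactly to zero unless its OLS estimate was already zero.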

When to Use Ridge?

  • Best Use Case: High Multicollinearity.
    • Example: Multiple survey items measuring the same construct.
  • Handling Correlation: OLS would produce unstable estimates with huge standard errors.
    • Ridge shares the credit among correlated predictors, shrinking them together.
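
A minimal simulated sketch of that behavior (two nearly duplicate items; scikit-learn's Ridge, with alpha playing the role of \(\lambda\)):

```python
# Sketch: two nearly identical "items"; OLS splits the credit erratically,
# ridge shrinks the pair toward similar, stable values.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(11)
N = 100
item1 = rng.standard_normal(N)
item2 = item1 + 0.05 * rng.standard_normal(N)        # correlation near 1
X = np.column_stack([item1, item2])
y = item1 + item2 + rng.standard_normal(N)           # both items matter equally

print("OLS:  ", np.round(LinearRegression().fit(X, y).coef_, 2))   # can be wild, even opposite signs
print("Ridge:", np.round(Ridge(alpha=1.0).fit(X, y).coef_, 2))     # close to each other, near the truth
```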

Techniques

Section 4: LASSO Regression

LASSO Regression (\(L_1\) Penalty)

  • The Formula: Adds the sum of absolute coefficients to the cost function. \[\dots + \lambda \sum |\beta|\]
  • Behavior: Forces some coefficients exactly to zero.
    • Performs Variable Selection automatically.

Geometric Intuition: The Diamond

  • Constraint Region: \(\sum |\beta|\) creates a diamond-shaped constraint.
  • The Result: The corners of the diamond hit the axes.
    • Hitting an axis means the coefficient for that variable becomes zero.
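
The same orthonormal-design special case (with the conventional \(\tfrac{1}{2}\,\text{RSS}\) scaling of the cost) makes the corners concrete: the LASSO estimate is the OLS estimate "soft-thresholded",

\[\hat{\beta}_i^{\text{lasso}} = \operatorname{sign}\!\left(\hat{\beta}_i^{\text{OLS}}\right)\left(\left|\hat{\beta}_i^{\text{OLS}}\right| - \lambda\right)_{+},\]

so any coefficient whose OLS estimate is smaller in magnitude than \(\lambda\) is set exactly to zero.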

The “Bet on Sparsity”

  • Assumption: The underlying truth is “sparse”—only a few variables truly matter.
  • Reality in Social Science: Data is often dense (everything correlates with everything).
  • Functional Sparsity: We prioritize a parsimonious model we can actually interpret.

Example: Personality Data

  • Context: Predicting an outcome using Big Five personality items (\(p=25\)) + Education.
  • Result: LASSO sets nearly half the predictors to zero.
  • Benefit: Filters the signal from the noise, leaving a cleaner model.
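
A simulated stand-in for this kind of result (not the actual personality data; 26 predictors, of which only a few are assumed to matter):

```python
# Sketch: LASSO on simulated "wide survey" data; many coefficients end up exactly zero.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
N, p = 300, 26                                   # e.g., 25 items + 1 covariate
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:4] = [0.5, 0.4, -0.3, 0.3]            # assumed sparse truth
y = X @ beta_true + rng.standard_normal(N)

X_std = StandardScaler().fit_transform(X)        # standardize before penalizing
lasso = LassoCV(cv=10).fit(X_std, y)
print("Nonzero coefficients:", int(np.sum(lasso.coef_ != 0)), "of", p)
```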

Implementation

Section 5: Practical Application

Prerequisite: Standardization

  • The Problem: The penalty treats every coefficient on the same scale, but coefficient size depends on measurement units: large-scale variables (Income in dollars) get tiny coefficients, while small-scale variables (GPA) get large ones.
  • The Consequence: Without scaling, the penalty unfairly crushes variables with small natural scales.
  • The Fix: Standardize (Center and Scale) all predictors to Mean=0, SD=1. Categorical variables must be dummy coded before standardization.
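
A minimal scikit-learn sketch (the variables and their scales are invented for illustration); putting the scaler inside a pipeline keeps the scaling honest during cross-validation:

```python
# Sketch: standardize predictors so the penalty judges them on a common scale.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
income = rng.normal(50_000, 15_000, 200)        # large natural scale
gpa = rng.normal(3.0, 0.4, 200)                 # small natural scale
X = np.column_stack([income, gpa])
y = 0.00002 * income + 1.0 * gpa + rng.standard_normal(200)

model = make_pipeline(StandardScaler(), LassoCV(cv=10)).fit(X, y)
print(model.named_steps["lassocv"].coef_)       # coefficients on the standardized scale
```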

Choosing Lambda: Cross-Validation

  • Process:
    1. Split data into folds (e.g., 10-fold CV).
    2. Test a sequence of \(\lambda\) values.
    3. Calculate Mean Squared Error (MSE) for each \(\lambda\).
  • Visualization: The “CV Path” shows error changing as complexity changes.
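
A minimal scikit-learn sketch of this process (LassoCV builds the \(\lambda\) grid, runs the folds, and stores the per-fold errors; the simulated data is just a placeholder):

```python
# Sketch: 10-fold cross-validation over a sequence of lambda values.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X_std = StandardScaler().fit_transform(X)

cv_fit = LassoCV(cv=10).fit(X_std, y)
mean_mse = cv_fit.mse_path_.mean(axis=1)         # mean MSE per candidate lambda (the CV path)
print("Candidate lambdas tested:", len(cv_fit.alphas_))
print("Lambda minimizing CV error:", round(cv_fit.alpha_, 4))
```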

The “One Standard Error Rule”

  • Min MSE: The \(\lambda\) value that minimizes the cross-validated error (MSE).
  • 1SE Rule: Select the most parsimonious model (higher \(\lambda\)) whose error is within 1 Standard Error of the minimum.
  • Why? Err on the side of simplicity. If a simpler model is statistically indistinguishable from the complex one, pick the simple one.
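
A minimal sketch of the rule itself, written against generic arrays of per-\(\lambda\) CV results (the helper name is illustrative; with a fitted LassoCV, the mean and SE can be computed from mse_path_):

```python
# Sketch of the 1SE rule: among all lambdas whose CV error is within one standard
# error of the minimum, pick the largest (i.e., most parsimonious) one.
import numpy as np

def lambda_1se(lambdas, cv_mean, cv_se):
    """lambdas, cv_mean, cv_se: 1-D arrays with one entry per candidate lambda."""
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    return lambdas[cv_mean <= threshold].max()

# With a fitted LassoCV object `fit` (mse_path_ has one column per fold):
#   cv_mean = fit.mse_path_.mean(axis=1)
#   cv_se   = fit.mse_path_.std(axis=1) / np.sqrt(fit.mse_path_.shape[1])
#   lam_1se = lambda_1se(fit.alphas_, cv_mean, cv_se)
```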

Advanced Methods

Section 6: Extensions

Elastic Net: Best of Both Worlds

  • The Limitation of LASSO: If variables are highly correlated, LASSO picks one arbitrarily and drops the rest.
  • The Solution: Elastic Net mixes L1 (Lasso) and L2 (Ridge) penalties.
    • \(\alpha\) parameter controls the mix (\(\alpha=0.5\) is equal mix).
  • Outcome: Performs selection (Lasso) but keeps correlated groups together (Ridge).
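
A minimal scikit-learn sketch (its l1_ratio argument plays the role of \(\alpha\) on this slide, with 1 = pure LASSO and 0 = pure ridge; the data is simulated):

```python
# Sketch: ElasticNetCV searches over both the penalty strength and the L1/L2 mix.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=1)
X_std = StandardScaler().fit_transform(X)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10).fit(X_std, y)
print("Chosen mix (l1_ratio):", enet.l1_ratio_, "  Chosen lambda:", round(enet.alpha_, 4))
```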

Stability Selection

  • The Problem: Variable selection can be unstable. Changing the data split slightly changes which variables are picked.
  • The Fix:
    1. Subsample data 1,000 times.
    2. Run selection on each subsample.
    3. Keep variables with high Selection Probability.
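
A minimal sketch of the idea (plain LASSO refit on random half-samples; the penalty value and the 0.8 cutoff are illustrative choices, and the data is simulated):

```python
# Sketch: stability selection by hand; count how often each predictor survives.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, p = 200, 30
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:5] = [1.0, 1.0, 0.8, 0.8, 0.6]         # assumed truth: five real effects
y = X @ beta_true + rng.standard_normal(N)

n_reps, hits = 1000, np.zeros(p)
for _ in range(n_reps):
    idx = rng.choice(N, size=N // 2, replace=False)          # random half-sample
    hits += Lasso(alpha=0.2).fit(X[idx], y[idx]).coef_ != 0  # who got selected?

selection_prob = hits / n_reps
print("Predictors with selection probability >= 0.8:", np.where(selection_prob >= 0.8)[0])
```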

Hierarchical LASSO: Interactions

  • Context: Testing moderation (e.g., Treatment \(\times\) Prior Knowledge).
  • Challenge: With many predictors, checking all interactions is impossible manually.
  • Solution: Hierarchical LASSO automates interaction search.
  • Strong Hierarchy: An interaction is only “unlocked” if its main effects are also selected.
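
Dedicated hierarchical LASSO solvers build the hierarchy constraint into the optimization itself; as a rough illustration of the strong-hierarchy rule only (not the actual estimator), the sketch below builds interaction columns, runs an ordinary LASSO, and then applies the rule:

```python
# Rough illustration of strong hierarchy only: real hierarchical LASSO enforces
# this constraint inside the optimization rather than filtering after the fact.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(4)
N, p = 300, 6
X = rng.standard_normal((N, p))
y = X[:, 0] + X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + rng.standard_normal(N)

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit(X)
names = list(poly.get_feature_names_out([f"x{i}" for i in range(p)]))   # "x0", ..., "x0 x1", ...
Z = StandardScaler().fit_transform(poly.transform(X))

coef = LassoCV(cv=10).fit(Z, y).coef_
selected = {n for n, c in zip(names, coef) if c != 0}

# Strong hierarchy: keep an interaction "xi xj" only if "xi" and "xj" are both selected.
kept = [n for n in selected if " " not in n or all(m in selected for m in n.split(" "))]
print(sorted(kept))
```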

Group LASSO

  • Context: Categorical variables (e.g., Race/Ethnicity dummy codes) or psychometric scales.
  • Problem: Standard LASSO might drop “Hispanic” but keep “Asian,” breaking the variable structure.
  • Solution: Applies penalty to the Group of variables. They are either all included or all dropped.
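
The all-or-nothing behavior comes from penalizing each group's coefficients through their joint (Euclidean) norm. A minimal numpy sketch of the resulting group soft-thresholding step, the building block inside group-LASSO solvers (the function name and example values are illustrative):

```python
# Sketch: group soft-thresholding; a group's coefficients are shrunk together,
# and the whole group is zeroed out if its joint norm is below the threshold.
import numpy as np

def group_soft_threshold(beta_group, lam):
    norm = np.linalg.norm(beta_group)
    if norm <= lam:
        return np.zeros_like(beta_group)         # drop the entire group
    return (1 - lam / norm) * beta_group         # keep the group, shrunk as a unit

dummy_block = np.array([0.30, -0.10, 0.20])      # e.g., a set of dummy codes for one variable
print(group_soft_threshold(dummy_block, lam=0.5))   # all zeros: group dropped
print(group_soft_threshold(dummy_block, lam=0.1))   # all nonzero: group kept
```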

Conclusion: A New Standard

  • Embrace Complexity: Regularization allows us to analyze high-dimensional data without being fooled by it.
  • Filter Signal from Noise: It provides a principled way to reduce complex data to interpretable models.
  • The Future: Moving towards robust, generalizable prediction in the Learning Sciences.