L1 vs. L2 Regularization: What's the Difference?

Regularization is one of those topics that seems simple on the surface — “just add a penalty term to the loss” — but the specific shape of that penalty has profound and quite different consequences depending on whether you choose L1 or L2. In this article, I’ll walk through both from first principles, show you the math and the geometry behind why they behave so differently, and help you decide which one fits your situation.

The Basic Idea of Regularization

When training a model, we minimize a loss function $J(\theta)$ that measures prediction error on the training data. Left unconstrained, a sufficiently flexible model (like a large neural network) can drive this training loss arbitrarily close to zero by fitting noise as well as signal — a recipe for overfitting.

Regularization adds a penalty term to the loss that discourages overly complex solutions, typically by discouraging large parameter values:

$$J_{regularized}(\theta) = J(\theta) + \lambda R(\theta)$$

where $R(\theta)$ is the regularization term and $\lambda$ is a hyperparameter controlling how strongly it’s enforced. L1 and L2 regularization differ in exactly how $R(\theta)$ is defined.
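As a minimal sketch of this formula, assume a tiny linear model with a mean-squared-error data loss. The function names and toy data below are illustrative choices, not part of any library:

```python
def mse_loss(theta, xs, ys):
    """Data loss J(theta): mean squared error of a linear model."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(t * xi for t, xi in zip(theta, x))
        total += (pred - y) ** 2
    return total / len(xs)


def l1_penalty(theta):
    """R(theta) for L1: sum of absolute values."""
    return sum(abs(t) for t in theta)


def l2_penalty(theta):
    """R(theta) for L2: sum of squares."""
    return sum(t * t for t in theta)


def regularized_loss(theta, xs, ys, lam, penalty):
    """J_regularized(theta) = J(theta) + lam * R(theta)."""
    return mse_loss(theta, xs, ys) + lam * penalty(theta)


# Toy data chosen so that theta fits it exactly (data loss is zero),
# which leaves only the penalty term visible in the result.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [3.0, -1.0]
theta = [3.0, -1.0]

print(regularized_loss(theta, xs, ys, 0.1, l1_penalty))  # lam * (|3| + |-1|)
print(regularized_loss(theta, xs, ys, 0.1, l2_penalty))  # lam * (9 + 1)
```

The same weights are charged differently by the two penalties: the larger weight dominates the L2 term much more than the L1 term.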

L2 Regularization (Ridge / Weight Decay)

L2 regularization adds a penalty proportional to the sum of squared weights:

$$R_{L2}(\theta) = \sum_{i} \theta_i^2 = \|\theta\|_2^2$$

$$J_{L2}(\theta) = J(\theta) + \lambda \sum_i \theta_i^2$$

Taking the gradient of this penalty with respect to a single weight $\theta_i$ gives:

$$\frac{\partial R_{L2}}{\partial \theta_i} = 2\theta_i$$

So the gradient update becomes:

$$\theta_i \leftarrow \theta_i - \eta \left( \frac{\partial J}{\partial \theta_i} + 2\lambda \theta_i \right) = \theta_i(1 - 2\eta\lambda) - \eta \frac{\partial J}{\partial \theta_i}$$

Notice the term $\theta_i (1 - 2\eta\lambda)$: on every update, the weight is multiplicatively shrunk toward zero by a small factor, in addition to the usual gradient-based update. This is exactly why L2 regularization is often called weight decay: every weight decays a little on every step, in proportion to its own current magnitude.
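To see the decay in isolation, here is a small sketch that applies the L2 update to a single weight while holding the data gradient at zero. The learning rate, $\lambda$, and starting weight are arbitrary illustrative values:

```python
def l2_step(theta, grad_data, lr, lam):
    """One gradient step on J + lam * theta**2 for a single weight."""
    return theta * (1 - 2 * lr * lam) - lr * grad_data


lr, lam = 0.1, 0.5   # per-step shrink factor: 1 - 2 * 0.1 * 0.5 = 0.9
theta = 10.0
history = [theta]
for _ in range(3):
    # Zero data gradient, so only the decay term acts on the weight.
    theta = l2_step(theta, grad_data=0.0, lr=lr, lam=lam)
    history.append(theta)

print(history)  # each entry is roughly 0.9 times the previous one
```

The weight shrinks geometrically: the absolute amount removed per step gets smaller as the weight gets smaller, so it approaches zero without reaching it.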

L1 Regularization (Lasso)

L1 regularization adds a penalty proportional to the sum of absolute values of the weights:

$$R_{L1}(\theta) = \sum_i |\theta_i| = \|\theta\|_1$$

$$J_{L1}(\theta) = J(\theta) + \lambda \sum_i |\theta_i|$$

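For $\theta_i \neq 0$, the gradient of the absolute value function is the sign of the weight (constant magnitude, regardless of how large the weight currently is). At $\theta_i = 0$ the penalty is not differentiable, and any value in $[-1, 1]$ is a valid subgradient: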

$$\frac{\partial R_{L1}}{\partial \theta_i} = \text{sign}(\theta_i)$$

So the update becomes:

$$\theta_i \leftarrow \theta_i - \eta \left( \frac{\partial J}{\partial \theta_i} + \lambda \, \text{sign}(\theta_i) \right)$$

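Unlike L2’s proportional shrinkage, L1 subtracts a constant amount from each weight on every step (in the direction that pushes it toward zero), regardless of the weight’s current magnitude. This constant, non-proportional pressure is what causes L1’s most distinctive behavior: the regularized optimum can place small weights at exactly zero, effectively removing them from the model entirely. One practical caveat is that a plain subgradient step like the one above tends to overshoot and hover in a small band around zero rather than land on it; exact zeros usually come from a proximal (soft-thresholding) or coordinate-descent solver.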
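The sketch below contrasts the plain update rule with a proximal (soft-thresholding) step, again with the data gradient held at zero so only the penalty acts. The proximal variant is a standard alternative (as used in ISTA-style methods), not something this article's update rule prescribes, and the step sizes are illustrative:

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def l1_subgradient_step(theta, grad_data, lr, lam):
    """Plain step: subtract lr * lam in the direction of sign(theta)."""
    return theta - lr * (grad_data + lam * sign(theta))


def soft_threshold(x, tau):
    """Shrink x toward zero by tau, clipping to exactly zero inside [-tau, tau]."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def l1_proximal_step(theta, grad_data, lr, lam):
    """Gradient step on the data loss, then soft-threshold by lr * lam."""
    return soft_threshold(theta - lr * grad_data, lr * lam)


lr, lam = 0.1, 1.0
sub, prox = 0.25, 0.25
for step in range(5):
    sub = l1_subgradient_step(sub, 0.0, lr, lam)
    prox = l1_proximal_step(prox, 0.0, lr, lam)
    print(step, sub, prox)
```

In this run the subgradient iterate ends up flipping sign around zero, while the proximal iterate reaches exactly 0.0 and stays there.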

The Key Difference: Sparsity

This is the single most important distinction between the two:

L1 regularization tends to produce sparse solutions — many weights become exactly zero, effectively performing automatic feature selection.
L2 regularization tends to shrink all weights smoothly toward small values, but rarely drives them to exactly zero.

Geometric Intuition

A classic way to visualize the difference is to think of the regularization term as a constraint region and the loss function’s contours as ellipses centered on the unregularized optimum.

The L2 penalty constraint region is a circle (or hypersphere in higher dimensions): $\theta_1^2 + \theta_2^2 \le c$.
The L1 penalty constraint region is a diamond (or hyperoctahedron in higher dimensions): $|\theta_1| + |\theta_2| \le c$.

When you find the point where the loss contours first touch the constraint region, the sharp corners of the L1 diamond sit exactly on the coordinate axes. This geometric property makes it much more likely that the optimal solution touches the constraint region precisely at a corner — where one or more coordinates are exactly zero. The smooth, round L2 circle has no such corners, so the optimal touching point almost never lands exactly on an axis; instead, it shrinks all coordinates toward small but nonzero values.
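The corner effect can be checked numerically. This sketch assumes the simplest possible loss, one with circular contours (squared distance to the unregularized optimum), takes $c = 1$, and grid-searches the boundary of each constraint region for the closest point. The optimum location and grid resolution are arbitrary illustrative choices:

```python
import math


def l1_norm(v):
    return abs(v[0]) + abs(v[1])


def l2_norm(v):
    return math.hypot(v[0], v[1])


def best_on_boundary(target, norm, n=3600):
    """Grid-search the boundary {theta : norm(theta) = 1} for the point closest to target."""
    best, best_dist = None, float("inf")
    for k in range(n):
        angle = 2 * math.pi * k / n
        direction = (math.cos(angle), math.sin(angle))
        scale = 1.0 / norm(direction)  # rescale the direction onto the boundary
        point = (direction[0] * scale, direction[1] * scale)
        dist = (point[0] - target[0]) ** 2 + (point[1] - target[1]) ** 2
        if dist < best_dist:
            best, best_dist = point, dist
    return best


unregularized_optimum = (2.0, 0.5)  # lies outside both constraint regions
l1_point = best_on_boundary(unregularized_optimum, l1_norm)
l2_point = best_on_boundary(unregularized_optimum, l2_norm)
print("L1 diamond:", l1_point)
print("L2 circle: ", l2_point)
```

On the diamond, the closest point is the corner on the $\theta_1$ axis, so $\theta_2$ is exactly zero. On the circle, both coordinates stay nonzero.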

Side-by-Side Comparison

Property	L1 Regularization	L2 Regularization
Penalty term	$\sum_i \lvert \theta_i \rvert$	$\sum_i \theta_i^2$
Gradient of penalty	Constant ($\text{sign}(\theta_i)$)	Proportional to $\theta_i$
Effect on weights	Drives many weights to exactly zero	Shrinks all weights smoothly, rarely to zero
Resulting model	Sparse (fewer active features)	Dense (all features retained, but small)
Feature selection	Yes, automatic	No
Treatment of large weights	Linear penalty; tolerates a few large weights	Quadratic penalty; penalizes large weights very heavily
Solution uniqueness	Can have multiple equally optimal sparse solutions	Typically a unique, smooth solution
Common name	Lasso	Ridge / weight decay
Differentiability at zero	Not differentiable at $\theta_i = 0$	Fully differentiable everywhere

The Bayesian Interpretation: Priors on Weights

There’s an elegant probabilistic way to understand both penalties, which also explains why they produce such different behavior. Regularization can be viewed as placing a prior distribution on the model’s weights and finding the maximum a posteriori (MAP) estimate rather than the pure maximum likelihood estimate.

L2 regularization corresponds to a Gaussian prior on each weight: $\theta_i \sim \mathcal{N}(0, \sigma^2)$. The negative log of a Gaussian density is, up to an additive constant, proportional to $\theta_i^2$, exactly matching the L2 penalty form. A Gaussian prior believes weights are most likely to be close to zero, with density falling off smoothly as they grow larger in either direction. Its log-density is smooth and flat at zero, so nothing singles out zero as a landing point for the MAP estimate, which is consistent with L2 rarely producing exact zeros.
L1 regularization corresponds to a Laplace prior on each weight: $\theta_i \sim \text{Laplace}(0, b)$. The negative log of a Laplace density is, up to an additive constant, proportional to $|\theta_i|$, matching the L1 penalty exactly. Crucially, the Laplace density has a sharp peak (technically, a non-differentiable point) at zero. It is still a continuous distribution, so it places no probability mass exactly at zero, but the kink means the pull toward zero does not fade as a weight gets small, and the MAP estimate can sit exactly at zero. This is the probabilistic root of why L1 produces genuinely sparse solutions.

This Bayesian framing is more than just a mathematical curiosity — it clarifies why the two penalties behave so differently at a deeper level than the optimization mechanics alone.
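To make the correspondence concrete, here is a sketch of the MAP derivation. It assumes independent priors on each weight and identifies the data loss with the negative log-likelihood, $J(\theta) = -\log p(D \mid \theta)$; that identification, and the resulting expressions for $\lambda$, are assumptions of this sketch rather than something stated above.

$$\hat{\theta}_{MAP} = \arg\max_\theta \; p(D \mid \theta) \prod_i p(\theta_i) = \arg\min_\theta \left[ J(\theta) - \sum_i \log p(\theta_i) \right]$$

For the Gaussian prior:

$$-\log p(\theta_i) = \frac{\theta_i^2}{2\sigma^2} + \text{const} \quad \Rightarrow \quad \hat{\theta}_{MAP} = \arg\min_\theta \left[ J(\theta) + \frac{1}{2\sigma^2} \sum_i \theta_i^2 \right], \qquad \lambda = \frac{1}{2\sigma^2}$$

For the Laplace prior, with density $p(\theta_i) = \frac{1}{2b} e^{-|\theta_i| / b}$:

$$-\log p(\theta_i) = \frac{|\theta_i|}{b} + \text{const} \quad \Rightarrow \quad \hat{\theta}_{MAP} = \arg\min_\theta \left[ J(\theta) + \frac{1}{b} \sum_i |\theta_i| \right], \qquad \lambda = \frac{1}{b}$$

In both cases a tighter prior (smaller $\sigma$ or $b$) corresponds to a larger $\lambda$.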

What Happens as $\lambda$ Increases: A Walkthrough

It helps to trace through what happens to a simple two-weight model as you gradually increase the regularization strength $\lambda$ for each penalty type:

$\lambda$	Effect under L2	Effect under L1
0 (no regularization)	Weights sit at the unregularized optimum	Weights sit at the unregularized optimum
Small	All weights shrink slightly, none reach zero	Least useful weights start approaching zero
Moderate	All weights shrink further, still nonzero	Several weights become exactly zero; remaining weights adjust to compensate
Large	Weights shrink dramatically toward (but not exactly) zero	Most weights become exactly zero; only the most predictive few remain
Very large	All weights nearly zero, severe underfitting	All weights exactly zero, model predicts a constant
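The pattern in this table can be reproduced with closed-form solutions in a simplified setting. The sketch assumes a separable quadratic data loss, $J(\theta) = \frac{1}{2}\sum_i (\theta_i - a_i)^2$, where $a_i$ is the unregularized value of weight $i$ (this is what an orthonormal design gives in linear regression). The weights and $\lambda$ values are hypothetical:

```python
def ridge_solution(a, lam):
    """argmin over t of 0.5 * (t - a)**2 + lam * t**2."""
    return a / (1 + 2 * lam)


def lasso_solution(a, lam):
    """argmin over t of 0.5 * (t - a)**2 + lam * abs(t), i.e. soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


def zero_counts(unregularized, lam):
    """Count exact zeros in the L2 and L1 solutions for a given lam."""
    ridge = [ridge_solution(a, lam) for a in unregularized]
    lasso = [lasso_solution(a, lam) for a in unregularized]
    return sum(w == 0.0 for w in ridge), sum(w == 0.0 for w in lasso)


unregularized = [3.0, -1.5, 0.4, 0.05]  # hypothetical unregularized weights

for lam in [0.0, 0.1, 0.5, 2.0, 10.0]:
    ridge_zeros, lasso_zeros = zero_counts(unregularized, lam)
    print(f"lam={lam:<5} L2 zeros={ridge_zeros}  L1 zeros={lasso_zeros}")
```

Under L2 every weight is divided by the same factor and none hits zero. Under L1 a weight is zeroed as soon as $\lambda$ exceeds its unregularized magnitude, so the weakest weights drop out first.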

Correlated Features: Where the Two Penalties Diverge Most

One of the most practically important differences between L1 and L2 shows up when your input features are highly correlated with each other. Suppose two features are nearly identical (highly correlated) and both are genuinely predictive of the target.

Under L2, the penalty tends to distribute the weight roughly evenly across both correlated features, since $\theta_1^2 + \theta_2^2$ is minimized (for a fixed sum $\theta_1 + \theta_2$) when $\theta_1 = \theta_2$. This produces a stable, well-conditioned solution that doesn’t depend sensitively on which specific correlated feature “wins.”
Under L1, the penalty is indifferent between concentrating all the weight on one feature or splitting it between both, since $|\theta_1| + |\theta_2|$ stays the same either way as long as their sum is fixed and both have the same sign. This means L1 can arbitrarily pick one correlated feature and zero out the other, which can make the resulting sparse solution somewhat unstable or sensitive to small changes in the data or random initialization.

This is precisely the motivation behind Elastic Net, discussed next — it combines L2’s stability under correlated features with L1’s sparsity-inducing behavior.
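The split argument above is easy to check numerically. This sketch fixes a hypothetical combined weight that two near-duplicate features must carry between them and evaluates both penalties for different ways of dividing it:

```python
def l1_pair_penalty(t1, t2):
    return abs(t1) + abs(t2)


def l2_pair_penalty(t1, t2):
    return t1 ** 2 + t2 ** 2


total = 2.0  # hypothetical fixed sum theta_1 + theta_2
splits = []
for share in [0.0, 0.25, 0.5, 0.75, 1.0]:
    t1, t2 = share * total, (1 - share) * total
    splits.append((share, l1_pair_penalty(t1, t2), l2_pair_penalty(t1, t2)))
    print(f"split=({t1:.1f}, {t2:.1f})  L1={splits[-1][1]:.2f}  L2={splits[-1][2]:.2f}")
```

The L1 penalty is identical for every split, so nothing in the penalty favors one over another. The L2 penalty is smallest at the even split and largest when one feature carries everything.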

Elastic Net: Combining Both

Since L1 and L2 each have distinct strengths, Elastic Net regularization combines them in a single weighted penalty:

$$J_{Elastic}(\theta) = J(\theta) + \lambda_1 \sum_i |\theta_i| + \lambda_2 \sum_i \theta_i^2$$

This gives you some of L1’s sparsity-inducing behavior along with L2’s smoother, more stable shrinkage — often useful when you have many correlated features, a scenario where pure L1 can behave somewhat erratically (arbitrarily picking one feature among a correlated group and zeroing out the rest).
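How the two terms interact can be seen in a one-weight sketch. It assumes a simple quadratic data loss $\frac{1}{2}(\theta - a)^2$, for which the Elastic Net minimizer has a closed form: soft-threshold by $\lambda_1$, then rescale by $1 / (1 + 2\lambda_2)$. The input weights and penalty strengths are illustrative:

```python
def elastic_net_solution(a, lam1, lam2):
    """argmin over t of 0.5 * (t - a)**2 + lam1 * abs(t) + lam2 * t**2."""
    if a > lam1:
        shrunk = a - lam1
    elif a < -lam1:
        shrunk = a + lam1
    else:
        shrunk = 0.0  # L1 part: inputs within [-lam1, lam1] are zeroed out
    return shrunk / (1 + 2 * lam2)  # L2 part: surviving weights are scaled down


unregularized = [3.0, -1.5, 0.4, 0.05]  # hypothetical unregularized weights
solution = [elastic_net_solution(a, lam1=0.5, lam2=0.25) for a in unregularized]
print(solution)
```

The L1 term decides which weights survive, and the L2 term applies additional proportional shrinkage to the survivors.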

Visualizing the Two Penalty Shapes

flowchart LR
    subgraph L1["L1 Penalty (Diamond Constraint)"]
        A1[Sharp corners on axes] --> A2[Solutions often land exactly on axes] --> A3[Produces sparse weights]
    end
    subgraph L2["L2 Penalty (Circular Constraint)"]
        B1[Smooth, round boundary] --> B2[Solutions rarely land on axes] --> B3[Produces small but nonzero weights]
    end

Implementing L1 and L2 in Code

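Manually, in a training loop (NumPy):

import numpy as np

def compute_gradient_with_l2(grad_data, theta, lam):
    # Gradient of J + lam * sum(theta**2): the penalty contributes 2 * lam * theta
    return grad_data + 2 * lam * theta

def compute_gradient_with_l1(grad_data, theta, lam):
    # Subgradient of J + lam * sum(|theta|); np.sign(0) is 0, a valid subgradient at zero
    return grad_data + lam * np.sign(theta)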

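In PyTorch, L2 regularization is built directly into most optimizers via the weight_decay argument. SGD adds weight_decay times the weight to the gradient, so weight_decay plays the role of $2\lambda$ in the notation used above: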

import torch

model = torch.nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # L2

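L1 regularization isn’t built in the same way and is typically added manually to the loss. In the snippet below, criterion, output, and target come from the surrounding training loop, and the penalty as written covers every parameter, including biases: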

l1_lambda = 1e-4
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = criterion(output, target) + l1_lambda * l1_penalty
loss.backward()

In Keras, both are available directly on layers:

from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2, l1_l2

Dense(64, activation='relu', kernel_regularizer=l2(0.001))       # L2
Dense(64, activation='relu', kernel_regularizer=l1(0.001))       # L1
Dense(64, activation='relu', kernel_regularizer=l1_l2(l1=0.001, l2=0.001))  # Elastic Net

Advantages and Disadvantages

L1 Regularization

Advantages:

Produces sparse models, useful for automatic feature selection and interpretability.
Can reduce model size and inference cost when many weights become exactly zero.
More robust to irrelevant or noisy features, since it can eliminate them entirely.

Disadvantages:

Non-differentiable at zero, which requires special optimization handling (e.g., subgradient methods or proximal operators).
Can behave unpredictably with strongly correlated features, arbitrarily selecting one and zeroing out others.
Sparse solutions aren’t always desirable, especially when many small-but-relevant features genuinely contribute to prediction.

L2 Regularization

Advantages:

Smooth, fully differentiable, and easy to optimize with standard gradient-based methods.
Tends to produce more stable, well-conditioned solutions, particularly with correlated features.
Directly corresponds to a Gaussian prior on weights in a Bayesian interpretation, which is often a reasonable default assumption.

Disadvantages:

Doesn’t perform feature selection; all features remain in the model, just with smaller weights.
Can still leave the model larger and less interpretable than an equivalent L1-regularized model.
Sensitive to feature scaling — poorly scaled inputs can cause the penalty to disproportionately affect certain weights.

Real-World Use Cases

L1: Used in high-dimensional settings like genomics (identifying a small subset of relevant genes among thousands), sparse linear models, and compressed sensing applications where interpretability and feature selection matter.
L2: The default choice in most deep learning contexts (as “weight decay”), used broadly across CNNs, transformers, and virtually all standard neural network training to prevent large weight magnitudes without discarding features.
Elastic Net: Common in situations with many correlated predictors, such as certain genomics and econometrics applications, where you want some sparsity but also stability against correlated features.

Best Practices

Default to L2 (weight decay) for most deep learning tasks; it’s simpler to optimize and generally more stable.
Reach for L1 specifically when you want automatic feature selection or a sparse, more interpretable model.
Consider Elastic Net when your features are highly correlated and pure L1 feels unstable or arbitrary in what it selects.
Always tune the regularization strength $\lambda$ using a validation set — too large a value causes underfitting, too small provides negligible regularization benefit.
Standardize or normalize your input features before applying L1/L2 regularization, since penalty terms are sensitive to the scale of the weights, which in turn depends on the scale of the inputs.
Remember that L2 regularization and “weight decay” are mathematically equivalent for standard SGD, but subtly different when combined with adaptive optimizers like Adam — use AdamW if you want the cleaner, decoupled version of weight decay.
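The last point can be illustrated with a scalar sketch of Adam. This is a simplified, hand-written version of the standard Adam update, not PyTorch's implementation. The wd parameter follows the common convention of multiplying the weight directly (so it corresponds to $2\lambda$ in this article's notation), and the data gradient is fixed at zero to isolate the decay term:

```python
import math


def run_adam(theta, grad_fn, steps, lr=0.01, wd=0.01, decoupled=False,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam with either L2-in-gradient or decoupled weight decay."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        if decoupled:
            theta *= 1 - lr * wd      # AdamW-style: decay applied directly to the weight
        else:
            g += wd * theta           # L2-style: decay folded into the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def zero_grad(theta):
    """Data gradient held at zero so only the decay term acts."""
    return 0.0


shrinkage = {}
for start in [10.0, 0.1]:
    after_l2 = run_adam(start, zero_grad, steps=1, decoupled=False)
    after_decoupled = run_adam(start, zero_grad, steps=1, decoupled=True)
    shrinkage[start] = (start - after_l2, start - after_decoupled)
    print(start, shrinkage[start])
```

With the L2 term inside the gradient, Adam's normalization cancels the weight's magnitude, so both weights shrink by roughly lr on the first step. With decoupled decay, the shrinkage stays proportional to the weight, which is the behavior the name weight decay suggests.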

Frequently Asked Questions

Can I use L1 and L2 regularization together? Yes — this is exactly what Elastic Net does, combining both penalties with separate strength hyperparameters. It’s particularly useful when you want some sparsity but also want to guard against the instability L1 alone can exhibit with correlated features.

Does L1 always produce a sparser model than L2, no matter the setting? In the vast majority of practical cases, yes — this is essentially the defining characteristic of the L1 penalty, rooted in the geometry and gradient behavior described earlier. There are edge cases (e.g., extremely small regularization strength, or unusual loss landscapes) where the difference in sparsity may be negligible, but as a general rule, L1’s sparsity-inducing property is highly reliable.

Why is L2 called “weight decay” in deep learning but not usually called that in classical statistics? The term “weight decay” specifically describes the effect of L2 regularization on the gradient descent update rule — the multiplicative shrinkage applied to weights on every step. In classical statistics and simpler linear models, L2 regularization is more often described in terms of its role in the loss function itself (as “ridge regression”) rather than its effect on an iterative optimization procedure, since many classical methods solve for the regularized solution in closed form rather than through gradient-based iteration.

How do I choose between L1, L2, and Elastic Net for a specific problem? Consider what you value most: if interpretability and automatic feature selection matter most, lean toward L1. If you have many correlated features and want a stable solution, lean toward L2 or Elastic Net. If you’re working with a standard deep neural network without a strong need for sparsity, L2 (weight decay) is almost always the simpler and more standard default.

Summary

L1 and L2 regularization both discourage overly large weights, but they do so in fundamentally different geometric ways: L2 applies a smooth, proportional shrinkage that rarely zeroes out weights entirely, while L1 applies constant pressure that frequently drives weights to exactly zero, producing sparse, more interpretable models. Neither is universally “better” — the right choice depends on whether you want automatic feature selection (L1), smoother and more stable optimization (L2), or a hybrid of both (Elastic Net). Understanding the underlying math and geometry, rather than just treating them as interchangeable “add a penalty” tricks, will help you choose the right one for your specific modeling problem.

References and Further Reading

Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society.
Hoerl, A. E., & Kennard, R. W. (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics.
Zou, H., & Hastie, T. (2005). “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). “Deep Learning,” Chapter 7: Regularization for Deep Learning.
Scikit-learn regularization documentation: https://scikit-learn.org/stable/modules/linear_model.html
