What is the difference between L1 and L2 regularization?

From Quora https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization

Justin Solomon has a great answer on the difference between L1 and L2 norms and the implications for regularization.

ℓ1 vs ℓ2 for signal estimation:
A signal that is sparse or approximately sparse, i.e. one that belongs to the ℓ1 ball, has only a few large coefficients. It is extremely unlikely that an ℓ2 penalty can recover a sparse signal, since very few minimizers of such a cost function are truly sparse. ℓ1 penalties, on the other hand, are great for recovering truly sparse signals: they are computationally tractable yet still capable of recovering the exact sparse solution. ℓ2 penalization is preferable for data that is not at all sparse, i.e. where you do not expect the regression coefficients to show a decaying property. In such cases, incorrectly using an ℓ1 penalty on non-sparse data will give you a large estimation error.

Figure: ℓp balls. As the value of p decreases, the size of the corresponding ℓp ball also decreases. This can be seen visually by comparing, in three dimensions, the sets of signals whose ℓp norm is less than or equal to one: the volume of these ℓp “balls” decreases with p. [2]
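To make the estimation-error point concrete, here is a minimal sketch (my illustration, not part of the Quora answer) using scikit-learn's Lasso (ℓ1) and Ridge (ℓ2) on a synthetic, truly sparse signal; the dimensions and penalty strengths are arbitrary choices.

# Sketch: l1 (Lasso) vs l2 (Ridge) estimation of a truly sparse signal.
# Sizes and penalty strengths are arbitrary, illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, k = 100, 200, 5                     # n samples, p predictors, k non-zeros
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = [3.0, -2.0, 1.5, 4.0, -1.0]    # truly sparse coefficient vector
y = X @ beta + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("non-zero coefficients (Lasso):", np.sum(np.abs(lasso.coef_) > 1e-6))
print("non-zero coefficients (Ridge):", np.sum(np.abs(ridge.coef_) > 1e-6))
print("estimation error (Lasso):", np.linalg.norm(lasso.coef_ - beta))
print("estimation error (Ridge):", np.linalg.norm(ridge.coef_ - beta))

On a run like this the Lasso typically keeps only a handful of coefficients, close to the true support, while Ridge spreads small weight over all 200 predictors and incurs a noticeably larger estimation error.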

ℓ1 vs ℓ2 for prediction:
Typically, ridge (ℓ2) penalties are much better than ℓ1 penalties at minimizing prediction error. The reason is that when two predictors are highly correlated, the ℓ1 regularizer will simply pick one of the two, whereas the ℓ2 regularizer keeps both and jointly shrinks the corresponding coefficients a little bit. Thus, while the ℓ1 penalty can certainly reduce overfitting, you may also experience a loss in predictive power.
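A quick way to see this behaviour (again an illustrative sketch with made-up numbers, not from the original answer): fit Lasso and Ridge on two almost perfectly correlated predictors that both carry signal.

# Sketch: two highly correlated predictors. The l1 penalty tends to keep
# one and zero out the other; the l2 penalty shrinks both toward similar values.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 500
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)    # nearly identical to x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + 0.1 * rng.standard_normal(n)

print("Lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
# Typical output: Lasso puts almost all of the weight on one coordinate,
# while Ridge splits the weight roughly equally between the two.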

A Clarification on ℓ1-regularization for Exact Sparse Signal Recovery:
However, I want to comment on a frequently used analogy: that ℓ1 regularization is *equivalent* to MAP estimation with Laplacian priors. The notion of equivalence here is very subtle.
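For reference, here is where the analogy comes from, written out as a standard textbook derivation (the Gaussian-noise and i.i.d. Laplace(0, b) prior assumptions are the usual ones, not anything specific to this post). With y | β ~ N(Xβ, σ²I) and an i.i.d. Laplacian prior on the coefficients, the MAP estimate minimizes an ℓ1-penalized least-squares cost:

\[
\hat{\beta}_{\mathrm{MAP}}
= \arg\min_{\beta}\Big[-\log p(y \mid \beta) - \log p(\beta)\Big]
= \arg\min_{\beta}\ \frac{1}{2\sigma^{2}}\,\|y - X\beta\|_{2}^{2} \;+\; \frac{1}{b}\,\|\beta\|_{1},
\]

i.e. the Lasso objective \(\min_{\beta}\|y - X\beta\|_{2}^{2} + \lambda\|\beta\|_{1}\) with \(\lambda = 2\sigma^{2}/b\). The subtlety is about what this identity does and does not imply, as discussed below.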

Remember that the true signal is sparse if it has exactly k non-zero coefficients, and approximately sparse if it has k large coefficients with the rest decaying to zero quickly. ℓ1 regularization doesn’t merely encourage sparse solutions; it is capable of exactly recovering a signal that is sparse.
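One common way to formalize these two notions (the standard compressed-sensing definitions; the power-law form of the decay is just one convention):

\[
\text{exactly sparse:}\qquad \|x\|_{0} = \#\{\,i : x_{i} \neq 0\,\} \le k,
\]
\[
\text{approximately sparse (compressible):}\qquad |x|_{(1)} \ge |x|_{(2)} \ge \cdots,
\quad |x|_{(i)} \le C\, i^{-q}\ \text{for a sufficiently large decay exponent } q,
\]

where \(|x|_{(i)}\) denotes the i-th largest coefficient magnitude.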

Between 1999 and 2005, many exciting results in statistics and signal processing [3-6] demonstrated that if the underlying signal is extremely sparse and the design matrix satisfies certain conditions, the solution of the ℓ1-regularized objective coincides with that of the ℓ0-regularized (best subset selection) objective, even though the overall problem is under-determined and high-dimensional. This would not be possible with ℓ2 regularization.
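As a toy numerical illustration of this kind of exact-recovery statement (a sketch only: the dimensions, the Gaussian design, and the use of scipy's LP solver are my own choices, not the constructions from the cited papers), one can solve basis pursuit, min ‖x‖₁ subject to Ax = y, for an under-determined system and check that the truly sparse signal comes back, while the minimum-ℓ2-norm solution does not.

# Sketch: exact recovery of a sparse signal from an under-determined system
# via basis pursuit, min ||x||_1 s.t. A x = y, written as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
m, n, k = 40, 100, 4                      # 40 measurements, 100 unknowns, 4 non-zeros
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

# Variables z = [x, u] with |x_i| <= u_i; minimize sum(u) subject to A x = y.
c = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
A_ub = np.vstack([np.hstack([I, -I]),     #  x - u <= 0
                  np.hstack([-I, -I])])   # -x - u <= 0
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
              A_eq=np.hstack([A, np.zeros((m, n))]), b_eq=y,
              bounds=(None, None), method="highs")
x_l1 = res.x[:n]

# For contrast, the minimum-l2-norm solution A^T (A A^T)^{-1} y is dense.
x_l2 = A.T @ np.linalg.solve(A @ A.T, y)
print("l1 recovery error:", np.max(np.abs(x_l1 - x_true)))   # typically ~ solver tolerance
print("l2 recovery error:", np.max(np.abs(x_l2 - x_true)))   # typically far from zero

Here 40 Gaussian measurements of a 4-sparse, 100-dimensional signal are typically enough for the ℓ1 program to return the true signal essentially exactly, whereas the minimum-norm ℓ2 solution is dense and far from it.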

An analogous question to ask when performing MAP estimation with Laplacian priors would be:

“What class of signals does such a cost function recover accurately?”

The bottom line here is that the geometric intuition that ℓ1 regularization is *like* MAP estimation with a Laplacian prior does not mean that Laplacian distributions can be used to describe sparse or compressible signals.

A recent paper by Gribonval et al. [1] demonstrated the following:

“many distributions revolving around maximum a posteriori (MAP) interpretation of sparse regularized estimators are in fact incompressible, in the limit of large problem sizes. We especially highlight the Laplace distribution and ℓ1 regularized estimators such as the Lasso and Basis Pursuit denoising. We rigorously disprove the myth that the success of ℓ1 minimization for compressed sensing image reconstruction is a simple corollary of a Laplace model of images combined with Bayesian MAP estimation, and show that in fact quite the reverse is true.”

This paper [1] proves that many instances of signals drawn from a Laplacian distribution are simply not sparse enough to be good candidates for ℓ1-style recovery. In fact, such signals are better estimated using ordinary least squares! (See Fig. 4 of [1] for an illustration.)
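To get a feel for this numerically (a rough sketch under my own choice of sizes and decay rate, not a reproduction of the paper's experiment): draw a long i.i.d. Laplacian vector, keep only its k largest entries, and compare the relative approximation error with that of a genuinely power-law-decaying (compressible) signal.

# Sketch: i.i.d. Laplacian draws are not compressible -- their best k-term
# approximation leaves most of the energy behind, unlike a power-law signal.
import numpy as np

rng = np.random.default_rng(3)
N, k = 100_000, 1_000                     # keep the best 1% of entries

def rel_error_best_k(x, k):
    """Relative l2 error after keeping only the k largest-magnitude entries."""
    idx = np.argsort(np.abs(x))[::-1]
    x_k = np.zeros_like(x)
    x_k[idx[:k]] = x[idx[:k]]
    return np.linalg.norm(x - x_k) / np.linalg.norm(x)

laplacian = rng.laplace(scale=1.0, size=N)
signs = rng.choice([-1.0, 1.0], size=N)
power_law = signs * np.arange(1, N + 1, dtype=float) ** -1.5   # compressible

print("Laplacian, best-k relative error:", rel_error_best_k(laplacian, k))   # close to 1
print("power-law, best-k relative error:", rel_error_best_k(power_law, k))   # close to 0

The i.i.d. Laplacian vector keeps most of its energy outside its top 1% of entries, so a sparse approximation of it is poor, which is the sense in which such signals are not good candidates for ℓ1-style recovery.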

Update: All the theoretical results show that sparse or approximately sparse signals can be recovered effectively by minimizing an ℓ1-regularized cost function. But you cannot assume that, just because Laplacian priors have a “sparsifying” effect when used in a cost function, the same distribution can be used as a generative model for the signal.

Aleks Jakulin pointed out in the comments that it is not a standard assumption in Bayesian statistics that the data are drawn from the prior. However, in areas outside statistics, particularly information-theoretic fields, one often assumes that the source/signal is drawn from some distribution. The Gribonval result was an important clarification for those who strongly care about the equivalence of ℓ0 and ℓ1 solutions in signal processing and communication theory: the theoretical results for exact recovery of sparse signals do not apply if you assume that the geometric intuition of the compressible signal belonging to the ℓ1 ball (see the ℓp-ball figure above) is equivalent to the probabilistic or generative-model interpretation that the source signal is i.i.d. Laplacian.

[1] http://arxiv.org/pdf/1102.1249v3…
[2] Compressible signals
[3] Compressed sensing
[4] Uncertainty principles and ideal atomic decomposition
[5] Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information
