From Quora https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization

Justin Solomon has a great answer on the difference between L1 and L2 norms and the implications for regularization.

**ℓ1 vs ℓ2 for signal estimation:**

Here is what a signal that is sparse or approximately sparse, i.e. one that belongs to the ℓ1 ball, looks like. It becomes extremely unlikely that an **ℓ2** penalty can recover a sparse signal, since very few solutions of such a cost function are truly sparse. **ℓ1** penalties, on the other hand, are great for recovering truly sparse signals, as they are computationally tractable but still capable of recovering the exact sparse solution.

**ℓ2** penalization is preferable for data that is not at all sparse, i.e. where you do not expect the regression coefficients to show a decaying property. In such cases, incorrectly using an **ℓ1** penalty for non-sparse data will give you a large estimation error.

*Figure: ℓp ball. As the value of p decreases, the size of the corresponding ℓp space also decreases. This can be seen visually when comparing the size of the spaces of signals, in three dimensions, for which the ℓp norm is less than or equal to one. The volume of these ℓp "balls" decreases with p. [2]*

**ℓ1 vs ℓ2 for prediction:**

Typically, ridge (**ℓ2**) penalties are much better for minimizing prediction error than **ℓ1** penalties. The reason is that when two predictors are highly correlated, the **ℓ1** regularizer will simply pick one of the two, whereas the **ℓ2** regularizer will keep both and jointly shrink the corresponding coefficients a little bit. Thus, while the **ℓ1** penalty can certainly reduce overfitting, you may also experience a loss in predictive power.
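The correlated-predictor behaviour described above is easy to see with scikit-learn. In this sketch (an illustration, not from the original answer), two predictors are nearly identical copies and both carry signal; the penalty strengths are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)      # x2 is nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.standard_normal(n)   # both predictors matter equally

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("lasso coefficients:", lasso.coef_)    # one coefficient driven to (near) zero
print("ridge coefficients:", ridge.coef_)    # both kept, each shrunk a little
```

The lasso concentrates almost all of the weight on one of the two columns, while ridge splits the weight roughly evenly between them, which is exactly the joint-shrinkage behaviour that tends to predict better under correlation.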

**A Clarification on ℓ1-regularization for Exact Sparse Signal Recovery:**

However, I want to comment on a frequently used analogy: that **ℓ1**-regularization is *equivalent* to MAP estimation using Laplacian priors. The notion of equivalence here is very subtle.

Remember, the true signal is exactly sparse if its coefficients have exactly k non-zeros, and approximately sparse if it has k really large coefficients with the rest of the coefficients decaying to zero quickly. **ℓ1** regularization doesn't merely encourage sparse solutions, but is capable of exactly recovering a signal that is sparse.
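A small numpy sketch of these two notions of sparsity (sizes and decay profile are illustrative choices), judged by the best k-term approximation error:

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 1000, 50

# Exactly sparse: k non-zeros, everything else identically zero
exact = np.zeros(p)
exact[rng.choice(p, k, replace=False)] = rng.standard_normal(k)

# Approximately sparse (compressible): magnitudes decay like a power law
decay = rng.standard_normal(p) / np.arange(1, p + 1)

def best_k_term_error(x, k):
    """Relative ell-2 error after keeping only the k largest-magnitude entries."""
    tail = np.sort(np.abs(x))[:-k]          # the p - k smallest magnitudes
    return np.linalg.norm(tail) / np.linalg.norm(x)

print(best_k_term_error(exact, k))          # 0.0: the k non-zeros are everything
print(best_k_term_error(decay, k))          # small: the tail carries little energy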

Between 1999 and 2005, many exciting results in statistics and signal processing [3-6] demonstrated that if the underlying signal was extremely sparse and the design matrix satisfied certain conditions, the solution to the **ℓ1**-regularized objective would coincide with the solution to the **ℓ0**-regularized (best subset selection) objective, despite the problem being under-determined and high dimensional overall. This would not be possible with **ℓ2** regularization.

An analogous question when performing MAP estimation using Laplacian priors would be,

**“What class of signals does such a cost function recover accurately?”**

The bottom line here is that the geometric intuition that **ℓ1**-regularization is *like* Laplacian-prior MAP estimation does not mean that Laplacian distributions can be used to describe sparse or compressible signals.

A recent paper by Gribonval et al. [1] demonstrated the following:

> many distributions revolving around maximum a posteriori (MAP) interpretation of sparse regularized estimators are in fact incompressible, in the limit of large problem sizes. We especially highlight the Laplace distribution and ℓ1 regularized estimators such as the Lasso and Basis Pursuit denoising. We rigorously disprove the myth that the success of ℓ1 minimization for compressed sensing image reconstruction is a simple corollary of a Laplace model of images combined with Bayesian MAP estimation, and show that in fact quite the reverse is true.

**This paper [1] proves that many instances of signals drawn from a Laplacian distribution are simply not sparse enough to be good candidates for ℓ1-style recovery. In fact, such signals are better estimated using ordinary least squares! An illustration of Fig. 4 from the paper is provided below.**

Update: All the theoretical results show that sparse or approximately sparse signals can be recovered effectively by minimizing an **ℓ1**-regularized cost function. But you cannot assume that, just because Laplacian priors have a “sparsifying” property when used in a cost function, the same distribution can be used as a generative model for the signal.
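The Gribonval et al. point can be checked numerically: an i.i.d. Laplace draw is a poor model for a compressible signal. In this sketch (sizes and the power-law exponent are illustrative choices), even after keeping the 10% largest-magnitude entries of a Laplace vector, most of its ℓ2 energy remains in the discarded entries, whereas a power-law-decaying signal of the same size is approximated well.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 10_000, 1_000                        # keep the k = p/10 largest entries

laplace = rng.laplace(0.0, 1.0, p)                               # i.i.d. Laplace
compressible = rng.standard_normal(p) / np.arange(1, p + 1) ** 0.9  # power-law decay

def rel_tail_energy(x, k):
    """Relative ell-2 norm of everything outside the k largest-magnitude entries."""
    tail = np.sort(np.abs(x))[:-k]          # the p - k smallest magnitudes
    return np.linalg.norm(tail) / np.linalg.norm(x)

print(rel_tail_energy(laplace, k))          # roughly 0.6: not compressible
print(rel_tail_energy(compressible, k))     # small: well approximated by k terms
```

In other words, a typical vector drawn from the Laplace prior is not itself sparse or compressible, which is exactly why the generative reading of the ℓ1/MAP analogy fails.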

Aleks Jakulin pointed out in the comments that it is not a standard assumption in Bayesian statistics that the data is drawn from the prior. However, in areas outside statistics, particularly information-theoretic fields, one often assumes that the source/signal is drawn from some distribution. The Gribonval result was an important clarification for those who strongly care about the equivalence of **ℓ0** and **ℓ1** solutions in signal processing and communication theory: the theoretical results for exact recovery of sparse signals do not apply if you assume that the geometric intuition of a compressible signal belonging to the ℓ1-ball (see the figure above) is equivalent to the probabilistic or generative-model interpretation that the source signal is i.i.d. Laplacian.

[1] http://arxiv.org/pdf/1102.1249v3…

[2] Compressible signals

[3] Compressed sensing

[4] Uncertainty principles and ideal atomic decomposition

[5] Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information