Regularized least squares
Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.
RLS is used for two main reasons. The first comes up when the number of variables in the linear system exceeds the number of observations. In such settings, the ordinary least-squares problem is ill-posed and is therefore impossible to fit because the associated optimization problem has infinitely many solutions. RLS allows the introduction of further constraints that uniquely determine the solution.
The second reason for using RLS arises when the learned model suffers from poor generalization. RLS can be used in such cases to improve the generalizability of the model by constraining it at training time. This constraint can either force the solution to be "sparse" in some way or to reflect other prior knowledge about the problem such as information about correlations between features. A Bayesian understanding of this can be reached by showing that RLS methods are often equivalent to priors on the solution to the least-squares problem.
General formulation
Consider a learning setting given by a probability space [math]\displaystyle{ (X \times Y, \rho(X,Y)) }[/math], [math]\displaystyle{ Y \in R }[/math]. Let [math]\displaystyle{ S=\{x_i,y_i\}_{i=1}^n }[/math] denote a training set of [math]\displaystyle{ n }[/math] pairs drawn i.i.d. with respect to [math]\displaystyle{ \rho }[/math]. Let [math]\displaystyle{ V:Y \times R \rightarrow [0;\infty) }[/math] be a loss function. Define [math]\displaystyle{ F }[/math] as the space of functions such that the expected risk:
 [math]\displaystyle{ \varepsilon(f) = \int V(y,f(x)) \, d\rho(x,y) }[/math]
is well defined. The main goal is to minimize the expected risk:
 [math]\displaystyle{ \inf_{f \in F}\varepsilon(f) }[/math]
Since the problem cannot be solved exactly there is a need to specify how to measure the quality of a solution. A good learning algorithm should provide an estimator with a small risk.
As the joint distribution [math]\displaystyle{ \rho }[/math] is typically unknown, the empirical risk is taken. For regularized least squares the square loss function is introduced:
 [math]\displaystyle{ \varepsilon(f) = \frac{1}{n}\sum_{i=1}^n V(y_i,f(x_i)) = \frac{1}{n}\sum_{i=1}^n(y_i-f(x_i))^2 }[/math]
However, if the functions are from a relatively unconstrained space, such as the set of square-integrable functions on [math]\displaystyle{ X }[/math], this approach may overfit the training data and lead to poor generalization. Thus, the complexity of the function [math]\displaystyle{ f }[/math] should be constrained or penalized in some way. In RLS, this is accomplished by choosing functions from a reproducing kernel Hilbert space (RKHS) [math]\displaystyle{ \mathcal {H} }[/math], and adding a regularization term to the objective function, proportional to the norm of the function in [math]\displaystyle{ \mathcal {H} }[/math]:
 [math]\displaystyle{ \inf_{f \in F}\varepsilon(f) + \lambda R(f), \lambda \gt 0 }[/math]
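As a minimal sketch of the regularized objective above, the following Python snippet evaluates the empirical square-loss risk plus a penalty for one candidate function. The toy data, the helper name `regularized_risk`, and the choice of a squared-norm penalty are assumptions made for illustration and are not part of the formulation itself.

```python
import numpy as np

def regularized_risk(f, penalty, X, y, lam):
    """Empirical square-loss risk of f on (X, y) plus lam * penalty.

    `penalty` is the already-evaluated value R(f) for the candidate f.
    """
    predictions = np.array([f(x) for x in X])
    return np.mean((y - predictions) ** 2) + lam * penalty

# Toy training set (made up for illustration): y = 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

# Candidate linear function f(x) = w^T x, with assumed penalty R(f) = w^T w.
w = np.array([1.5])
risk = regularized_risk(lambda x: w @ x, penalty=w @ w, X=X, y=y, lam=0.1)
print(risk)
```

Increasing `lam` makes the penalty dominate the data-fit term, which is the mechanism by which the regularization parameter constrains the solution.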
Kernel formulation
Definition of RKHS
An RKHS can be defined by a symmetric positive-definite kernel function [math]\displaystyle{ K(x,z) }[/math] with the reproducing property:
 [math]\displaystyle{ \langle K_x,f\rangle_{\mathcal{H}}=f(x), }[/math]
where [math]\displaystyle{ K_x(z)=K(x,z) }[/math]. The RKHS for a kernel [math]\displaystyle{ K }[/math] consists of the completion of the space of functions spanned by [math]\displaystyle{ \left\{ K_x\mid x \in X\right\} }[/math]: [math]\displaystyle{ f(x)=\sum_{i=1}^n \alpha_i K_{x_i}(x),\, f\in\mathcal{H} }[/math], where all [math]\displaystyle{ \alpha_i }[/math] are real numbers. Some commonly used kernels include the linear kernel, inducing the space of linear functions:
 [math]\displaystyle{ K(x,z)=x^T z, }[/math]
the polynomial kernel, inducing the space of polynomial functions of order [math]\displaystyle{ d }[/math]:
 [math]\displaystyle{ K(x,z)=(x^T z+1)^d, }[/math]
and the Gaussian kernel:
 [math]\displaystyle{ K(x,z)=e^{-\frac{\|x-z\|^2}{\sigma^2}}. }[/math]
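The three kernels listed above can be sketched in Python as follows. The function names, the default parameter values, and the small data set are hypothetical choices made for this example.

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x^T z."""
    return float(np.dot(x, z))

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (x^T z + 1)^d."""
    return float((np.dot(x, z) + 1.0) ** d)

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma ** 2))

def kernel_matrix(kernel, A, B):
    """Matrix whose (i, j) entry is kernel(A[i], B[j])."""
    return np.array([[kernel(a, b) for b in B] for a in A])

# Small made-up data set of three points in the plane.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = kernel_matrix(gaussian_kernel, X, X)
print(K)
```

The resulting matrix is symmetric with ones on the diagonal, and its eigenvalues are non-negative, as expected for a positive-definite kernel.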
Note that for an arbitrary loss function [math]\displaystyle{ V }[/math], this approach defines a general class of algorithms named Tikhonov regularization. For instance, using the hinge loss leads to the support vector machine algorithm, and using the epsilon-insensitive loss leads to support vector regression.
Arbitrary kernel
The representer theorem guarantees that the solution can be written as:
 [math]\displaystyle{ f(x) = \sum_{i=1}^n c_i K(x_i,x) }[/math] for some [math]\displaystyle{ c \in \mathbb R^n }[/math].
The minimization problem can be expressed as:
 [math]\displaystyle{ \min_{c \in\mathbb R^n}\frac{1}{n}\|Y-Kc\|^2_{\mathbb R^n} + \lambda\|f\|^2_H, }[/math]
where, with some abuse of notation, the [math]\displaystyle{ i,j }[/math] entry of kernel matrix [math]\displaystyle{ K }[/math] (as opposed to kernel function [math]\displaystyle{ K(\cdot, \cdot) }[/math]) is [math]\displaystyle{ K(x_i, x_j) }[/math].
For such a function,
 [math]\displaystyle{ \begin{align} & \|f\|^2_H = \langle f,f \rangle_{H} =\left\langle \sum_{i=1}^n c_i K(x_i,\cdot), \sum_{j=1}^n c_j K(x_{j},\cdot) \right\rangle_H \\ = {} & \sum_{i=1}^n \sum_{j=1}^n c_i c_j \langle K(x_i,\cdot), K(x_j,\cdot) \rangle_H = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i,x_j) = c^T Kc, \end{align} }[/math]
The following minimization problem can be obtained:
 [math]\displaystyle{ \min_{c \in\mathbb R^n}\frac{1}{n}\|Y-Kc\|^2_{\mathbb R^n} + \lambda c^T Kc }[/math].
As the sum of convex functions is convex, the objective is convex (and the solution is unique when the kernel matrix is positive definite), so its minimum can be found by setting the gradient with respect to [math]\displaystyle{ c }[/math] to [math]\displaystyle{ 0 }[/math]:
 [math]\displaystyle{ -\frac{1}{n}K(Y-Kc) + \lambda Kc = 0 \Rightarrow K(K+\lambda n I)c = K Y \Rightarrow c = (K+\lambda n I)^{-1}Y, }[/math]
where [math]\displaystyle{ c \in\mathbb R^n. }[/math]
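The closed-form solution for the coefficient vector can be sketched in Python as below. The Gaussian kernel, the synthetic data, and the helper names are assumptions chosen for illustration. The snippet solves the linear system rather than forming an explicit inverse, and then checks that the gradient of the objective vanishes at the computed coefficients.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """Matrix with entries exp(-||a_i - b_j||^2 / sigma^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma ** 2)

def fit_kernel_rls(K, Y, lam):
    """Solve (K + lam * n * I) c = Y for the coefficient vector c."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), Y)

# Synthetic one-dimensional regression problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 1))
Y = np.sin(3.0 * X[:, 0])

lam, sigma = 1e-2, 0.5
K = gaussian_kernel_matrix(X, X, sigma)
c = fit_kernel_rls(K, Y, lam)

# Gradient of (1/n)||Y - Kc||^2 + lam * c^T K c, divided by 2.
n = len(Y)
gradient = -K @ (Y - K @ c) / n + lam * K @ c
print(np.abs(gradient).max())
```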
Complexity
The complexity of training is basically the cost of computing the kernel matrix plus the cost of solving the linear system which is roughly [math]\displaystyle{ O(n^3) }[/math]. The computation of the kernel matrix for the linear or Gaussian kernel is [math]\displaystyle{ O(n^2 D) }[/math]. The complexity of testing is [math]\displaystyle{ O(n) }[/math].
Prediction
The prediction at a new test point [math]\displaystyle{ x_{*} }[/math] is:
 [math]\displaystyle{ f(x_{*}) = \sum_{i=1}^n c_i K(x_i,x_{*}) = K(X,X_{*})^T c }[/math]
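A minimal sketch of prediction at several test points, assuming a Gaussian kernel and synthetic training data (both are illustrative choices, as are the function names):

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """Matrix with entries exp(-||a_i - b_j||^2 / sigma^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma ** 2)

def predict(X_train, c, X_test, sigma):
    """Evaluate f(x*) = sum_i c_i K(x_i, x*) for every row x* of X_test."""
    return gaussian_kernel_matrix(X_train, X_test, sigma).T @ c

# Synthetic training data (made up for illustration).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(20, 1))
Y = np.sin(3.0 * X_train[:, 0])

# Fit the coefficients c = (K + lam * n * I)^{-1} Y.
lam, sigma = 1e-2, 0.5
n = len(Y)
K = gaussian_kernel_matrix(X_train, X_train, sigma)
c = np.linalg.solve(K + lam * n * np.eye(n), Y)

X_test = np.array([[-0.5], [0.0], [0.5]])
predictions = predict(X_train, c, X_test, sigma)
print(predictions)
```

Each prediction requires one kernel evaluation per training point, which is the source of the linear cost per test point.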
Linear kernel
For convenience a vector notation is introduced. Let [math]\displaystyle{ X }[/math] be an [math]\displaystyle{ n\times d }[/math] matrix, where the rows are input vectors, and [math]\displaystyle{ Y }[/math] an [math]\displaystyle{ n\times 1 }[/math] vector where the entries are the corresponding outputs. In terms of vectors, the kernel matrix can be written as [math]\displaystyle{ \operatorname K=\operatorname X\operatorname X^T }[/math]. The learning function can be written as:
 [math]\displaystyle{ f(x_{*}) = \operatorname K_{x_{*}}c = x_{*}^T \operatorname X^T c = x_{*}^T w }[/math]
Here we define [math]\displaystyle{ w = X^T c, w \in R^d }[/math]. The objective function can be rewritten as:
 [math]\displaystyle{ \begin{align} & \frac{1}{n}\|Y-\operatorname Kc\|^2_{\mathbb R^n}+\lambda c^T \operatorname Kc \\[4pt] = {} & \frac{1}{n}\|y-\operatorname X\operatorname X^T c\|^2_{\mathbb R^n}+\lambda c^T \operatorname X\operatorname X^T c = \frac{1}{n}\|y-\operatorname Xw\|^2_{\mathbb R^n}+\lambda \|w\|^2_{\mathbb R^d} \end{align} }[/math]
The first term is the objective function from ordinary least squares (OLS) regression, corresponding to the residual sum of squares. The second term is a regularization term, not present in OLS, which penalizes large [math]\displaystyle{ w }[/math] values. Since this is a smooth finite-dimensional problem, standard calculus tools can be applied. To minimize the objective function, the gradient is calculated with respect to [math]\displaystyle{ w }[/math] and set to zero:
 [math]\displaystyle{ \operatorname X^T \operatorname Xw-\operatorname X^T y+\lambda n w=0 }[/math]
 [math]\displaystyle{ w=(\operatorname X^T \operatorname X+\lambda n \operatorname I)^{-1}\operatorname X^T y }[/math]
This solution closely resembles that of standard linear regression, with an extra term [math]\displaystyle{ \lambda n \operatorname I }[/math]. If the assumptions of OLS regression hold, the solution [math]\displaystyle{ w=(\operatorname X^T\operatorname X)^{-1}\operatorname X^T y }[/math], with [math]\displaystyle{ \lambda=0 }[/math], is an unbiased estimator, and is the minimum-variance linear unbiased estimator, according to the Gauss–Markov theorem. The term [math]\displaystyle{ \lambda n \operatorname I }[/math] therefore leads to a biased solution; however, it also tends to reduce variance. This is easy to see, as the covariance matrix of the [math]\displaystyle{ w }[/math]-values is proportional to [math]\displaystyle{ (\operatorname X^T \operatorname X+\lambda n \operatorname I)^{-1} }[/math], and therefore large values of [math]\displaystyle{ \lambda }[/math] will lead to lower variance. Therefore, manipulating [math]\displaystyle{ \lambda }[/math] corresponds to trading off bias and variance. For problems with high-variance [math]\displaystyle{ w }[/math] estimates, such as cases with relatively small [math]\displaystyle{ n }[/math] or with correlated regressors, the optimal prediction accuracy may be obtained by using a nonzero [math]\displaystyle{ \lambda }[/math], and thus introducing some bias to reduce variance. Furthermore, it is not uncommon in machine learning to have cases where [math]\displaystyle{ n\lt d }[/math], in which case [math]\displaystyle{ X^T X }[/math] is rank-deficient, and a nonzero [math]\displaystyle{ \lambda }[/math] is necessary to compute [math]\displaystyle{ (\operatorname X^T \operatorname X+\lambda n \operatorname I)^{-1} }[/math].
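The closed-form solution for the linear kernel can be sketched in Python as follows. The synthetic data and function names are assumptions for illustration. The snippet computes the weights both directly in the [math]\displaystyle{ d }[/math]-dimensional form and through the kernel coefficients with [math]\displaystyle{ w = X^T c }[/math], and it also compares a small and a large value of [math]\displaystyle{ \lambda }[/math] to show the shrinkage of the weights.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """w = (X^T X + lam * n * I)^{-1} X^T y, a d x d linear system."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def ridge_dual(X, y, lam):
    """w = X^T c with c = (X X^T + lam * n * I)^{-1} y, an n x n linear system."""
    n = X.shape[0]
    c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)
    return X.T @ c

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
n, d = 30, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=n)

w_small = ridge_primal(X, y, lam=0.1)
w_large = ridge_primal(X, y, lam=10.0)
w_from_dual = ridge_dual(X, y, lam=0.1)

print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

The two routes give the same weights, so the cheaper one can be chosen depending on whether [math]\displaystyle{ n }[/math] or [math]\displaystyle{ d }[/math] is smaller.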
Complexity
The parameter [math]\displaystyle{ \lambda }[/math] controls the invertibility of the matrix [math]\displaystyle{ X^T X + \lambda n I }[/math]. Several methods can be used to solve the above linear system, Cholesky decomposition being probably the method of choice, since the matrix [math]\displaystyle{ X^T X + \lambda n I }[/math] is symmetric and positive definite. The complexity of this method is [math]\displaystyle{ O(nD^2) }[/math] for training and [math]\displaystyle{ O(D) }[/math] for testing. The cost [math]\displaystyle{ O(nD^2) }[/math] is essentially that of computing [math]\displaystyle{ X^T X }[/math], whereas the inverse computation (or rather the solution of the linear system) is roughly [math]\displaystyle{ O(D^3) }[/math].
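A minimal sketch of the Cholesky-based solve, using only NumPy. The function name and data are hypothetical. For brevity the two triangular systems are solved with the general-purpose `np.linalg.solve`, which does not exploit the triangular structure; a dedicated triangular solver would normally be used instead.

```python
import numpy as np

def ridge_cholesky(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y via a Cholesky factorization."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # forming X^T X dominates the cost when n > d
    b = X.T @ y
    L = np.linalg.cholesky(A)           # A = L L^T with L lower triangular
    z = np.linalg.solve(L, b)           # forward step: L z = b
    return np.linalg.solve(L.T, z)      # backward step: L^T w = z

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
lam = 0.05

w_chol = ridge_cholesky(X, y, lam)
w_direct = np.linalg.solve(X.T @ X + lam * 100 * np.eye(4), X.T @ y)
print(np.abs(w_chol - w_direct).max())
```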
Feature maps and Mercer's theorem
In this section it will be shown how to extend RLS to any kind of reproducing kernel K. Instead of the linear kernel, a feature map [math]\displaystyle{ \Phi: X \rightarrow F }[/math] is considered for some Hilbert space [math]\displaystyle{ F }[/math], called the feature space. In this case the kernel is defined as:
 [math]\displaystyle{ K(x,x') = \langle \Phi(x), \Phi(x') \rangle_F. }[/math]
The matrix [math]\displaystyle{ X }[/math] is now replaced by the new data matrix [math]\displaystyle{ \Phi }[/math], where [math]\displaystyle{ \Phi_{ij} = \varphi_j(x_i) }[/math], or the [math]\displaystyle{ j }[/math]th component of [math]\displaystyle{ \varphi(x_i) }[/math].
This means that, for a given training set, [math]\displaystyle{ K = \Phi \Phi^T }[/math]. Thus, the objective function can be written as
 [math]\displaystyle{ \min_{c \in \mathbb R^n}\|Y - \Phi \Phi^T c \|^2_{\mathbb R^n} + \lambda c^T \Phi \Phi^T c. }[/math]
This approach is known as the kernel trick. This technique can significantly simplify the computational operations. If [math]\displaystyle{ F }[/math] is high dimensional, computing [math]\displaystyle{ \varphi(x_i) }[/math] may be rather intensive. If the explicit form of the kernel function is known, we just need to compute and store the [math]\displaystyle{ n\times n }[/math] kernel matrix [math]\displaystyle{ \operatorname K }[/math].
In fact, the Hilbert space [math]\displaystyle{ F }[/math] need not be isomorphic to [math]\displaystyle{ \mathbb{R}^m }[/math], and can be infinite dimensional. This follows from Mercer's theorem, which states that a continuous, symmetric, positive definite kernel function can be expressed as
 [math]\displaystyle{ K(x,z)=\sum_{i=1}^\infty \sigma_i e_i(x) e_i(z) }[/math]
where [math]\displaystyle{ e_i(x) }[/math] form an orthonormal basis for [math]\displaystyle{ \ell^2(X) }[/math], and [math]\displaystyle{ \sigma_i \in\mathbb{R} }[/math]. If a feature map [math]\displaystyle{ \varphi(x) }[/math] is defined with components [math]\displaystyle{ \varphi_i(x)=\sqrt{\sigma_i}e_i(x) }[/math], it follows that [math]\displaystyle{ K(x,z)=\langle\varphi(x),\varphi(z)\rangle }[/math]. This demonstrates that any kernel can be associated with a feature map, and that RLS generally consists of linear RLS performed in some possibly higher-dimensional feature space. While Mercer's theorem shows how one feature map can be associated with a kernel, in fact multiple feature maps can be associated with a given reproducing kernel. For instance, the map [math]\displaystyle{ \varphi(x)=K_x }[/math] satisfies the property [math]\displaystyle{ K(x,z)=\langle\varphi(x), \varphi(z) \rangle }[/math] for an arbitrary reproducing kernel.
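The correspondence between a kernel and a feature map can be illustrated with the degree-2 polynomial kernel on two-dimensional inputs, for which one explicit six-dimensional feature map is known. The sketch below (data, names, and the value of the regularization parameter are illustrative assumptions, and the same [math]\displaystyle{ \lambda n }[/math] scaling is used in both solves) checks that kernel RLS and linear RLS in the feature space give the same prediction.

```python
import numpy as np

def phi(x):
    """One explicit feature map for K(x, z) = (x^T z + 1)^2 with x in R^2."""
    x1, x2 = x
    r = np.sqrt(2.0)
    return np.array([1.0, r * x1, r * x2, x1 ** 2, x2 ** 2, r * x1 * x2])

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)
lam = 0.1

Phi = np.array([phi(x) for x in X])   # n x 6 data matrix in feature space
K = (X @ X.T + 1.0) ** 2              # kernel matrix, computed without phi

# Kernel RLS coefficients and linear RLS weights in feature space.
c = np.linalg.solve(K + lam * n * np.eye(n), y)
w = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(6), Phi.T @ y)

x_new = np.array([0.3, -0.7])
pred_kernel = ((X @ x_new + 1.0) ** 2) @ c   # sum_i c_i K(x_i, x_new)
pred_feature = phi(x_new) @ w                # <phi(x_new), w>
print(pred_kernel, pred_feature)
```

The kernel route never builds the feature vectors, which is what makes it attractive when the feature space is high dimensional.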
Bayesian interpretation
Least squares can be viewed as a likelihood maximization under an assumption of normally distributed residuals. This is because the exponent of the Gaussian distribution is quadratic in the data, and so is the least-squares objective function. In this framework, the regularization terms of RLS can be understood to be encoding priors on [math]\displaystyle{ w }[/math]. For instance, Tikhonov regularization corresponds to a normally distributed prior on [math]\displaystyle{ w }[/math] that is centered at 0. To see this, first note that the OLS objective is proportional to the log-likelihood function when each sampled [math]\displaystyle{ y^i }[/math] is normally distributed around [math]\displaystyle{ w^T \cdot x^i }[/math]. Then observe that a normal prior on [math]\displaystyle{ w }[/math] centered at 0 has a log-probability of the form
 [math]\displaystyle{ \log P(w) = q - \alpha \sum_{j=1}^d w_j^2 }[/math]
where [math]\displaystyle{ q }[/math] and [math]\displaystyle{ \alpha }[/math] are constants that depend on the variance of the prior and are independent of [math]\displaystyle{ w }[/math]. Thus, maximizing the logarithm of the likelihood times the prior is equivalent to minimizing the sum of the OLS loss function and the ridge regression regularization term.
This gives a more intuitive interpretation for why Tikhonov regularization leads to a unique solution to the least-squares problem: there are infinitely many vectors [math]\displaystyle{ w }[/math] satisfying the constraints obtained from the data, but since we come to the problem with a prior belief that [math]\displaystyle{ w }[/math] is normally distributed around the origin, we will end up choosing a solution with this constraint in mind.
Other regularization methods correspond to different priors. See the list below for more details.
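The correspondence between a zero-centered normal prior and Tikhonov regularization can be checked numerically. The sketch below assumes Gaussian noise with variance `noise_var` and an isotropic normal prior with variance `prior_var` (both names and values are hypothetical). Under the [math]\displaystyle{ \frac{1}{n} }[/math] scaling of the loss used earlier, the matching regularization parameter is then `noise_var / (n * prior_var)`, and the gradient of the negative log-posterior vanishes at the ridge solution.

```python
import numpy as np

# Synthetic data drawn from the assumed model (made up for illustration).
rng = np.random.default_rng(0)
n, d = 40, 3
noise_var, prior_var = 0.25, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(scale=np.sqrt(prior_var), size=d)
y = X @ w_true + rng.normal(scale=np.sqrt(noise_var), size=n)

# Ridge solution with the regularization parameter implied by the prior.
lam = noise_var / (n * prior_var)
w_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of ||y - Xw||^2 / (2 noise_var) + ||w||^2 / (2 prior_var)."""
    return -X.T @ (y - X @ w) / noise_var + w / prior_var

grad_at_ridge = neg_log_posterior_grad(w_ridge)
print(np.abs(grad_at_ridge).max())
```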
Specific examples
Ridge regression (or Tikhonov regularization)
One particularly common choice for the penalty function [math]\displaystyle{ R }[/math] is the squared [math]\displaystyle{ \ell_2 }[/math] norm, i.e.,
 [math]\displaystyle{ R(w) = \sum_{j=1}^d w_j^2 }[/math]
 [math]\displaystyle{ \frac{1}{n}\|Y-\operatorname Xw\|^2_2+\lambda \sum_{j=1}^d w_j^2 \rightarrow \min_{w \in \mathbb{R}^d} }[/math]
This penalty is most commonly known as Tikhonov regularization or ridge regression. It admits a closed-form solution for [math]\displaystyle{ w }[/math]:
 [math]\displaystyle{ w = (X^T X + \lambda n I)^{-1} X^T Y }[/math]
The name ridge regression alludes to the fact that the [math]\displaystyle{ \lambda n I }[/math] term adds positive entries along the diagonal "ridge" of the sample covariance matrix [math]\displaystyle{ X^T X }[/math].
When [math]\displaystyle{ \lambda=0 }[/math], i.e., in the case of ordinary least squares, the condition that [math]\displaystyle{ d \gt n }[/math] causes the sample covariance matrix [math]\displaystyle{ X^T X }[/math] to not have full rank and so it cannot be inverted to yield a unique solution. This is why there can be an infinitude of solutions to the ordinary least squares problem when [math]\displaystyle{ d \gt n }[/math]. However, when [math]\displaystyle{ \lambda \gt 0 }[/math], i.e., when ridge regression is used, the addition of [math]\displaystyle{ \lambda n I }[/math] to the sample covariance matrix ensures that all of its eigenvalues will be strictly greater than 0. In other words, it becomes invertible, and the solution becomes unique.
Compared to ordinary least squares, ridge regression is not unbiased. It accepts bias to reduce variance and the mean square error.
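The rank argument for the case [math]\displaystyle{ d \gt n }[/math] can be illustrated numerically. In the sketch below, the data are random and the variable `shift` stands for the positive multiple of the identity added by ridge regression; both are assumptions made for the example. Two different weight vectors fit the data exactly without regularization, while the shifted matrix has all eigenvalues at least `shift` and yields a single solution.

```python
import numpy as np

# More variables than observations (made-up random data).
rng = np.random.default_rng(0)
n, d = 5, 10
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)           # cannot exceed n, so gram is singular

# Two different weight vectors that both reproduce y exactly.
w_a = np.linalg.pinv(X) @ y                  # minimum-norm least-squares solution
null_direction = np.linalg.svd(X)[2][-1]     # right singular vector with X @ v = 0
w_b = w_a + null_direction

# Adding a positive multiple of the identity lifts every eigenvalue.
shift = 0.5
eigenvalues = np.linalg.eigvalsh(gram + shift * np.eye(d))
w_ridge = np.linalg.solve(gram + shift * np.eye(d), X.T @ y)

print(rank, eigenvalues.min())
```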
Lasso regression
The least absolute shrinkage and selection operator (LASSO) method is another popular choice. In lasso regression, the lasso penalty function [math]\displaystyle{ R }[/math] is the [math]\displaystyle{ \ell_1 }[/math] norm, i.e.
 [math]\displaystyle{ R(w) = \sum_{j=1}^d \left| w_j \right| }[/math]
 [math]\displaystyle{ \frac{1}{n}\|Y-\operatorname Xw\|^2_2+\lambda \sum_{j=1}^d |w_j| \rightarrow \min_{w \in \mathbb{R}^d} }[/math]
Note that the lasso penalty function is convex but not strictly convex. Unlike Tikhonov regularization, this scheme does not have a convenient closed-form solution: instead, the solution is typically found using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least-angle regression algorithm.
An important difference between lasso regression and Tikhonov regularization is that lasso regression forces more entries of [math]\displaystyle{ w }[/math] to actually equal 0 than would otherwise. In contrast, while Tikhonov regularization forces entries of [math]\displaystyle{ w }[/math] to be small, it does not force more of them to be 0 than would be otherwise. Thus, LASSO regularization is more appropriate than Tikhonov regularization in cases in which we expect the number of nonzero entries of [math]\displaystyle{ w }[/math] to be small, and Tikhonov regularization is more appropriate when we expect that entries of [math]\displaystyle{ w }[/math] will generally be small but not necessarily zero. Which of these regimes is more relevant depends on the specific data set at hand.
Despite the feature selection described above, LASSO has some limitations. Ridge regression provides better accuracy in the case [math]\displaystyle{ n \gt d }[/math] for highly correlated variables.^{[1]} In the other case, [math]\displaystyle{ n \lt d }[/math], LASSO selects at most [math]\displaystyle{ n }[/math] variables. Moreover, LASSO tends to select an arbitrary variable from a group of highly correlated variables, so there is no grouping effect.
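One way to compute a lasso solution is proximal gradient descent with soft-thresholding, sketched below in Python. This is an illustrative implementation under stated assumptions, not a reference one: the step size is the reciprocal of the Lipschitz constant of the gradient of the squared-error term, and the design matrix is built with orthogonal columns ([math]\displaystyle{ X^T X = n I }[/math]) purely so that the result can be compared against the soft-thresholding formula that holds in that special case.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=200):
    """Proximal gradient descent on (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)         # gradient of the smooth term
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic design with orthogonal columns and a sparse true weight vector.
rng = np.random.default_rng(0)
n, d = 50, 10
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q                                  # X^T X = n * I
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.2
w_lasso = lasso_ista(X, y, lam)
w_reference = soft_threshold(X.T @ y / n, lam / 2.0)  # valid only for this design
print(w_lasso)
```

Entries of the least-squares solution smaller in magnitude than the threshold are set exactly to zero, which is the sparsity behaviour described above.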
ℓ_{0} Penalization
 [math]\displaystyle{ \frac{1}{n}\|Y-\operatorname Xw\|^2_2+\lambda \|w\|_0 \rightarrow \min_{w \in \mathbb{R}^d} }[/math]
The most extreme way to enforce sparsity is to say that the actual magnitude of the coefficients of [math]\displaystyle{ w }[/math] does not matter; rather, the only thing that determines the complexity of [math]\displaystyle{ w }[/math] is the number of nonzero entries. This corresponds to setting [math]\displaystyle{ R(w) }[/math] to be the [math]\displaystyle{ \ell_0 }[/math] norm of [math]\displaystyle{ w }[/math]. This regularization function, while attractive for the sparsity that it guarantees, is very difficult to solve because doing so requires optimization of a function that is not even weakly convex. Lasso regression is the minimal possible relaxation of [math]\displaystyle{ \ell_0 }[/math] penalization that yields a weakly convex optimization problem.
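The combinatorial nature of this problem can be seen in a brute-force sketch that enumerates every possible support and fits ordinary least squares on each. The function name and data are hypothetical, and the approach is only feasible for very small [math]\displaystyle{ d }[/math], since the loop visits [math]\displaystyle{ 2^d - 1 }[/math] non-empty subsets.

```python
import itertools
import numpy as np

def best_subset(X, y, lam):
    """Exhaustive search minimizing (1/n)||y - Xw||^2 + lam * (number of nonzeros)."""
    n, d = X.shape
    best_w, best_obj = np.zeros(d), np.mean(y ** 2)   # start from the empty support
    for k in range(1, d + 1):
        for support in itertools.combinations(range(d), k):
            cols = list(support)
            w_s, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            obj = np.mean((y - X[:, cols] @ w_s) ** 2) + lam * k
            if obj < best_obj:
                best_obj = obj
                best_w = np.zeros(d)
                best_w[cols] = w_s
    return best_w, best_obj

# Synthetic data with two relevant variables out of six (made up for illustration).
rng = np.random.default_rng(0)
n, d = 30, 6
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_l0, objective = best_subset(X, y, lam=0.1)
print(np.flatnonzero(w_l0))
```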
Elastic net
For any nonnegative [math]\displaystyle{ \lambda_1 }[/math] and [math]\displaystyle{ \lambda_2 }[/math] the objective has the following form:
 [math]\displaystyle{ \frac{1}{n}\|Y-\operatorname Xw\|^2_2+\lambda_1 \sum_{j=1}^d |w_j| + \lambda_2 \sum_{j=1}^d w_j^2 \rightarrow \min_{w \in \mathbb{R}^d} }[/math]
Let [math]\displaystyle{ \alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2} }[/math]; then the solution of the minimization problem is described as:
 [math]\displaystyle{ \frac{1}{n}\|Y-\operatorname Xw\|^2_2 \rightarrow \min_{w \in \mathbb{R}^d} \text{ s.t. } (1-\alpha)\|w\|_1 + \alpha \|w\|_2^2 \leq t }[/math] for some [math]\displaystyle{ t }[/math].
The function [math]\displaystyle{ (1-\alpha)\|w\|_1 + \alpha \|w\|_2^2 }[/math] is called the Elastic Net penalty function.
When [math]\displaystyle{ \alpha = 1 }[/math], elastic net becomes ridge regression, whereas when [math]\displaystyle{ \alpha = 0 }[/math] it becomes Lasso. For all [math]\displaystyle{ \alpha \in [0,1) }[/math] the Elastic Net penalty function does not have a first derivative at 0, and it is strictly convex for all [math]\displaystyle{ \alpha \gt 0 }[/math], so it combines properties of both lasso regression and ridge regression.
One of the main properties of the Elastic Net is that it can select groups of correlated variables. The difference between the weights of variables [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] is bounded by:
 [math]\displaystyle{ |w^{*}_i(\lambda_1, \lambda_2) - w^{*}_j(\lambda_1, \lambda_2)| \leq \frac{\sum_{k=1}^n |y_k|}{\lambda_2}\sqrt{2(1-\rho_{ij})} }[/math], where [math]\displaystyle{ \rho_{ij} = x_i^T x_j }[/math].^{[2]}
If [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] are highly correlated ( [math]\displaystyle{ \rho_{ij} \rightarrow 1 }[/math]), the weights are very close. In the case of negatively correlated variables ( [math]\displaystyle{ \rho_{ij} \rightarrow -1 }[/math]), [math]\displaystyle{ -x_j }[/math] can be taken in place of [math]\displaystyle{ x_j }[/math]. To summarize, for highly correlated variables the weights tend to be equal, up to a sign in the case of negatively correlated variables.
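The grouping effect can be illustrated in the limiting case of perfectly correlated variables. The sketch below minimizes the elastic net objective with proximal gradient descent (an illustrative solver choice; the function names, data, and parameter values are assumptions). When two columns of the design matrix are identical, their weights coincide, and when one is the negative of the other, the weights are equal up to sign.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def elastic_net(X, y, lam1, lam2, n_iter=2000):
    """Proximal gradient on (1/n)||y - Xw||^2 + lam1 * ||w||_1 + lam2 * ||w||_2^2."""
    n, d = X.shape
    # Lipschitz constant of the gradient of the smooth part (loss plus l2 term).
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / n + 2.0 * lam2
    step = 1.0 / lipschitz
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y) + 2.0 * lam2 * w
        w = soft_threshold(w - step * grad, step * lam1)
    return w

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
n = 50
x0 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x0 + 0.1 * rng.normal(size=n)

X_pos = np.column_stack([x0, x0, x2])    # first two variables identical
X_neg = np.column_stack([x0, -x0, x2])   # second variable negated

w_pos = elastic_net(X_pos, y, lam1=0.1, lam2=0.5)
w_neg = elastic_net(X_neg, y, lam1=0.1, lam2=0.5)
print(w_pos, w_neg)
```

Because the objective is strictly convex whenever `lam2` is positive, the equal split of weight between the duplicated variables is the unique minimizer rather than an artifact of the solver.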
Partial list of RLS methods
The following is a list of possible choices of the regularization function [math]\displaystyle{ R(\cdot) }[/math], along with the name for each one, the corresponding prior if there is a simple one, and ways for computing the solution to the resulting optimization problem.
Name | Regularization function | Corresponding prior | Methods for solving
--- | --- | --- | ---
Tikhonov regularization | [math]\displaystyle{ \lVert w \rVert_2^2 }[/math] | Normal | Closed form
Lasso regression | [math]\displaystyle{ \lVert w \rVert_1 }[/math] | Laplace | Proximal gradient descent, least angle regression
[math]\displaystyle{ \ell_0 }[/math] penalization | [math]\displaystyle{ \lVert w \rVert_0 }[/math] | – | Forward selection, Backward elimination, use of priors such as spike and slab
Elastic nets | [math]\displaystyle{ \beta \lVert w \rVert_1 + (1-\beta) \lVert w \rVert_2^2 }[/math] | Normal and Laplace mixture | Proximal gradient descent
Total variation regularization | [math]\displaystyle{ \sum_{j=1}^{d-1} \lvert w_{j+1} - w_j \rvert }[/math] | – | Split–Bregman method, among others
See also
 Least squares
 Regularization in mathematics.
 Generalization error, one of the reasons regularization is used.
 Tikhonov regularization
 Lasso regression
 Elastic net regularization
 Least-angle regression
References
 ↑ Tibshirani, Robert (1996). "Regression shrinkage and selection via the lasso". Journal of the Royal Statistical Society, Series B 58: pp. 266–288. https://web.stanford.edu/~hastie/Papers/elasticnet.pdf.
 ↑ Zou, Hui; Hastie, Trevor (2005). "Regularization and Variable Selection via the Elastic Net". Journal of the Royal Statistical Society, Series B 67 (2): pp. 301–320. https://web.stanford.edu/~hastie/Papers/elasticnet.pdf.
External links
 http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf Regularization and Variable Selection via the Elastic Net (presentation)
 Regularized Least Squares and Support Vector Machines (presentation)
 Regularized Least Squares (presentation)
Original source: https://en.wikipedia.org/wiki/Regularized least squares.