Matrix regularization

From HandWiki

In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over

[math]\displaystyle{ \min_x \|Ax - y\|^2 + \lambda \|x\|^2 }[/math]

to find a vector [math]\displaystyle{ x }[/math] that is a stable solution to the regression problem. When the system is described by a matrix rather than a vector, this problem can be written as

[math]\displaystyle{ \min_X \|AX - Y\|^2 + \lambda \|X\|^2, }[/math]

where the vector norm enforcing a regularization penalty on [math]\displaystyle{ x }[/math] has been extended to a matrix norm on [math]\displaystyle{ X }[/math].

Matrix regularization has applications in matrix completion, multivariate regression, and multi-task learning. Ideas of feature and group selection can also be extended to matrices, and these can be generalized to the nonparametric case of multiple kernel learning.

Basic definition

Consider a matrix [math]\displaystyle{ W }[/math] to be learned from a set of examples, [math]\displaystyle{ S=(X_i^t,y_i^t) }[/math], where [math]\displaystyle{ i }[/math] goes from [math]\displaystyle{ 1 }[/math] to [math]\displaystyle{ n }[/math], and [math]\displaystyle{ t }[/math] goes from [math]\displaystyle{ 1 }[/math] to [math]\displaystyle{ T }[/math]. Let each input matrix [math]\displaystyle{ X_i }[/math] be [math]\displaystyle{ \in \mathbb{R}^{DT} }[/math], and let [math]\displaystyle{ W }[/math] be of size [math]\displaystyle{ D \times T }[/math]. A general model for the output [math]\displaystyle{ y }[/math] can be posed as

[math]\displaystyle{ y_i^t = \langle W, X_i^t \rangle_F, }[/math]

where the inner product is the Frobenius inner product. For different applications the matrices [math]\displaystyle{ X_i }[/math] will have different forms,[1] but for each of these the optimization problem to infer [math]\displaystyle{ W }[/math] can be written as

[math]\displaystyle{ \min_{W \in \mathcal{H}} E(W) + R(W), }[/math]

where [math]\displaystyle{ E }[/math] defines the empirical error for a given [math]\displaystyle{ W }[/math], and [math]\displaystyle{ R(W) }[/math] is a matrix regularization penalty. The function [math]\displaystyle{ R(W) }[/math] is typically chosen to be convex and is often selected to enforce sparsity (using [math]\displaystyle{ \ell^1 }[/math]-norms) and/or smoothness (using [math]\displaystyle{ \ell^2 }[/math]-norms). Finally, [math]\displaystyle{ W }[/math] is in the space of matrices [math]\displaystyle{ \mathcal{H} }[/math] with Frobenius inner product [math]\displaystyle{ \langle \dots \rangle_F }[/math].

General applications

Matrix completion

In the problem of matrix completion, the matrix [math]\displaystyle{ X_i^t }[/math] takes the form

[math]\displaystyle{ X_i^t = e_t \otimes e_i', }[/math]

where [math]\displaystyle{ (e_t)_t }[/math] and [math]\displaystyle{ (e_i')_i }[/math] are the canonical basis in [math]\displaystyle{ \mathbb{R}^T }[/math] and [math]\displaystyle{ \mathbb{R}^D }[/math]. In this case the role of the Frobenius inner product is to select individual elements [math]\displaystyle{ w_i^t }[/math] from the matrix [math]\displaystyle{ W }[/math]. Thus, the output [math]\displaystyle{ y }[/math] is a sampling of entries from the matrix [math]\displaystyle{ W }[/math].

The problem of reconstructing [math]\displaystyle{ W }[/math] from a small set of sampled entries is possible only under certain restrictions on the matrix, and these restrictions can be enforced by a regularization function. For example, it might be assumed that [math]\displaystyle{ W }[/math] is low-rank, in which case the regularization penalty can take the form of a nuclear norm.[2]

[math]\displaystyle{ R(W) = \lambda \|W\|_* = \lambda \sum |\sigma_i|, }[/math]

where [math]\displaystyle{ \sigma_i }[/math], with [math]\displaystyle{ i }[/math] from [math]\displaystyle{ 1 }[/math] to [math]\displaystyle{ \min D, T }[/math], are the singular values of [math]\displaystyle{ W }[/math].

Multivariate regression

Models used in multivariate regression are parameterized by a matrix of coefficients. In the Frobenius inner product above, each matrix [math]\displaystyle{ X }[/math] is

[math]\displaystyle{ X_i^t=e_t\otimes x_i \, }[/math]

such that the output of the inner product is the dot product of one row of the input with one column of the coefficient matrix. The familiar form of such models is

[math]\displaystyle{ Y=XW+b \, }[/math]

Many of the vector norms used in single variable regression can be extended to the multivariate case. One example is the squared Frobenius norm, which can be viewed as an [math]\displaystyle{ \ell^2 }[/math]-norm acting either entrywise, or on the singular values of the matrix:

[math]\displaystyle{ R(W)=\lambda \|W\|_F^2=\lambda \sum\sum |w_{ij}|^2=\lambda \operatorname{Tr}(W^*W)=\lambda \sum \sigma_i^2. }[/math]

In the multivariate case the effect of regularizing with the Frobenius norm is the same as the vector case; very complex models will have larger norms, and, thus, will be penalized more.

Multi-task learning

The setup for multi-task learning is almost the same as the setup for multivariate regression. The primary difference is that the input variables are also indexed by task (columns of [math]\displaystyle{ Y }[/math]). The representation with the Frobenius inner product is then

[math]\displaystyle{ X_i^t=e_t\otimes x_i^t. }[/math]

The role of matrix regularization in this setting can be the same as in multivariate regression, but matrix norms can also be used to couple learning problems across tasks. In particular, note that for the optimization problem

[math]\displaystyle{ \min_W \|XW-Y\|_2^2 + \lambda \|W\|_2^2 }[/math]

the solutions corresponding to each column of [math]\displaystyle{ Y }[/math] are decoupled. That is, the same solution can be found by solving the joint problem, or by solving an isolated regression problem for each column. The problems can be coupled by adding an additional regularization penalty on the covariance of solutions

[math]\displaystyle{ \min_{W,\Omega} \|XW-Y\|_2^2 + \lambda_1 \|W\|_2^2 + \lambda_2 \operatorname{Tr}(W^T \Omega^{-1} W) }[/math]

where [math]\displaystyle{ \Omega }[/math] models the relationship between tasks. This scheme can be used to both enforce similarity of solutions across tasks, and to learn the specific structure of task similarity by alternating between optimizations of [math]\displaystyle{ W }[/math] and [math]\displaystyle{ \Omega }[/math].[3] When the relationship between tasks is known to lie on a graph, the Laplacian matrix of the graph can be used to couple the learning problems.

Spectral regularization

Regularization by spectral filtering has been used to find stable solutions to problems such as those discussed above by addressing ill-posed matrix inversions (see for example Filter function for Tikhonov regularization). In many cases the regularization function acts on the input (or kernel) to ensure a bounded inverse by eliminating small singular values, but it can also be useful to have spectral norms that act on the matrix that is to be learned.

There are a number of matrix norms that act on the singular values of the matrix. Frequently used examples include the Schatten p-norms, with p = 1 or 2. For example, matrix regularization with a Schatten 1-norm, also called the nuclear norm, can be used to enforce sparsity in the spectrum of a matrix. This has been used in the context of matrix completion when the matrix in question is believed to have a restricted rank.[2] In this case the optimization problem becomes:

[math]\displaystyle{ \min \|W\|_* }[/math] subject to [math]\displaystyle{ W_{i,j}=Y_{ij}. }[/math]

Spectral Regularization is also used to enforce a reduced rank coefficient matrix in multivariate regression.[4] In this setting, a reduced rank coefficient matrix can be found by keeping just the top [math]\displaystyle{ n }[/math] singular values, but this can be extended to keep any reduced set of singular values and vectors.

Structured sparsity

Sparse optimization has become the focus of much research interest as a way to find solutions that depend on a small number of variables (see e.g. the Lasso method). In principle, entry-wise sparsity can be enforced by penalizing the entry-wise [math]\displaystyle{ \ell^0 }[/math]-norm of the matrix, but the [math]\displaystyle{ \ell^0 }[/math]-norm is not convex. In practice this can be implemented by convex relaxation to the [math]\displaystyle{ \ell^1 }[/math]-norm. While entry-wise regularization with an [math]\displaystyle{ \ell^1 }[/math]-norm will find solutions with a small number of nonzero elements, applying an [math]\displaystyle{ \ell^1 }[/math]-norm to different groups of variables can enforce structure in the sparsity of solutions.[5]

The most straightforward example of structured sparsity uses the [math]\displaystyle{ \ell_{p,q} }[/math] norm with [math]\displaystyle{ p=2 }[/math] and [math]\displaystyle{ q=1 }[/math]:

[math]\displaystyle{ \|W\|_{2,1}=\sum \|w_i\|_2. }[/math]

For example, the [math]\displaystyle{ \ell_{2,1} }[/math] norm is used in multi-task learning to group features across tasks, such that all the elements in a given row of the coefficient matrix can be forced to zero as a group.[6] The grouping effect is achieved by taking the [math]\displaystyle{ \ell^2 }[/math]-norm of each row, and then taking the total penalty to be the sum of these row-wise norms. This regularization results in rows that will tend to be all zeros, or dense. The same type of regularization can be used to enforce sparsity column-wise by taking the [math]\displaystyle{ \ell^2 }[/math]-norms of each column.

More generally, the [math]\displaystyle{ \ell_{2,1} }[/math] norm can be applied to arbitrary groups of variables:

[math]\displaystyle{ R(W)=\lambda \sum_g^G \sqrt{\sum_j^{|G_g|} |w_g^j|^2}=\lambda \sum_g^G \|w_g\|_g }[/math]

where the index [math]\displaystyle{ g }[/math] is across groups of variables, and [math]\displaystyle{ |G_g| }[/math] indicates the cardinality of group [math]\displaystyle{ g }[/math].

Algorithms for solving these group sparsity problems extend the more well-known Lasso and group Lasso methods by allowing overlapping groups, for example, and have been implemented via matching pursuit:[7] and proximal gradient methods.[8] By writing the proximal gradient with respect to a given coefficient, [math]\displaystyle{ w_g^i }[/math], it can be seen that this norm enforces a group-wise soft threshold[1]

[math]\displaystyle{ \operatorname{prox}_{\lambda,R_g}(w_g)^i=\left(w_g^i-\lambda \frac{w_g^i}{\|w_g\|_g}\right)\mathbf{1}_{\|w_g\|_g\geq \lambda}. }[/math]

where [math]\displaystyle{ \mathbf{1}_{\|w_g\|_g\geq \lambda} }[/math] is the indicator function for group norms [math]\displaystyle{ \geq \lambda }[/math].

Thus, using [math]\displaystyle{ \ell_{2,1} }[/math] norms it is straightforward to enforce structure in the sparsity of a matrix either row-wise, column-wise, or in arbitrary blocks. By enforcing group norms on blocks in multivariate or multi-task regression, for example, it is possible to find groups of input and output variables, such that defined subsets of output variables (columns in the matrix [math]\displaystyle{ Y }[/math]) will depend on the same sparse set of input variables.

Multiple kernel selection

The ideas of structured sparsity and feature selection can be extended to the nonparametric case of multiple kernel learning.[9] This can be useful when there are multiple types of input data (color and texture, for example) with different appropriate kernels for each, or when the appropriate kernel is unknown. If there are two kernels, for example, with feature maps [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] that lie in corresponding reproducing kernel Hilbert spaces [math]\displaystyle{ \mathcal{H_A},\mathcal{H_B} }[/math], then a larger space, [math]\displaystyle{ \mathcal{H_D} }[/math], can be created as the sum of two spaces:

[math]\displaystyle{ \mathcal{H_D}: f=h+h'; h\in \mathcal{H_A}, h'\in \mathcal{H_B} }[/math]

assuming linear independence in [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math]. In this case the [math]\displaystyle{ \ell_{2,1} }[/math]-norm is again the sum of norms:

[math]\displaystyle{ \|f\|_{\mathcal{H_D},1}=\|h\|_{\mathcal{H_A}}+\|h'\|_{\mathcal{H_B}} }[/math]

Thus, by choosing a matrix regularization function as this type of norm, it is possible to find a solution that is sparse in terms of which kernels are used, but dense in the coefficient of each used kernel. Multiple kernel learning can also be used as a form of nonlinear variable selection, or as a model aggregation technique (e.g. by taking the sum of squared norms and relaxing sparsity constraints). For example, each kernel can be taken to be the Gaussian kernel with a different width.

See also

References

  1. 1.0 1.1 Rosasco, Lorenzo; Poggio, Tomaso (December 2014). "A Regularization Tour of Machine Learning". MIT-9.520 Lectures Notes (Manuscript). https://www.mit.edu/~9.520/fall15/Classes/learning_problem.html. 
  2. 2.0 2.1 Candès, Emmanuel J.; Recht, Benjamin (2009). "Exact Matrix Completion via Convex Optimization". Foundations of Computational Mathematics 9 (6): 717–772. doi:10.1007/s10208-009-9045-5. 
  3. Zhang; Yeung (2012). "A Convex Formulation for Learning Task Relationships in Multi-Task Learning". Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010). Bibcode2012arXiv1203.3536Z. 
  4. Izenman, Alan J. (1975). "Reduced Rank Regression for the Multivariate Linear Model". Journal of Multivariate Analysis 5 (2): 248–264. doi:10.1016/0047-259X(75)90042-1. 
  5. Kakade; Shalev-Shwartz; Tewari (2012). "Regularization Techniques for Learning with Matrices". Journal of Machine Learning Research 13: 1865–1890. http://www.jmlr.org/papers/v13/kakade12a.html. 
  6. Argyriou, A.; Evgeniou, T.; Pontil, M. (2008). "Convex multi-task feature learning". Machine Learning 73 (3): 243–272. doi:10.1007/s10994-007-5040-8. 
  7. Huang; Zhang; Metaxas (2011). "Learning with Structured Sparsity". Journal of Machine Learning Research 12: 3371–3412. http://www.jmlr.org/papers/v12/huang11b.html. 
  8. Chen, Xi et al. (2012). "Smoothing Proximal Gradient Method for General Structured Sparse Regression". Annals of Applied Statistics 6 (2): 719–752. doi:10.1214/11-AOAS514. 
  9. Sonnenburg; Ratsch; Schafer; Scholkopf (2006). "Large Scale Multiple Kernel Learning". Journal of Machine Learning Research 7: 1531–1565. http://www.jmlr.org/papers/v7/sonnenburg06a.html.