L1-norm principal component analysis

From HandWiki
L1-PCA compared with PCA. Nominal data (blue points); outlier (red point); PC (black line); L1-PC (red line); nominal maximum-variance line (dotted line).

L1-norm principal component analysis (L1-PCA) is a general method for multivariate data analysis.[1] L1-PCA is often preferred over standard L2-norm principal component analysis (PCA) when the analyzed data may contain outliers (faulty values or corruptions).[2][3][4]

Both L1-PCA and standard PCA seek a collection of orthogonal directions (principal components) that define a subspace wherein data representation is maximized according to the selected criterion.[5][6][7] Standard PCA quantifies data representation as the aggregate of the L2-norm of the data point projections into the subspace, or equivalently the aggregate Euclidean distance of the original points from their subspace-projected representations. L1-PCA uses instead the aggregate of the L1-norm of the data point projections into the subspace.[8] In PCA and L1-PCA, the number of principal components (PCs) is lower than the rank of the analyzed matrix, which coincides with the dimensionality of the space defined by the original data points. Therefore, PCA or L1-PCA are commonly employed for dimensionality reduction for the purpose of data denoising or compression. Among the advantages of standard PCA that contributed to its high popularity are low-cost computational implementation by means of singular-value decomposition (SVD)[9] and statistical optimality when the data set is generated by a true multivariate normal data source.

However, in modern big data sets, data often include corrupted, faulty points, commonly referred to as outliers.[10] Standard PCA is known to be sensitive to outliers, even when they appear as a small fraction of the processed data.[11] The reason is that the L2-norm formulation of L2-PCA places squared emphasis on the magnitude of each coordinate of each data point, ultimately overemphasizing peripheral points, such as outliers. On the other hand, following an L1-norm formulation, L1-PCA places linear emphasis on the coordinates of each data point, effectively restraining outliers.[12]

Formulation

Consider any matrix [math]\displaystyle{ \mathbf X = [\mathbf x_1, \mathbf x_2, \ldots, \mathbf x_N] \in \mathbb R^{D \times N} }[/math] consisting of [math]\displaystyle{ N }[/math] [math]\displaystyle{ D }[/math]-dimensional data points. Define [math]\displaystyle{ r=rank(\mathbf X) }[/math]. For integer [math]\displaystyle{ K }[/math] such that [math]\displaystyle{ 1 \leq K \lt r }[/math], L1-PCA is formulated as:[1]

[math]\displaystyle{ \begin{align} &\underset{\mathbf Q=[\mathbf q_1, \mathbf q_2, \ldots, \mathbf q_K] \in \mathbb R^{D \times K}}{\max}~~\| \mathbf X^\top \mathbf Q\|_1\\ &\text{subject to}~~ \mathbf Q^\top \mathbf Q=\mathbf I_K. \end{align} }[/math]

 

 

 

 

(1)

For [math]\displaystyle{ K=1 }[/math], (1) simplifies to finding the L1-norm principal component (L1-PC) of [math]\displaystyle{ \mathbf X }[/math] by

[math]\displaystyle{ \begin{align} &\underset{\mathbf q \in \mathbb R^{D \times 1}}{\max}~~\| \mathbf X^\top \mathbf q\|_1\\ &\text{subject to}~~ \| \mathbf q\|_2 =1. \end{align} }[/math]

 

 

 

 

(2)

In (1)-(2), L1-norm [math]\displaystyle{ \| \cdot \|_1 }[/math] returns the sum of the absolute entries of its argument and L2-norm [math]\displaystyle{ \| \cdot \|_2 }[/math] returns the sum of the squared entries of its argument. If one substitutes [math]\displaystyle{ \| \cdot \|_1 }[/math] in (2) by the Frobenius/L2-norm [math]\displaystyle{ \| \cdot \|_F }[/math], then the problem becomes standard PCA and it is solved by the matrix [math]\displaystyle{ \mathbf Q }[/math] that contains the [math]\displaystyle{ K }[/math] dominant singular vectors of [math]\displaystyle{ \mathbf X }[/math] (i.e., the singular vectors that correspond to the [math]\displaystyle{ K }[/math] highest singular values).

The maximization metric in (2) can be expanded as

[math]\displaystyle{ \| \mathbf X^\top \mathbf Q\|_1=\sum_{k=1}^K \sum_{n=1}^N |\mathbf x_n^\top \mathbf q_k|. }[/math]

 

 

 

 

(3)

Solution

For any matrix [math]\displaystyle{ \mathbf A \in \mathbb R^{m \times n} }[/math] with [math]\displaystyle{ m \geq n }[/math], define [math]\displaystyle{ \Phi(\mathbf A) }[/math] as the nearest (in the L2-norm sense) matrix to [math]\displaystyle{ \mathbf A }[/math] that has orthonormal columns. That is, define

[math]\displaystyle{ \begin{align} \Phi(\mathbf A) = & \underset{\mathbf Q \in \mathbb R^{m \times n}}{\text{argmin}}~~\| \mathbf A - \mathbf Q\|_F\\ &\text{subject to}~~ \mathbf Q^\top \mathbf Q=\mathbf I_n. \end{align} }[/math]

 

 

 

 

(4)

Procrustes Theorem[13][14] states that if [math]\displaystyle{ \mathbf A }[/math] has SVD [math]\displaystyle{ \mathbf U_{m \times n} \boldsymbol \Sigma_{n \times n} \mathbf V_{n \times n}^\top }[/math], then [math]\displaystyle{ \Phi(\mathbf A)=\mathbf U \mathbf V^\top }[/math].

Markopoulos, Karystinos, and Pados[1] showed that, if [math]\displaystyle{ \mathbf B_{\text{BNM}} }[/math] is the exact solution to the binary nuclear-norm maximization (BNM) problem

[math]\displaystyle{ \begin{align} \underset{\mathbf B \in \{ \pm 1\}^{N \times K}}{\text{max}}~~\| \mathbf X \mathbf B\|_*^2, \end{align} }[/math]

 

 

 

 

(5)

then

[math]\displaystyle{ \begin{align} \mathbf Q_{\text{L1}} = \Phi(\mathbf X\mathbf B_{\text{BNM}}) \end{align} }[/math]

 

 

 

 

(6)

is the exact solution to L1-PCA in (2). The nuclear-norm [math]\displaystyle{ \| \cdot \|_* }[/math] in (2) returns the summation of the singular values of its matrix argument and can be calculated by means of standard SVD. Moreover, it holds that, given the solution to L1-PCA, [math]\displaystyle{ \mathbf Q_{\text{L1}} }[/math], the solution to BNM can be obtained as

[math]\displaystyle{ \begin{align} \mathbf B_{\text{BNM}} = \text{sgn}(\mathbf X^\top \mathbf Q_{\text{L1}}) \end{align} }[/math]

 

 

 

 

(7)

where [math]\displaystyle{ \text{sgn}(\cdot) }[/math] returns the [math]\displaystyle{ \{\pm 1\} }[/math]-sign matrix of its matrix argument (with no loss of generality, we can consider [math]\displaystyle{ \text{sgn}(0)=1 }[/math]). In addition, it follows that [math]\displaystyle{ \| \mathbf X^\top \mathbf Q_{\text{L1}}\|_1 = \| \mathbf X \mathbf B_{\text{BNM}}\|_* }[/math]. BNM in (5) is a combinatorial problem over antipodal binary variables. Therefore, its exact solution can be found through exhaustive evaluation of all [math]\displaystyle{ 2^{NK} }[/math] elements of its feasibility set, with asymptotic cost [math]\displaystyle{ \mathcal O(2^{NK}) }[/math]. Therefore, L1-PCA can also be solved, through BNM, with cost [math]\displaystyle{ \mathcal O(2^{NK}) }[/math] (exponential in the product of the number of data points with the number of the sought-after components). It turns out that L1-PCA can be solved optimally (exactly) with polynomial complexity in [math]\displaystyle{ N }[/math] for fixed data dimension [math]\displaystyle{ D }[/math], [math]\displaystyle{ \mathcal{O}(N^{rK-K+1}) }[/math].[1]

For the special case of [math]\displaystyle{ K=1 }[/math] (single L1-PC of [math]\displaystyle{ \mathbf X }[/math]), BNM takes the binary-quadratic-maximization (BQM) form

[math]\displaystyle{ \begin{align} & \underset{\mathbf b \in \{ \pm 1\}^{N \times 1}}{\text{max}}~~ \mathbf b^\top \mathbf X^\top \mathbf X \mathbf b. \end{align} }[/math]

 

 

 

 

(8)

The transition from (5) to (8) for [math]\displaystyle{ K=1 }[/math] holds true, since the unique singular value of [math]\displaystyle{ \mathbf X \mathbf b }[/math] is equal to [math]\displaystyle{ \| \mathbf X \mathbf b\|_2 = \sqrt{\mathbf b^\top \mathbf X^\top \mathbf X \mathbf b} }[/math], for every [math]\displaystyle{ \mathbf b }[/math]. Then, if [math]\displaystyle{ \mathbf b_{\text{BNM}} }[/math] is the solution to BQM in (7), it holds that

[math]\displaystyle{ \begin{align} \mathbf q_{\text{L1}} = \Phi(\mathbf X \mathbf b_{\text{BNM}}) = \frac{\mathbf X \mathbf b_{\text{BNM}}}{\| \mathbf X \mathbf b_{\text{BNM}}\|_2} \end{align} }[/math]

 

 

 

 

(9)

is the exact L1-PC of [math]\displaystyle{ \mathbf X }[/math], as defined in (1). In addition, it holds that [math]\displaystyle{ \mathbf b_{\text{BNM}} = \text{sgn}(\mathbf X^\top \mathbf q_{\text{L1}}) }[/math] and [math]\displaystyle{ \| \mathbf X^\top \mathbf q_{\text{L1}}\|_1 = \| \mathbf X \mathbf b_{\text{BNM}}\|_2 }[/math].

Algorithms

Exact solution of exponential complexity

As shown above, the exact solution to L1-PCA can be obtained by the following two-step process:

1. Solve the problem in (5) to obtain [math]\displaystyle{ \mathbf B_{\text{BNM}} }[/math].
2. Apply SVD on [math]\displaystyle{ \mathbf X\mathbf B_{\text{BNM}} }[/math] to obtain [math]\displaystyle{ \mathbf Q_{\text{L1}} }[/math].

BNM in (5) can be solved by exhaustive search over the domain of [math]\displaystyle{ \mathbf B }[/math] with cost [math]\displaystyle{ \mathcal{O}(2^{NK}) }[/math].

Exact solution of polynomial complexity

Also, L1-PCA can be solved optimally with cost [math]\displaystyle{ \mathcal{O}(N^{rK-K+1}) }[/math], when [math]\displaystyle{ r=rank(\mathbf X) }[/math] is constant with respect to [math]\displaystyle{ N }[/math] (always true for finite data dimension [math]\displaystyle{ D }[/math]).[1][15]

Approximate efficient solvers

In 2008, Kwak[12] proposed an iterative algorithm for the approximate solution of L1-PCA for [math]\displaystyle{ K=1 }[/math]. This iterative method was later generalized for [math]\displaystyle{ K\gt 1 }[/math] components.[16] Another approximate efficient solver was proposed by McCoy and Tropp[17] by means of semi-definite programming (SDP). Most recently, L1-PCA (and BNM in (5)) were solved efficiently by means of bit-flipping iterations (L1-BF algorithm).[8][18]

L1-BF algorithm

 1  function L1BF([math]\displaystyle{ \mathbf X }[/math], [math]\displaystyle{ K }[/math]):
 2      Initialize [math]\displaystyle{ \mathbf B^{(0)} \in \{\pm 1\}^{N \times K} }[/math] and [math]\displaystyle{ \mathcal L \leftarrow \{1,2,\ldots, NK\} }[/math]
 3      Set [math]\displaystyle{ t \leftarrow 0 }[/math] and [math]\displaystyle{ \omega \leftarrow \| \mathbf X \mathbf B^{(0)} \|_* }[/math]
 4      Until termination (or [math]\displaystyle{ T }[/math] iterations)
 5          [math]\displaystyle{ \mathbf B \leftarrow \mathbf B^{(t)} }[/math], [math]\displaystyle{ t' \leftarrow t }[/math]
 6          For [math]\displaystyle{ x \in \mathcal L }[/math]
 7              [math]\displaystyle{ k \leftarrow \lceil \frac{x}{N} \rceil }[/math], [math]\displaystyle{ n \leftarrow x-N(k-1) }[/math]
 8              [math]\displaystyle{ [\mathbf B]_{n,k} \leftarrow - [\mathbf B]_{n,k} }[/math]                // flip bit
 9              [math]\displaystyle{ a(n,k) \leftarrow \| \mathbf X \mathbf B \|_* }[/math]               // calculated by SVD or faster (see[8])
10              if [math]\displaystyle{ a(n,k)\gt \omega }[/math]
11                  [math]\displaystyle{ \mathbf B^{(t)} \leftarrow  \mathbf B }[/math], [math]\displaystyle{ t' \leftarrow t+1 }[/math]
12                  [math]\displaystyle{ \omega \leftarrow  a(n,k) }[/math]
13              end
14              if [math]\displaystyle{ t'=t }[/math]                    // no bit was flipped
15                  if [math]\displaystyle{ \mathcal L = \{1,2, \ldots, NK\} }[/math]
16                      terminate
17                  else
18                      [math]\displaystyle{ \mathcal L \leftarrow \{1,2, \ldots, NK\} }[/math]

The computational cost of L1-BF is [math]\displaystyle{ \mathcal O (ND min\{N,D\} + N^2K^2(K^2 + r)) }[/math].[8]

Complex data

L1-PCA has also been generalized to process complex data. For complex L1-PCA, two efficient algorithms were proposed in 2018.[19]

Tensor data

L1-PCA has also been extended for the analysis of tensor data, in the form of L1-Tucker, the L1-norm robust analogous of standard Tucker decomposition.[20] Two algorithms for the solution of L1-Tucker are L1-HOSVD and L1-HOOI.[20][21][22]

Code

MATLAB code for L1-PCA is available at MathWorks[23] and other repositories.[18]

References

  1. 1.0 1.1 1.2 1.3 1.4 Markopoulos, Panos P.; Karystinos, George N.; Pados, Dimitris A. (October 2014). "Optimal Algorithms for L1-subspace Signal Processing". IEEE Transactions on Signal Processing 62 (19): 5046–5058. doi:10.1109/TSP.2014.2338077. Bibcode2014ITSP...62.5046M. 
  2. Barrodale, I. (1968). "L1 Approximation and the Analysis of Data". Applied Statistics 17 (1): 51–57. doi:10.2307/2985267. 
  3. Barnett, Vic; Lewis, Toby (1994). Outliers in statistical data (3. ed.). Chichester [u.a.]: Wiley. ISBN 978-0471930945. 
  4. Kanade, T.; Ke, Qifa (June 2005). "Robust L₁ Norm Factorization in the Presence of Outliers and Missing Data by Alternative Convex Programming". 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). 1. IEEE. pp. 739–746. doi:10.1109/CVPR.2005.309. ISBN 978-0-7695-2372-9. 
  5. Jolliffe, I.T. (2004). Principal component analysis (2nd ed.). New York: Springer. ISBN 978-0387954424. https://archive.org/details/principalcompone00joll_0. 
  6. Bishop, Christopher M. (2007). Pattern recognition and machine learning (Corr. printing. ed.). New York: Springer. ISBN 978-0-387-31073-2. 
  7. Pearson, Karl (8 June 2010). "On Lines and Planes of Closest Fit to Systems of Points in Space". The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11): 559–572. doi:10.1080/14786440109462720. https://zenodo.org/record/1430636. 
  8. 8.0 8.1 8.2 8.3 Markopoulos, Panos P.; Kundu, Sandipan; Chamadia, Shubham; Pados, Dimitris A. (15 August 2017). "Efficient L1-Norm Principal-Component Analysis via Bit Flipping". IEEE Transactions on Signal Processing 65 (16): 4252–4264. doi:10.1109/TSP.2017.2708023. Bibcode2017ITSP...65.4252M. 
  9. Golub, Gene H. (April 1973). "Some Modified Matrix Eigenvalue Problems". SIAM Review 15 (2): 318–334. doi:10.1137/1015032. 
  10. Barnett, Vic; Lewis, Toby (1994). Outliers in statistical data (3. ed.). Chichester [u.a.]: Wiley. ISBN 978-0471930945. 
  11. Candès, Emmanuel J.; Li, Xiaodong; Ma, Yi; Wright, John (1 May 2011). "Robust principal component analysis?". Journal of the ACM 58 (3): 1–37. doi:10.1145/1970392.1970395. 
  12. 12.0 12.1 Kwak, N. (September 2008). "Principal Component Analysis Based on L1-Norm Maximization". IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (9): 1672–1680. doi:10.1109/TPAMI.2008.114. PMID 18617723. 
  13. Eldén, Lars; Park, Haesun (1 June 1999). "A Procrustes problem on the Stiefel manifold". Numerische Mathematik 82 (4): 599–619. doi:10.1007/s002110050432. 
  14. Schönemann, Peter H. (March 1966). "A generalized solution of the orthogonal procrustes problem". Psychometrika 31 (1): 1–10. doi:10.1007/BF02289451. 
  15. Markopoulos, PP; Kundu, S; Chamadia, S; Tsagkarakis, N; Pados, DA (2018). "Outlier-Resistant Data Processing with L1-Norm Principal Component Analysis". Advances in Principal Component Analysis. pp. 121–135. doi:10.1007/978-981-10-6704-4_6. ISBN 978-981-10-6703-7. 
  16. Nie, F; Huang, H; Ding, C; Luo, Dijun; Wang, H (July 2011). "Robust principal component analysis with non-greedy l1-norm maximization". 22nd International Joint Conference on Artificial Intelligence: 1433–1438. 
  17. McCoy, Michael; Tropp, Joel A. (2011). "Two proposals for robust PCA using semidefinite programming". Electronic Journal of Statistics 5: 1123–1160. doi:10.1214/11-EJS636. 
  18. 18.0 18.1 Markopoulos, PP. "Software Repository". https://people.rit.edu/pxmeee/soft.html. [yes|permanent dead link|dead link}}]
  19. Tsagkarakis, Nicholas; Markopoulos, Panos P.; Sklivanitis, George; Pados, Dimitris A. (15 June 2018). "L1-Norm Principal-Component Analysis of Complex Data". IEEE Transactions on Signal Processing 66 (12): 3256–3267. doi:10.1109/TSP.2018.2821641. Bibcode2018ITSP...66.3256T. 
  20. 20.0 20.1 Chachlakis, Dimitris G.; Prater-Bennette, Ashley; Markopoulos, Panos P. (22 November 2019). "L1-norm Tucker Tensor Decomposition". IEEE Access 7: 178454–178465. doi:10.1109/ACCESS.2019.2955134. 
  21. Markopoulos, Panos P.; Chachlakis, Dimitris G.; Prater-Bennette, Ashley (21 February 2019). "L1-Norm Higher-Order Singular-Value Decomposition". 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP). pp. 1353–1357. doi:10.1109/GlobalSIP.2018.8646385. ISBN 978-1-7281-1295-4. 
  22. Markopoulos, Panos P.; Chachlakis, Dimitris G.; Papalexakis, Evangelos (April 2018). "The Exact Solution to Rank-1 L1-Norm TUCKER2 Decomposition". IEEE Signal Processing Letters 25 (4): 511–515. doi:10.1109/LSP.2018.2790901. Bibcode2018ISPL...25..511M. 
  23. "L1-PCA TOOLBOX". https://www.mathworks.com/matlabcentral/fileexchange/64855-l1-pca-toolbox.