
Robust PCA via Outlier Pursuit (1010.4237v2)

Published 20 Oct 2010 in cs.LG, cs.IT, math.IT, and stat.ML

Abstract: Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire points that are completely corrupted. We present an efficient convex optimization-based algorithm we call Outlier Pursuit, that under some mild assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems) recovers the exact optimal low-dimensional subspace, and identifies the corrupted points. Such identification of corrupted points that do not conform to the low-dimensional approximation, is of paramount interest in bioinformatics and financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm minimization, however, our results, setup, and approach, necessarily differ considerably from the existing line of work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column space of the uncorrupted matrix, rather than the exact matrix itself. In any problem where one seeks to recover a structure rather than the exact initial matrices, techniques developed thus far relying on certificates of optimality, will fail. We present an important extension of these methods, that allows the treatment of such problems.

Citations (757)

Summary

  • The paper introduces Outlier Pursuit, a convex optimization approach for robustly recovering low-rank structures and identifying corrupted columns.
  • It establishes exact recovery conditions based on the fraction of corrupted data and incoherence parameters, ensuring strong theoretical guarantees.
  • The method demonstrates noise resilience by providing bounded error approximations for both low-rank and sparse components in corrupted datasets.

Robust PCA via Outlier Pursuit: An In-Depth Analysis

Principal Component Analysis (PCA) is a cornerstone technique for dimensionality reduction, widely applied in diverse fields such as statistics, bioinformatics, and finance. However, its well-documented sensitivity to outliers limits its robustness and applicability in real-world scenarios. The paper "Robust PCA via Outlier Pursuit" by Huan Xu, Constantine Caramanis, and Sujay Sanghavi introduces a novel convex optimization approach to address this limitation, aptly titled Outlier Pursuit. This method aims to achieve exact recovery of the low-dimensional subspace and precise identification of corrupted points, given that certain conditions are met.

Problem Formulation

The core problem tackled by the paper is the decomposition of a data matrix M into a low-rank matrix L_0 and a column-sparse outlier matrix C_0. Formally, it is posited that

M = L_0 + C_0,

where L_0 is low-rank and C_0 is column-sparse, i.e., only a fraction of its columns are non-zero. The challenge is to recover the column space of L_0 and the indices of the non-zero columns of C_0 both exactly and efficiently, especially in the presence of numerous arbitrarily corrupted columns.
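This generative model is easy to simulate. The sketch below (dimensions, rank, and corruption fraction are illustrative choices, not taken from the paper) builds an observed matrix M = L_0 + C_0 with columns drawn from a low-dimensional subspace and a handful of arbitrarily corrupted columns:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 30, 60, 2          # ambient dimension, number of points, true rank
n_out = 5                    # number of corrupted columns

# Low-rank part: every column lies in an r-dimensional subspace.
L0 = rng.standard_normal((p, r)) @ rng.standard_normal((r, n))

# Column-sparse corruption: a few columns replaced by arbitrary vectors.
C0 = np.zeros((p, n))
outliers = rng.choice(n, size=n_out, replace=False)
C0[:, outliers] = 5.0 * rng.standard_normal((p, n_out))

M = L0 + C0                  # the observed, partially corrupted data
```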

Main Contributions

The paper makes several significant contributions:

  1. Convex Optimization Approach: Introducing Outlier Pursuit, a convex optimization problem using nuclear norm minimization for L_0 and a column-wise \ell_{1,2} norm for C_0:

    \begin{aligned}
    \text{Minimize:}\quad & \|L\|_* + \lambda \|C\|_{1,2} \\
    \text{Subject to:}\quad & M = L + C,
    \end{aligned}

    where \|L\|_* is the nuclear norm of L, and \|C\|_{1,2} is the sum of the \ell_2 norms of the columns of C.

  2. Exact Recovery Conditions: Establishing conditions under which exact recovery of the subspace and the outlier support is guaranteed. The primary condition relates the fraction of corrupted points \gamma to the incoherence parameter \mu of L_0:

    \frac{\gamma}{1 - \gamma} \leq \frac{c_1}{\mu r},

    where r is the rank of L_0 and c_1 = \frac{9}{121}.

  3. Noise Robustness: Extending the analysis to the case where M is additionally corrupted by a noise matrix N, and showing that the proposed method still approximately recovers the column space and outlier indices via the relaxed program:

    \text{Minimize:}\quad \|L\|_* + \lambda \|C\|_{1,2} \quad \text{Subject to:}\quad \|M - (L + C)\|_F \leq \varepsilon,

    where \varepsilon bounds the noise level.
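The equality-constrained program lends itself to a standard ADMM / augmented-Lagrangian scheme: singular-value thresholding handles the nuclear norm and column-wise shrinkage handles the \ell_{1,2} term. The sketch below is an illustrative solver, not the paper's implementation; the penalty schedule, iteration count, and the name `outlier_pursuit` are choices made here:

```python
import numpy as np

def outlier_pursuit(M, lam, iters=500):
    """Approximately solve  min ||L||_* + lam*||C||_{1,2}  s.t.  M = L + C."""
    p, n = M.shape
    L = np.zeros((p, n)); C = np.zeros((p, n)); Y = np.zeros((p, n))
    rho = 1.0 / max(np.linalg.norm(M, 2), 1e-12)   # initial penalty parameter

    for _ in range(iters):
        # L-update: singular-value thresholding at level 1/rho.
        U, s, Vt = np.linalg.svd(M - C + Y / rho, full_matrices=False)
        L = U @ np.diag(np.maximum(s - 1.0 / rho, 0.0)) @ Vt

        # C-update: shrink each column's l2 norm by lam/rho.
        R = M - L + Y / rho
        norms = np.linalg.norm(R, axis=0)
        C = R * np.maximum(1.0 - (lam / rho) / np.maximum(norms, 1e-12), 0.0)

        # Dual ascent and gentle penalty increase.
        Y += rho * (M - L - C)
        rho = min(rho * 1.05, 1e6)
    return L, C
```

Applied to data of the form M = L_0 + C_0, the corrupted columns show up as the columns of the returned C with non-negligible \ell_2 norm, while the column space of the returned L estimates that of L_0.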

Key Results

The paper's theoretical guarantees are strong:

  • Noiseless Case: Outlier Pursuit recovers the exact column space and the set of outliers whenever the fraction of corrupted points is within the specified bounds. For instance, the choice \lambda = \frac{3}{7\sqrt{\gamma n}} guarantees exact recovery.
  • Noisy Case: When noise is present, the recovered components L' (low-rank part) and C' (outliers) approximate the noiseless solution (\tilde{L}, \tilde{C}) with error bounded by:

    \|L' - \tilde{L}\|_F \leq 10\sqrt{n}\,\varepsilon \quad \text{and} \quad \|C' - \tilde{C}\|_F \leq 9\sqrt{n}\,\varepsilon.
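To make the recovery condition concrete: one standard definition of the incoherence of a subspace with orthonormal basis U \in R^{p \times r} is \mu(U) = (p/r)\,\max_i \|e_i^T U\|_2^2, which always lies in [1, p/r]. The helper below is a sketch using this common definition (which may differ in constants from the paper's exact setup) to test whether a corruption level sits inside the guaranteed regime:

```python
import numpy as np

def incoherence(U):
    """mu(U) = (p/r) * max_i ||row_i(U)||^2 for an orthonormal basis U (p x r)."""
    p, r = U.shape
    return (p / r) * np.max(np.sum(U**2, axis=1))

def recovery_condition_holds(gamma, mu, r, c1=9/121):
    """Check the bound gamma / (1 - gamma) <= c1 / (mu * r)."""
    return gamma / (1.0 - gamma) <= c1 / (mu * r)

# Example: orthonormal basis of a random 2-dimensional subspace in R^30.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((30, 2)))
mu = incoherence(U)   # lies between 1 and p/r = 15
```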

Implications and Future Directions

The practical and theoretical implications of this work are significant. In domains like bioinformatics and finance, where data is often corrupted or contains outliers, Outlier Pursuit offers a robust alternative to standard PCA. The convex optimization approach ensures computational efficiency, making it suitable for large-scale applications.

For future research, extending these results to more complex corruption models, such as partial observations and dynamic environments, could be highly beneficial. Additionally, exploring non-convex formulations that might offer better empirical performance without sacrificing theoretical guarantees is another promising direction.

Conclusion

Robust PCA via Outlier Pursuit stands out as an essential advancement in making PCA robust to outliers. By leveraging convex optimization techniques, it provides theoretical guarantees for exact recovery, addressing a critical gap in existing dimensionality reduction methods. This work paves the way for robust data analysis in practical scenarios where exact low-rank recovery and outlier detection are paramount.