An Improved Approximation Algorithm for the Column Subset Selection Problem
(0812.4293v2)
Published 22 Dec 2008 in cs.DS
Abstract: We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2, m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously-chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8, $$ \|A - P_C A\|_F \leq \Theta(k \log^{1/2} k)\, \|A - A_k\|_F. $$ This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8, $$ \|A - P_C A\|_2 \leq \Theta(k \log^{1/2} k)\, \|A - A_k\|_2 + \Theta(k^{3/4} \log^{1/4} k)\, \|A - A_k\|_F. $$ This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\|A - A_k\|_F$, whereas previous results depend on $\sqrt{n-k}\, \|A - A_k\|_2$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor.
The paper introduces a two-stage algorithm that combines randomized sampling and deterministic refinement to select key matrix columns.
It achieves a time complexity of O(min{mn², m²n}) and attains a Frobenius norm approximation factor of Θ(k √(log k)).
The methodology builds on RRQR factorizations and geometric functional analysis, setting the stage for future matrix approximation research.
An Improved Approximation Algorithm for the Column Subset Selection Problem
The paper introduces a novel two-stage algorithm designed to address the Column Subset Selection Problem (CSSP), a challenging task in both theoretical computer science and numerical linear algebra. The CSSP involves selecting the "best" subset of exactly $k$ columns from an $m \times n$ matrix $A$ to form a matrix $C$ such that the projection of $A$ onto the span of $C$ closely approximates $A$.
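To make the objective concrete, here is a minimal Python sketch (not taken from the paper; the helper name `cssp_residual` is illustrative) that evaluates the Frobenius-norm residual $\|A - P_C A\|_F$ for a given column subset:

```python
import numpy as np

def cssp_residual(A, cols):
    """Frobenius-norm residual ||A - P_C A||_F for a column subset.

    P_C = C C^+ is the orthogonal projection onto span(C),
    where C = A[:, cols].
    """
    C = A[:, cols]
    # Orthonormal basis for span(C) via thin QR; Q Q^T A = P_C A
    Q, _ = np.linalg.qr(C)
    return np.linalg.norm(A - Q @ (Q.T @ A), ord="fro")
```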
Main Results and Contributions
This research presents an algorithm that runs in $O(\min\{mn^2, m^2n\})$ time. In the first, randomized stage, the algorithm selects $\Theta(k \log k)$ columns based on a probability distribution informed by the top-$k$ right singular subspace of $A$. In the second, deterministic stage, these columns are winnowed down to exactly $k$ via a deterministic column-selection procedure in the style of rank-revealing QR. The result is a matrix $C$ with substantial guarantees on the residuals $\|A - P_C A\|_F$ and $\|A - P_C A\|_2$, where $P_C = CC^+$ is the orthogonal projection onto the span of $C$.
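The following Python sketch illustrates this two-stage structure under the assumption of leverage-score sampling from the top-$k$ right singular subspace followed by a pivoted QR in stage two. The function name, the sampling constant, and the omission of the paper's rescaling of sampled columns are simplifications for illustration, not the authors' exact procedure:

```python
import numpy as np
from scipy.linalg import qr

def two_stage_css(A, k, c=None, rng=None):
    """Sketch of a two-stage column selection (illustrative, simplified)."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    # c = Θ(k log k) sampled columns; the constant 4 is illustrative
    c = c or 4 * int(np.ceil(k * np.log(k + 1)))

    # Top-k right singular vectors of A (first k rows of Vt)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                       # k x n

    # Leverage-score distribution: p_j = ||Vk[:, j]||^2 / k, sums to 1
    p = (Vk ** 2).sum(axis=0) / k

    # Stage 1: randomized sampling of column indices
    sampled = rng.choice(n, size=min(c, n), replace=False, p=p)

    # Stage 2: pivoted (rank-revealing-style) QR on the sampled
    # columns of Vk keeps exactly k of them
    _, _, piv = qr(Vk[:, sampled], pivoting=True)
    return np.sort(sampled[piv[:k]])
```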
Quantitatively, the paper proves that, with probability at least 0.8, the Frobenius norm approximation is only a factor of $\sqrt{k \log k}$ worse than the best-known existential guarantee and roughly $O(\sqrt{k!})$ better than the previous best algorithmic result. The spectral norm result is more nuanced: the bound depends on $\|A - A_k\|_F$ rather than on $\sqrt{n-k}\,\|A - A_k\|_2$, and it is asymptotically worse by a $(k \log k)^{1/4}$ factor when those two quantities are comparable.
Computational and Theoretical Implications
Practically, the main advantage of this algorithm is its computational efficiency, which makes it suitable for the large-scale matrices common in modern data analysis. Sampling $c = \Theta(k \log k)$ columns before the deterministic refinement reflects a deliberate trade-off between running time and approximation precision.
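As an illustrative usage example (not supplied by the paper), one can compare the achieved residual against the best rank-$k$ error, reusing the sketched helpers above:

```python
# Random test matrix; any m x n data matrix would do
A = np.random.default_rng(0).standard_normal((500, 200))
k = 10
cols = two_stage_css(A, k, rng=0)

# Best rank-k error ||A - A_k||_F from the trailing singular values
s = np.linalg.svd(A, compute_uv=False)
best = np.sqrt((s[k:] ** 2).sum())

# The paper bounds this ratio by Θ(k log^{1/2} k) with probability ≥ 0.8
print(cssp_residual(A, cols) / best)
```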
Theoretically, this work extends existing frameworks and builds on techniques such as the Rank Revealing QR (RRQR) factorizations and randomized sampling approaches. It integrates recent advances from geometric functional analysis to provide robust performance guarantees for CSSP, potentially inspiring further integration of these methodologies into other matrix approximation challenges.
Future Directions
This work sets a strong precedent for further research into the CSSP and related approximation problems, such as CUR decompositions and matrix sketching. Improvements to the sampling distribution, alternative deterministic selection procedures, or adaptation to parallel computing environments could enhance the method's applicability and efficiency. Moreover, further theoretical exploration of bounds in other matrix norms might yield insights that inform the development of even more precise algorithms.
Overall, this paper represents a clear step forward in the approximation of high-dimensional matrices, delivering improved algorithms with broad implications across scientific computing, machine learning, and data science. Its blend of theoretical rigor and practical viability marks it as a significant contribution to the matrix approximation literature.