Fast approximation of matrix coherence and statistical leverage (1109.3843v2)

Published 18 Sep 2011 in cs.DS, cs.DM, and cs.LG

Abstract: The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(n d \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments.



Summary

The paper presents a randomized algorithm that approximates all n statistical leverage scores of an n × d matrix to within relative error, reducing the running time from O(nd²) to O(nd log n) under assumptions on n and d.
The approach extends to matrices with n ≈ d and to streaming environments, which makes leverage-based analysis feasible on massive datasets.
It also approximates cross-leverage scores, which are relevant to tasks such as low-rank matrix approximation and outlier detection.

Fast Approximation of Matrix Coherence and Statistical Leverage

The paper "Fast approximation of matrix coherence and statistical leverage" by Drineas et al. presents a randomized algorithm for estimating the statistical leverage scores and the coherence of a matrix, quantities that matter in many large-scale data analysis and machine learning tasks. They are particularly relevant to matrix completion and Nyström-based low-rank matrix approximation, and they define the structural nonuniformity that fast randomized matrix algorithms must handle.

Overview

The statistical leverage scores of a matrix are the squared row norms of the matrix containing its top left singular vectors, and the coherence is the largest leverage score. These scores capture the structural nonuniformity of the matrix, which governs the design and performance of randomized matrix algorithms. The paper introduces a randomized algorithm that approximates all of the scores to within relative error in $O(nd \log n)$ time (under assumptions on the precise values of $n$ and $d$), compared with the $O(nd^2)$ time of the naive approach, which computes an orthogonal basis for the range of the matrix.
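To make the definition concrete, the following is a minimal NumPy sketch of the naive $O(nd^2)$ computation, assuming a tall matrix with full column rank. The function names `exact_leverage_scores` and `coherence` are illustrative and do not come from the paper. It uses a thin QR factorization rather than an SVD; any orthonormal basis for the range of the matrix gives the same row norms.

```python
import numpy as np

def exact_leverage_scores(A):
    """Leverage scores of a tall n x d matrix A, assumed to have full column rank.

    Computes an orthonormal basis Q for the range of A (thin QR, O(n d^2) time)
    and returns the squared Euclidean norm of each row of Q.
    """
    Q, _ = np.linalg.qr(A, mode="reduced")  # Q is n x d with orthonormal columns
    return np.sum(Q**2, axis=1)

def coherence(A):
    """Coherence of A: the largest leverage score."""
    return float(np.max(exact_leverage_scores(A)))

# Small illustrative input (random data, not from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
scores = exact_leverage_scores(A)
```

For a full-column-rank matrix the scores lie in [0, 1] and sum to $d$, which gives a quick sanity check on any implementation.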

Contributions

Algorithm Efficiency: The primary contribution is a randomized algorithm that approximates all $n$ leverage scores to within relative error in $O(nd \log n)$ time. The saving over the naive $O(nd^2)$ computation matters most for large matrices where $n$ is much greater than $d$.
Extensions to General Matrices: The paper extends its basic algorithm to handle matrices where $n \approx d$, broadening the scope of practical applications.
Approximation of Cross-Leverage Scores: Beyond the individual leverage scores, the paper also describes how to approximate the so-called cross-leverage scores, which relate pairs of rows rather than single rows.
Streaming Environments: The paper also adapts these ideas to streaming environments, where leverage-related statistics must be computed on massive datasets with limited memory and few passes over the data.
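As an illustration of the cross-leverage scores mentioned in the list above, the sketch below computes them exactly for a small matrix, assuming the usual definition as inner products between rows of an orthonormal basis for the range, that is, the off-diagonal entries of the projection matrix onto the range of the matrix. The helper name `cross_leverage_matrix` is hypothetical, and this is the exact quadratic-memory computation, not the paper's fast approximation.

```python
import numpy as np

def cross_leverage_matrix(A):
    """Return the n x n projection matrix P = Q Q^T onto the range of A.

    Assumes A is n x d with full column rank. Entry P[i, j] with i != j is the
    cross-leverage score of rows i and j; the diagonal entry P[i, i] is the
    leverage score of row i. Forming P costs O(n^2 d) time and O(n^2) memory,
    so this is only sensible for small n.
    """
    Q, _ = np.linalg.qr(A, mode="reduced")
    return Q @ Q.T

# Small illustrative input (random data, not from the paper).
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 4))
P = cross_leverage_matrix(A)
leverage = np.diag(P)
```

Because there are $n^2$ such scores, computing all of them exactly is impractical for large $n$, which is why an approximation scheme is of interest.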

Results

The algorithm returns relative-error approximations to all of the leverage scores in $O(nd \log n)$ time, compared with the $O(nd^2)$ time of the naive algorithm. The improvement comes from fast randomized projections: the analysis can be viewed as computing a relative-error approximation to an underconstrained least-squares problem or, relatedly, as an application of Johnson-Lindenstrauss type ideas.
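The two-stage structure behind this result can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: it substitutes dense Gaussian matrices for the fast structured projections, so it shows how the pieces fit together but does not achieve the $O(nd \log n)$ running time. The function name and the sketch sizes `r1` and `r2` are illustrative choices.

```python
import numpy as np

def approx_leverage_scores(A, r1, r2, rng):
    """Approximate the leverage scores of a tall n x d matrix A.

    Assumption: dense Gaussian sketches stand in for fast structured
    transforms, so this is illustrative only and not faster than exact QR.
    """
    n, d = A.shape
    # Stage 1: compress the n rows down to r1 rows and factor the small matrix.
    Pi1 = rng.standard_normal((r1, n)) / np.sqrt(r1)
    _, R = np.linalg.qr(Pi1 @ A, mode="reduced")      # R is d x d
    # Stage 2: Johnson-Lindenstrauss projection from d to r2 dimensions,
    # applied to R^{-1} so that A R^{-1} is never formed explicitly.
    Pi2 = rng.standard_normal((d, r2)) / np.sqrt(r2)
    X = np.linalg.solve(R, Pi2)                       # X = R^{-1} Pi2, d x r2
    Omega = A @ X                                     # n x r2
    return np.sum(Omega**2, axis=1)

# Illustrative comparison against the exact scores (random data).
rng = np.random.default_rng(2)
A = rng.standard_normal((2000, 10))
A[:5] *= 50.0                                         # a few high-leverage rows
Q, _ = np.linalg.qr(A, mode="reduced")
exact = np.sum(Q**2, axis=1)
approx = approx_leverage_scores(A, r1=1000, r2=300, rng=rng)
ratio = approx / exact
```

The rows of `A @ inv(R)` have squared norms close to the true leverage scores when the first sketch roughly preserves the geometry of the column space, and the second projection preserves those row norms while reducing the cost of the final product. Larger `r1` and `r2` tighten the approximation at higher cost.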

Implications

Algorithmic Developments: This work underpins the development of faster randomized algorithms for problems such as least-squares regression and low-rank matrix approximation, where leverage scores are used for non-uniform sampling and preprocessing.
Practical Applications: In numerical linear algebra and data analysis, the rapid estimation of leverage scores can enhance tasks such as identifying outliers, optimizing data subsampling, and computing matrix decompositions more efficiently.
Large-scale Analyses: The ability to approximate leverage scores over data streams, without storing the full matrix, is important for modern data-intensive applications, including those in genetics and material science.
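One way leverage scores are used for the non-uniform sampling mentioned above is in subsampled least-squares regression. The sketch below is a generic illustration of that idea, not code from the paper: rows are drawn with probability proportional to their leverage scores and rescaled before a smaller problem is solved. It uses exact scores for clarity (approximate scores could be swapped in), and the function name and the sample size `s` are assumptions.

```python
import numpy as np

def leverage_sample_lstsq(A, b, s, rng):
    """Solve min ||Ax - b|| approximately from s rows sampled by leverage.

    Rows are sampled with replacement with probability p_i proportional to the
    leverage score of row i, and each sampled row is rescaled by
    1 / sqrt(s * p_i) so the subsampled problem is unbiased in expectation.
    """
    n, d = A.shape
    Q, _ = np.linalg.qr(A, mode="reduced")
    p = np.sum(Q**2, axis=1)
    p = p / p.sum()                                   # sampling probabilities
    idx = rng.choice(n, size=s, replace=True, p=p)
    w = 1.0 / np.sqrt(s * p[idx])                     # rescaling weights
    x, *_ = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)
    return x

# Illustrative comparison with the full least-squares solution (random data).
rng = np.random.default_rng(3)
A = rng.standard_normal((5000, 5))
b = A @ np.arange(1.0, 6.0) + rng.standard_normal(5000)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
x_samp = leverage_sample_lstsq(A, b, s=500, rng=rng)
res_full = np.linalg.norm(A @ x_full - b)
res_samp = np.linalg.norm(A @ x_samp - b)
```

Comparing `res_samp` with `res_full` shows how much accuracy is lost by solving on 500 sampled rows instead of all 5000.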

Future Directions

This framework sets the stage for further exploration into fast randomized algorithms across broader domains in matrix computations and data processing. Subsequent research might focus on empirical evaluations of this algorithm in diverse applications, enhancing its numerical stability and integration into existing data processing pipelines.

Conclusion

Drineas et al.’s work represents a significant advancement in the estimation of statistical properties of matrices, emphasizing computational efficiency and adaptability to large-scale environments. The proposed algorithms offer robust solutions to challenges inherent in processing large matrices, opening avenues for more scalable data analysis techniques.
