Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
169 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Dimensionality Reduction of Massive Sparse Datasets Using Coresets (1503.01663v1)

Published 5 Mar 2015 in cs.DS

Abstract: In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the low rank approximation (reduced SVD) of such matrices. Our solution uses coresets, which is a subset of $O(k/\eps2)$ scaled rows from the $n\times d$ input matrix, that approximates the sub of squared distances from its rows to every $k$-dimensional subspace in $\REALd$, up to a factor of $1\pm\eps$. An open theoretical problem has been whether we can compute such a coreset that is independent of the input matrix and also a weighted subset of its rows. %An open practical problem has been whether we can compute a non-trivial approximation to the reduced SVD of very large databases such as the Wikipedia document-term matrix in a reasonable time. We answer this question affirmatively. % and demonstrate an algorithm that efficiently computes a low rank approximation of the entire English Wikipedia. Our main technical result is a novel technique for deterministic coreset construction that is based on a reduction to the problem of $\ell_2$ approximation for item frequencies.

Citations (53)

Summary

We haven't generated a summary for this paper yet.