Practical and Optimal LSH for Angular Distance (1509.02897v1)

Published 9 Sep 2015 in cs.DS, cs.CG, and cs.IR

Abstract: We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn 2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [Charikar, 2002] in practice. We also introduce a multiprobe version of this algorithm, and conduct experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.

Citations (450)


Summary

The paper introduces cross-polytope LSH, an LSH family for angular distance that achieves the asymptotically optimal running time exponent while remaining practical.
It proves a fine-grained lower bound on the quality of any LSH family for angular distance, implying that cross-polytope LSH offers a near-optimal trade-off between evaluation time and quality for a natural class of LSH functions.
The multiprobe variant lowers memory use by probing several buckets per table and is reported to be up to 10× faster than hyperplane LSH in the experiments.

Practical and Optimal LSH for Angular Distance

The paper "Practical and Optimal LSH for Angular Distance" by Alexandr Andoni et al. presents a new Locality-Sensitive Hashing (LSH) family for angular distance. The family attains the asymptotically optimal running time exponent while remaining practical to implement, which earlier optimal schemes such as Spherical LSH were not.

Key Contributions

The paper makes the following primary contributions:

Cross-Polytope LSH: A hash function based on randomly rotated cross-polytopes that achieves the same exponent $\rho$ as the Spherical LSH scheme while remaining computationally feasible. A theoretical analysis supports its optimality.
Fine-Grained Lower Bounds: New non-asymptotic lower bounds on the trade-off between evaluation time and quality for LSH families. They imply that cross-polytope LSH achieves a near-optimal trade-off within a natural class of LSH functions.
Multiprobe LSH: A multiprobe variant of cross-polytope LSH that reduces space usage while improving query performance. It achieves significant speedups over hyperplane LSH without a large memory footprint.
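The first and third contributions can be illustrated with a minimal sketch. It assumes the hash is defined as the vertex of the cross-polytope (one of the $2d$ signed unit vectors $\pm e_i$) closest to the randomly rotated input, with a dense Gaussian matrix standing in for the random rotation. The function names are hypothetical, and the probing order shown here (vertices ranked by inner product with the rotated point) is a simplified stand-in, not the scoring scheme used in the paper.

```python
import random


def make_gaussian_matrix(d, seed=0):
    """d x d matrix of i.i.d. standard Gaussians, used as a random rotation."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]


def rotate(x, A):
    """Dense matrix-vector product A @ x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def cross_polytope_hash(x, A):
    """Return (index, sign) of the vertex +/- e_i closest to A @ x.

    The closest vertex is the coordinate with the largest absolute value,
    taken with the sign of that coordinate. Normalizing A @ x is not needed
    because scaling does not change the argmax.
    """
    y = rotate(x, A)
    i = max(range(len(y)), key=lambda j: abs(y[j]))
    return (i, 1 if y[i] >= 0 else -1)


def probe_sequence(x, A, num_probes):
    """Simplified multiprobe order (an assumption, not the paper's scheme).

    Ranks all 2d vertices by their inner product with the rotated point,
    so the first probe is the ordinary hash bucket and later probes are
    progressively less likely to contain near neighbors.
    """
    y = rotate(x, A)
    vertices = [(s * v, i, s) for i, v in enumerate(y) for s in (1, -1)]
    vertices.sort(key=lambda t: -t[0])
    return [(i, s) for _, i, s in vertices[:num_probes]]
```

In a full index, several such hashes would be concatenated per table and multiple tables used; probing extra buckets per table is what lets a multiprobe scheme get by with fewer tables.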

Theoretical Insights

Theoretical analysis of the LSH family revolves around the cross-polytope and its capacity to distinguish angular distances. Specifically, the paper achieves:

A query time of $O(n^\rho)$ and space of $O(n^{1 + \rho})$ with $\rho = \frac{1}{2c^2 - 1}$, an exponent shown to be optimal for a significant class of algorithms.
Use of pseudo-random rotations and feature hashing to keep hash evaluation efficient, including for high-dimensional sparse data.
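The pseudo-random rotation and feature hashing steps can be sketched as follows. This assumes the rotation is built from three rounds of random sign flips, each followed by a normalized Hadamard transform, and that sparse inputs are given as an index-to-value dictionary. All function names and the seeding scheme are hypothetical choices made for illustration.

```python
import random


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform in O(d log d).

    len(v) must be a power of two.
    """
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v


def make_sign_flips(d, rounds=3, seed=0):
    """One random +/-1 diagonal per round."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(rounds)]


def pseudo_rotate(x, signs):
    """Apply (H D) once per round; each round is an orthogonal map."""
    d = len(x)
    assert d > 0 and d & (d - 1) == 0, "dimension must be a power of two"
    scale = d ** -0.5  # normalizes the Hadamard matrix
    y = list(x)
    for s in signs:
        y = [scale * t for t in fwht([si * yi for si, yi in zip(s, y)])]
    return y


def feature_hash(sparse, d, seed=0):
    """Map a sparse vector {index: value} to a dense vector of length d.

    Each input index gets a pseudo-random bucket and sign derived from
    (seed, index); colliding entries are summed.
    """
    out = [0.0] * d
    for idx, val in sparse.items():
        rng = random.Random(seed * 1000003 + idx)
        bucket = rng.randrange(d)
        sign = rng.choice((-1.0, 1.0))
        out[bucket] += sign * val
    return out
```

The point of this design is cost: a dense rotation takes $O(d^2)$ time per hash, while the Hadamard-based version takes $O(d \log d)$, and feature hashing first shrinks a very high-dimensional sparse vector to a manageable dense one.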

The paper also establishes conditions under which the cross-polytope LSH algorithm can approach theoretical bounds and provides numerical evaluations that verify this proximity.
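To get a feel for the exponent, the short calculation below evaluates $\rho = \frac{1}{2c^2 - 1}$ for a few approximation factors $c$ and prints the resulting query and space exponents along with $n^\rho$ for one data set size. The helper names and the chosen values of $c$ and $n$ are illustrative assumptions, and constants and lower-order terms are ignored.

```python
def rho(c):
    """Exponent rho = 1 / (2 c^2 - 1) for approximation factor c > 1."""
    if c <= 1.0:
        raise ValueError("approximation factor c must be greater than 1")
    return 1.0 / (2.0 * c * c - 1.0)


def query_and_space_exponents(c):
    """Exponents of n in the query time O(n^rho) and space O(n^(1 + rho))."""
    r = rho(c)
    return r, 1.0 + r


if __name__ == "__main__":
    n = 10 ** 6  # illustrative data set size
    for c in (1.5, 2.0, 3.0):
        q, s = query_and_space_exponents(c)
        print(f"c={c}: rho={q:.4f}, space exponent={s:.4f}, n^rho={n ** q:.1f}")
```

Larger $c$ (a looser approximation) gives a smaller $\rho$, so both query time and space grow more slowly with $n$.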

Practical Implications

Experiments on real and synthetic datasets, including SIFT vectors and tf-idf data, show the practical advantages. For data sizes between $10^5$ and $10^8$ entries, the multiprobe cross-polytope LSH is reported to be up to $10\times$ faster than hyperplane LSH and $700\times$ faster than a linear scan.

Such improvements have direct implications for applications in computer vision, machine learning, and information retrieval, where processing speed and memory efficiency in high-dimensional data are crucial.

Future Directions

The paper opens avenues for further work on locality-sensitive hashing. It suggests that future work could look for hash functions that reach high performance through more computationally efficient methods rather than through exhaustive range enumeration.

Overall, this work represents a significant step forward in the field of LSH algorithms. It provides both a theoretical framework and practical methodology for employing LSH in applications where angular distances are pivotal. Further exploration in non-linear hash function computation could unlock even broader utilization in large-scale data processing tasks.
