Locality Sensitive Hashing (LSH) Overview
- Locality Sensitive Hashing is a randomized technique that maps similar high-dimensional data points into the same hash buckets with high probability, enabling efficient approximate nearest neighbor search.
- It relies on metric-specific LSH families, such as p-stable hashing for Euclidean distance, SimHash for cosine similarity, and MinHash for Jaccard similarity, each constructed so that similar items collide more often than dissimilar ones.
- Advanced LSH variants and distributed frameworks enhance scalability and speed, making it ideal for applications in machine learning, image retrieval, and multimodal data analysis.
Locality Sensitive Hashing (LSH) is a data-independent randomized framework for efficient approximate nearest neighbor (ANN) search in high-dimensional spaces. LSH enables sublinear query time with rigorous theoretical guarantees by hashing input items so that similar objects are mapped to the same hash bucket with higher probability than dissimilar ones. The framework has catalyzed both theoretical analysis and high-impact applications in large-scale similarity search, information retrieval, machine learning, and data mining.
1. Formal Definition and Theoretical Guarantees
Let $(\mathcal{X}, d)$ be a metric space. A family $\mathcal{H}$ of hash functions is called $(r, cr, p_1, p_2)$-sensitive for some distance threshold $r > 0$ and approximation factor $c > 1$ if for all $x, y \in \mathcal{X}$, the following holds:
- If $d(x, y) \le r$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \ge p_1$
- If $d(x, y) \ge cr$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \le p_2$, with $p_1 > p_2$.
The LSH exponent, $\rho = \ln(1/p_1)/\ln(1/p_2)$, governs the sublinear query complexity and is minimized when the gap between $p_1$ and $p_2$ is widest. Query time is $O(n^{\rho} \log n)$ and space usage is $O(n^{1+\rho})$ for $n$ data points (Wang et al., 2014).
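As a quick worked illustration of these bounds (the probabilities $p_1$, $p_2$ and data size below are arbitrary, not taken from any cited paper), the exponent and the implied query/space costs can be computed directly:

```python
import math

def lsh_exponent(p1: float, p2: float) -> float:
    """LSH exponent rho = ln(1/p1) / ln(1/p2); smaller is better."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

# Illustrative values only: a family with p1 = 0.8, p2 = 0.3
p1, p2 = 0.8, 0.3
rho = lsh_exponent(p1, p2)
n = 1_000_000  # assumed number of indexed points

print(f"rho                      ~ {rho:.3f}")
print(f"query cost  O(n^rho)     ~ {n ** rho:,.0f} candidate probes")
print(f"space       O(n^(1+rho)) ~ {n ** (1 + rho):,.0f} bucket entries")
```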
2. Canonical LSH Constructions
Several LSH families are tailored to specific metrics:
- $p$-Stable LSH for Euclidean Distance: Hash functions of the form $h_{\mathbf{a},b}(\mathbf{x}) = \lfloor (\mathbf{a} \cdot \mathbf{x} + b)/w \rfloor$, with $\mathbf{a}$ drawn from a $p$-stable distribution (Gaussian for $p = 2$) and $b \sim \mathrm{Uniform}[0, w)$, yield standard $(r, cr, p_1, p_2)$-sensitivity for the Euclidean metric (Nakanishi et al., 25 Mar 2024).
- SimHash for Cosine Similarity: $h_{\mathbf{a}}(\mathbf{x}) = \operatorname{sign}(\mathbf{a} \cdot \mathbf{x})$, with $\mathbf{a}$ drawn from the standard normal distribution. Collision probability is $1 - \theta(\mathbf{x}, \mathbf{y})/\pi$, where $\theta(\mathbf{x}, \mathbf{y})$ is the angle between $\mathbf{x}$ and $\mathbf{y}$ (Wang et al., 2014).
- MinHash for Jaccard Similarity: For sets, $h_{\pi}(A) = \min_{a \in A} \pi(a)$ for a random permutation $\pi$, with collision probability equal to the Jaccard index $J(A, B) = |A \cap B| / |A \cup B|$.
Advanced families include cross-polytope LSH (optimal for angular search), multi-probe LSH (reducing the number of required tables), and bilinear or tensorized projections for high-order structured data (Verma et al., 11 Feb 2024, Kim et al., 2015).
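To make the canonical families above concrete, here is a minimal NumPy sketch of SimHash and MinHash; the dimensions, bit counts, and sets are illustrative choices, and the estimators simply invert the collision probabilities stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(x: np.ndarray, A: np.ndarray) -> np.ndarray:
    """SimHash: sign of random Gaussian projections (one bit per row of A)."""
    return (A @ x > 0).astype(np.uint8)

def minhash(s: set, perms: np.ndarray) -> np.ndarray:
    """MinHash: minimum of each random permutation restricted to the set."""
    items = np.fromiter(s, dtype=np.int64)
    return perms[:, items].min(axis=1)

# --- SimHash: estimate the angle between two vectors from bit collisions ---
d, num_bits = 64, 256
A = rng.standard_normal((num_bits, d))
x, y = rng.standard_normal(d), rng.standard_normal(d)
collisions = np.mean(simhash(x, A) == simhash(y, A))
angle_est = np.pi * (1.0 - collisions)   # invert Pr[collision] = 1 - theta/pi
angle_true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"SimHash angle estimate {angle_est:.3f} vs true {angle_true:.3f}")

# --- MinHash: estimate Jaccard similarity of two small integer sets ---
universe, num_perms = 1000, 128
perms = np.array([rng.permutation(universe) for _ in range(num_perms)])
a, b = set(range(0, 200)), set(range(100, 300))
jaccard_est = np.mean(minhash(a, perms) == minhash(b, perms))
print(f"MinHash Jaccard estimate {jaccard_est:.3f} vs true {100/300:.3f}")
```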
3. Algorithmic Frameworks and Index Structures
The classical Indyk-Motwani LSH framework constructs $L$ hash tables, each keyed by the concatenation of $k$ base hash functions to reduce false positives among distant points. Each data point is inserted into all $L$ tables using these compound hash keys, and queries probe the buckets corresponding to the query’s hash keys in every table (Christiani, 2017).
Indexing and Query Process:
- Choose $k$ and $L$ based on data size and the LSH family parameters $(p_1, p_2)$.
- Build $L$ hash tables, each keyed by the $k$-wise concatenation of base hash functions.
- Insert each database point into the corresponding bucket of each hash table.
- For a query, compute its hash keys and retrieve candidate points from all matching buckets. Exact distances are computed to select the ANN(s).
Efficient amplification is achieved via careful parameterization; for constant query success probability, $k \approx \log_{1/p_2} n$ and $L \approx n^{\rho}$ (Wang et al., 2014, Christiani, 2017).
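The following is a minimal, self-contained sketch of this indexing and query process for Euclidean data, using Gaussian ($p$-stable, $p = 2$) base hashes; the class name and all parameter values ($k$, $L$, $w$) are illustrative, not tuned or drawn from the cited works:

```python
import numpy as np
from collections import defaultdict

class EuclideanLSHIndex:
    """Classic multi-table LSH index: L tables, each keyed by k concatenated
    p-stable (Gaussian) hashes h(x) = floor((a.x + b) / w)."""

    def __init__(self, dim, k=8, L=16, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.A = rng.standard_normal((L, k, dim))   # projection vectors
        self.B = rng.uniform(0.0, w, size=(L, k))   # random offsets
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _keys(self, x):
        # Compound key per table: tuple of k quantized projections.
        proj = np.floor((self.A @ x + self.B) / self.w).astype(np.int64)
        return [tuple(row) for row in proj]

    def insert(self, x):
        idx = len(self.points)
        self.points.append(np.asarray(x, dtype=float))
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, q, top=1):
        # Gather candidates from matching buckets, then rank by exact distance.
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, ()))
        ranked = sorted(candidates,
                        key=lambda i: np.linalg.norm(self.points[i] - q))
        return ranked[:top]

# Usage: index random points and look up an approximate nearest neighbor.
rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 32))
index = EuclideanLSHIndex(dim=32, k=8, L=16)
for p in data:
    index.insert(p)
q = data[42] + 0.01 * rng.standard_normal(32)
print("approximate NNs of perturbed point 42:", index.query(q, top=3))
```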
4. Variants, Extensions, and Distributed Architectures
Scalability and Distributed LSH
In distributed settings, classical LSH suffers from high index size ($O(n^{1+\rho})$, from the $L$ hash tables) and network costs that grow with the number of buckets probed per query. Entropy LSH reduces the number of tables by additionally querying randomly offset points near the query, but increases the number of network calls per query (Bahmani et al., 2012). Layered LSH introduces a two-level hashing scheme: data are first mapped with core $k$-wise projections $G$, and the resulting $G$-hash buckets are then grouped across machines via a second-level LSH $H$. This achieves substantially fewer network calls per query and better load balance:
| Scheme | Shuffle Size (MB) | Runtime (s) | Max Load Factor |
|---|---|---|---|
| Simple LSH | 150 | 480 | 3× |
| Layered LSH | 14 | 160 | 1.5× |
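Below is a minimal sketch of the two-level idea behind Layered LSH, under the simplifying assumption that the first-level compound key $G(x)$ is an integer vector and the second level merely assigns whole buckets to partitions; the function names and parameters are illustrative, not the scheme's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

dim, k, w = 32, 8, 4.0
A = rng.standard_normal((k, dim))   # first-level p-stable projections
b = rng.uniform(0.0, w, size=k)

def first_level_key(x):
    """Core k-wise LSH key G(x): the bucket a point falls into."""
    return np.floor((A @ x + b) / w).astype(np.int64)

num_partitions, W = 8, 16.0
a2 = rng.standard_normal(k)         # second-level projection over bucket keys

def partition_of(bucket_key):
    """Second-level LSH H(G(x)): nearby buckets map to the same partition,
    so a query contacts few machines while load stays balanced."""
    return int(np.floor((a2 @ bucket_key) / W)) % num_partitions

x = rng.standard_normal(dim)
g = first_level_key(x)
print("bucket key:", g, "-> partition", partition_of(g))
```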
Space, Hash Cost, and Acceleration
FastLSH accelerates hash computation via random subsampling of dimensions, reducing per-hash time from $O(d)$ to $O(m)$ for an $m$-sized subset of the $d$ input dimensions. Theoretical LSH guarantees (monotonicity of collision probability in distance, correct exponent $\rho$) are preserved, and empirical evaluation demonstrates up to 80x speedup in hash evaluation without loss in retrieval accuracy (Tan et al., 2023).
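A minimal sketch of this dimension-subsampling idea: each hash function touches only a random subset of $m$ of the $d$ coordinates, so evaluating it costs $O(m)$ rather than $O(d)$. The sizes below are illustrative and not the settings used by FastLSH:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, w = 1024, 32, 4.0                   # full dim, sampled dim, bucket width
S = rng.choice(d, size=m, replace=False)  # random coordinate subset (fixed per hash)
a = rng.standard_normal(m)                # projection over the sampled coordinates
b = rng.uniform(0.0, w)

def fast_hash(x: np.ndarray) -> int:
    """Per-hash cost drops from O(d) to O(m): project only the sampled coords."""
    return int(np.floor((a @ x[S] + b) / w))

x = rng.standard_normal(d)
print("hash value:", fast_hash(x))
```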
Structured and Multimodal Data
LSH is generalized to accommodate matrices, tensors, or multiple feature groups through bilinear, CP-decomposition, or tensor-train random projections. These factorized hash functions preserve collision probability guarantees while reducing the memory/computation required for high-order data, making LSH feasible for large-scale problems in imaging and signal processing (Verma et al., 11 Feb 2024, Kim et al., 2015).
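As a minimal illustration of the factorized idea for matrix-shaped inputs, the bilinear sketch below produces one hash bit from a rank-1 projection $\operatorname{sign}(\mathbf{u}^{\top} X \mathbf{v})$, storing $m + n$ parameters instead of the $mn$ a flattened projection would need; the shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 64, 48                      # matrix-shaped input, e.g. an image patch
u = rng.standard_normal(m)         # left factor  (m parameters)
v = rng.standard_normal(n)         # right factor (n parameters)

def bilinear_hash_bit(X: np.ndarray) -> int:
    """One hash bit from a rank-1 bilinear projection sign(u^T X v);
    stores m + n values instead of the m * n of a flattened projection."""
    return int(u @ X @ v > 0)

X = rng.standard_normal((m, n))
print("bilinear hash bit:", bilinear_hash_bit(X))
```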
Specialized and Learned Extensions
- Data-dependent LSH: Learns hash functions adapted to input data structure for improved empirical performance, at the cost of intensive preprocessing (Wang et al., 2014).
- Neural LSH: Employs a learned neural hash function trained with an LSH-inspired loss to align collision probability with complex, task-specific metrics (Wang et al., 31 Jan 2024).
- Distance-Sensitive Hashing (DSH): Generalizes LSH to allow tunable, even non-monotonic, collision probability functions over distance, supporting output-sensitive or privacy-preserving queries (Aumüller et al., 2017).
5. Applications and Empirical Performance
LSH and its variants are foundational in approximate similarity search for high-dimensional datasets across domains:
- Vision: Near-duplicate detection and content-based image retrieval with SIFT descriptors.