Locality Sensitive Hashing (LSH) Overview

Updated 1 December 2025
  • Locality Sensitive Hashing is a randomized technique that maps similar high-dimensional data points into the same hash buckets with high probability, enabling efficient approximate nearest neighbor search.
  • It relies on metric-specific LSH families, such as p-stable hashing for Euclidean distance, SimHash for cosine similarity, and MinHash for Jaccard similarity, whose collision probabilities are matched to the underlying similarity measure.
  • Advanced LSH variants and distributed frameworks improve scalability and speed, making the technique well suited to applications in machine learning, image retrieval, and multimodal data analysis.

Locality Sensitive Hashing (LSH) is a data-independent randomized framework for efficient approximate nearest neighbor (ANN) search in high-dimensional spaces. LSH enables sublinear query time with rigorous theoretical guarantees by hashing input items so that similar objects are mapped to the same hash bucket with higher probability than dissimilar ones. The framework has catalyzed both theoretical analysis and high-impact applications in large-scale similarity search, information retrieval, machine learning, and data mining.

1. Formal Definition and Theoretical Guarantees

Let $(\mathcal{X}, d)$ be a metric space. A family of hash functions $\mathcal{H}$ is called $(r, cr, p_1, p_2)$-sensitive for a distance threshold $r > 0$ and approximation factor $c > 1$ if, for all $x, y \in \mathcal{X}$, the following holds:

  • If $d(x, y) \leq r$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$.
  • If $d(x, y) \geq cr$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$, where $p_1 > p_2$.

The LSH exponent $\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)} < 1$ governs the sublinear query complexity and is minimized when the gap between $p_1$ and $p_2$ is widest. Query time is $O(n^\rho \log n)$ and space usage is $O(n^{1+\rho})$ for $n$ data points (Wang et al., 2014).
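As a purely illustrative calculation (the values $p_1 = 0.8$ and $p_2 = 0.2$ are hypothetical, not taken from any particular family):

$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)} = \frac{\ln 1.25}{\ln 5} \approx \frac{0.223}{1.609} \approx 0.14,$$

so for $n = 10^6$ points the query cost scales roughly as $n^{0.14} \log n \approx 7 \times 14 \approx 100$ bucket probes and candidate checks, versus $10^6$ distance computations for a linear scan.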

2. Canonical LSH Constructions

Several LSH families are tailored to specific metrics:

  • $p$-Stable LSH for Euclidean Distance: Hash functions of the form $h_{a,b}(x) = \lfloor (a^T x + b)/w \rfloor$ with $a \sim N(0, I_d)$ and $b \sim \mathrm{Unif}[0, w]$ yield standard $(r, cr, p_1, p_2)$-sensitivity for $\ell_2$ (Nakanishi et al., 25 Mar 2024).
  • SimHash for Cosine Similarity: $h_a(x) = \mathrm{sign}(a^T x)$, with $a$ drawn from the standard normal distribution. The collision probability is $1 - \frac{\theta(x, y)}{\pi}$, where $\theta(x, y)$ is the angle between $x$ and $y$ (Wang et al., 2014).
  • MinHash for Jaccard Similarity: For sets, $h(S) = \min\{\pi(i) : i \in S\}$ for a random permutation $\pi$, with collision probability equal to the Jaccard index of the two sets.
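
A minimal NumPy sketch of one hash draw from each of these three families (illustrative only; the helper names and the width `w = 4.0` are our own choices, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def pstable_hash(x, a, b, w=4.0):
    """p-stable (Euclidean) hash: floor((a.x + b) / w) with Gaussian a, b ~ Unif[0, w)."""
    return int(np.floor((a @ x + b) / w))

def simhash(x, a):
    """SimHash for cosine similarity: sign of a random Gaussian projection."""
    return int(np.sign(a @ x))

def minhash(s, perm):
    """MinHash for Jaccard similarity: minimum permuted index over the set's elements."""
    return min(perm[i] for i in s)

d, universe = 128, 1000
x = rng.normal(size=d)

a = rng.normal(size=d)                # Gaussian direction shared by the two projections
b = rng.uniform(0, 4.0)               # offset for the p-stable bucket
perm = rng.permutation(universe)      # random permutation of the set universe

print(pstable_hash(x, a, b), simhash(x, a), minhash({3, 17, 512}, perm))
```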

Advanced families include cross-polytope LSH (optimal for angular search), multi-probe LSH (reducing the number of required tables), and bilinear or tensorized projections for high-order structured data (Verma et al., 11 Feb 2024, Kim et al., 2015).

3. Algorithmic Frameworks and Index Structures

The classical Indyk-Motwani LSH framework constructs $L = O(n^\rho)$ hash tables, each based on $k = O(\log n)$ projections to reduce false positives among distant points. Each data point is inserted into all tables using compound hash keys, and queries probe all $L$ buckets corresponding to the query's hash keys (Christiani, 2017).

Indexing and Query Process:

  1. Choose $k$ and $L$ based on the data size $n$ and the LSH family parameters $(p_1, p_2)$.
  2. Build $L$ hash tables, each keyed by a $k$-wise concatenation of base hash functions.
  3. Insert each database point into the corresponding bucket of each hash table.
  4. For a query, compute its $L$ hash keys and retrieve candidate points from all matching buckets. Exact distances are then computed to select the ANN(s).
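
The four steps above can be condensed into a toy SimHash-based index; the class and parameter names below are illustrative, and this is a sketch rather than a tuned implementation:

```python
import numpy as np
from collections import defaultdict

class SimHashLSHIndex:
    """Toy L-table LSH index with k-bit SimHash compound keys (illustrative sketch)."""

    def __init__(self, data, k=12, L=10, seed=0):          # step 1: pick k and L
        rng = np.random.default_rng(seed)
        self.data = data                                    # (n, d) array of points
        self.planes = rng.normal(size=(L, k, data.shape[1]))
        self.tables = [defaultdict(list) for _ in range(L)] # step 2: L hash tables
        for idx, x in enumerate(data):                      # step 3: insert every point
            for t, key in enumerate(self._keys(x)):
                self.tables[t][key].append(idx)

    def _keys(self, x):
        # k-wise concatenation of sign bits -> one compound key per table
        return [tuple((self.planes[t] @ x > 0).astype(int)) for t in range(len(self.tables))]

    def query(self, q, top=1):
        # step 4: gather candidates from all L matching buckets, then rank exactly
        cand = {i for t, key in enumerate(self._keys(q)) for i in self.tables[t][key]}
        if not cand:
            return []
        cand = np.fromiter(cand, dtype=int)
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argsort(dists)[:top]].tolist()

# Usage on random data
X = np.random.default_rng(1).normal(size=(1000, 64))
index = SimHashLSHIndex(X, k=12, L=10)
print(index.query(X[0] + 0.01))   # should usually return [0]
```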

Efficient amplification is achieved via careful parameterization; for constant query success probability, $k \approx \ln n / \ln(1/p_2)$ and $L \approx n^{\rho}$ (Wang et al., 2014, Christiani, 2017).
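
A back-of-the-envelope helper for these parameter choices (a sketch under the stated formulas; production systems typically tune $k$ and $L$ empirically):

```python
import math

def lsh_parameters(n, p1, p2):
    """Rule-of-thumb k and L from the amplification analysis (sketch, not a tuned recipe)."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))   # drive per-table false positives to ~1/n
    L = math.ceil(n ** rho)                         # keep a constant query success probability
    return k, L, rho

print(lsh_parameters(n=1_000_000, p1=0.8, p2=0.2))  # prints roughly (9, 7, 0.14)
```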

4. Variants, Extensions, and Distributed Architectures

Scalability and Distributed LSH

In distributed settings, classical LSH suffers from a large index size ($O(n^{1+\rho})$) and network costs proportional to $L$. Entropy LSH reduces the number of tables by querying randomly offset points but increases the number of network calls per query (Bahmani et al., 2012). Layered LSH introduces a two-level hashing scheme: data are first mapped with core $k$-wise projections $H$, and the resulting $k$-hash buckets are then grouped via a second-level LSH $G$. This achieves substantially fewer network calls per query ($O(\sqrt{\log n})$) and better load balance:

| Scheme | Shuffle Size (MB) | Runtime (s) | Max Load Factor |
|---|---|---|---|
| Simple LSH | 150 | 480 | |
| Layered LSH | 14 | 160 | 1.5× |

(Bahmani et al., 2012)
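
One way to sketch the two-level routing idea in code (our own simplification, using SimHash-style first-level projections and a one-dimensional second-level hash over their signatures; not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, num_machines = 64, 12, 16    # num_machines is a hypothetical cluster size

H = rng.normal(size=(k, d))        # first level: core k-wise projection
g = rng.normal(size=k)             # second level: LSH over first-level signatures
W = 4.0                            # second-level bucket width

def signature(x):
    """First-level k-dimensional projection H x (bucketed per coordinate in full LSH)."""
    return H @ x

def machine(x):
    """Second-level LSH groups nearby signatures, so nearby points (and their
    buckets) tend to be routed to the same partition/machine."""
    return int(np.floor((g @ signature(x)) / W)) % num_machines

x = rng.normal(size=d)
q = x + 0.01 * rng.normal(size=d)
print(machine(x), machine(q))      # usually equal: a query touches few machines
```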

Space, Hash Cost, and Acceleration

FastLSH accelerates hash computation via random subsampling of dimensions, reducing per-hash time from $O(d)$ to $O(m)$ for an $m$-sized subset of the $d$ input dimensions. Theoretical LSH guarantees (monotonicity of collision probability in distance, correct exponent $\rho$) are preserved, and empirical evaluation demonstrates up to 80x speedup in hash evaluation without loss in retrieval accuracy (Tan et al., 2023).
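
A hedged sketch of the dimension-subsampling idea (the fixed index set `S` and the function name are ours; see Tan et al., 2023 for the actual estimator and its guarantees):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, w = 1024, 64, 4.0

S = rng.choice(d, size=m, replace=False)   # fixed random subset of m coordinates
a = rng.normal(size=m)                     # Gaussian directions only for the sampled dims
b = rng.uniform(0, w)

def fast_hash(x):
    """p-stable-style bucket computed on an m-dimensional subsample: O(m) instead of O(d)."""
    return int(np.floor((a @ x[S] + b) / w))

x = rng.normal(size=d)
print(fast_hash(x))
```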

Structured and Multimodal Data

LSH is generalized to accommodate matrices, tensors, or multiple feature groups through bilinear, CP-decomposition, or tensor-train random projections. These factorized hash functions preserve collision probability guarantees while reducing the memory/computation required for high-order data, making LSH feasible for large-scale problems in imaging and signal processing (Verma et al., 11 Feb 2024, Kim et al., 2015).
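
For matrix-valued data, one bit of a bilinear SimHash-style projection might look as follows (an illustrative sketch; the shapes are arbitrary and this is not tied to a specific cited construction):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 32, 48                      # data is a p x q matrix (e.g. an image patch)

u = rng.normal(size=p)             # left projection factor
v = rng.normal(size=q)             # right projection factor

def bilinear_hash(X):
    """One bit from the bilinear form u^T X v: stores O(p + q) random parameters
    instead of the O(p * q) a full vectorized Gaussian projection would need."""
    return int(u @ X @ v > 0)

X = rng.normal(size=(p, q))
print(bilinear_hash(X))
```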

Specialized and Learned Extensions

  • Data-dependent LSH: Learns hash functions adapted to input data structure for improved empirical performance, at the cost of intensive preprocessing (Wang et al., 2014).
  • Neural LSH: Employs a learned neural hash function trained with an LSH-inspired loss to align collision probability with complex, task-specific metrics (Wang et al., 31 Jan 2024).
  • Distance-Sensitive Hashing (DSH): Generalizes LSH to allow tunable, even non-monotonic, collision probability functions over distance, supporting output-sensitive or privacy-preserving queries (Aumüller et al., 2017).

5. Applications and Empirical Performance

LSH and its variants are foundational in approximate similarity search for high-dimensional datasets across domains:

  • Vision: Near-duplicate detection and content-based image retrieval with SIFT descriptors.