Locality Sensitive Hashing (LSH) Overview
- Locality Sensitive Hashing is a randomized technique that maps similar high-dimensional data points into the same hash buckets with high probability, enabling efficient approximate nearest neighbor search.
- It relies on metric-specific LSH families, such as p-stable hashing for Euclidean distance, SimHash for cosine similarity, and MinHash for Jaccard similarity, each constructed so that similar items collide more often than dissimilar ones.
- Advanced LSH variants and distributed frameworks enhance scalability and speed, making it ideal for applications in machine learning, image retrieval, and multimodal data analysis.
Locality Sensitive Hashing (LSH) is a data-independent randomized framework for efficient approximate nearest neighbor (ANN) search in high-dimensional spaces. LSH enables sublinear query time with rigorous theoretical guarantees by hashing input items so that similar objects are mapped to the same hash bucket with higher probability than dissimilar ones. The framework has catalyzed both theoretical analysis and high-impact applications in large-scale similarity search, information retrieval, machine learning, and data mining.
1. Formal Definition and Theoretical Guarantees
Let $(\mathcal{X}, d)$ be a metric space. A family $\mathcal{H}$ of hash functions is called $(r, cr, p_1, p_2)$-sensitive for some distance threshold $r > 0$ and approximation factor $c > 1$ if for all $x, y \in \mathcal{X}$, the following holds:
- If $d(x, y) \le r$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \ge p_1$
- If $d(x, y) \ge cr$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \le p_2$, with $p_1 > p_2$.
The LSH exponent, $\rho = \ln(1/p_1)/\ln(1/p_2)$, governs the sublinear query complexity and is minimized when the gap between $p_1$ and $p_2$ is widest. Query time is $O(n^{\rho} \log n)$ and space usage is $O(n^{1+\rho})$ for $n$ data points (Wang et al., 2014).
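As a quick worked illustration of these bounds (the probabilities $p_1$, $p_2$ and data size below are arbitrary, not taken from any cited paper), the exponent and the implied query/space costs can be computed directly:

```python
import math

def lsh_exponent(p1: float, p2: float) -> float:
    """LSH exponent rho = ln(1/p1) / ln(1/p2); smaller is better."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

# Illustrative values only: a family with p1 = 0.8, p2 = 0.3
p1, p2 = 0.8, 0.3
rho = lsh_exponent(p1, p2)
n = 1_000_000  # assumed number of indexed points

print(f"rho                      ~ {rho:.3f}")
print(f"query cost  O(n^rho)     ~ {n ** rho:,.0f} candidate probes")
print(f"space       O(n^(1+rho)) ~ {n ** (1 + rho):,.0f} bucket entries")
```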
2. Canonical LSH Constructions
Several LSH families are tailored to specific metrics:
- $p$-Stable LSH for Euclidean Distance: Hash functions of the form $h_{\mathbf{a},b}(\mathbf{x}) = \lfloor (\mathbf{a} \cdot \mathbf{x} + b)/w \rfloor$, with $\mathbf{a}$ drawn from a $p$-stable distribution (Gaussian for $p = 2$) and $b \sim \mathrm{Uniform}[0, w)$, yield standard $(r, cr, p_1, p_2)$-sensitivity for the Euclidean metric (Nakanishi et al., 25 Mar 2024).
- SimHash for Cosine Similarity: $h_{\mathbf{a}}(\mathbf{x}) = \operatorname{sign}(\mathbf{a} \cdot \mathbf{x})$, with $\mathbf{a}$ drawn from the standard normal distribution. Collision probability is $1 - \theta(\mathbf{x}, \mathbf{y})/\pi$, where $\theta(\mathbf{x}, \mathbf{y})$ is the angle between $\mathbf{x}$ and $\mathbf{y}$ (Wang et al., 2014).
- MinHash for Jaccard Similarity: For sets, $h_{\pi}(A) = \min_{a \in A} \pi(a)$ for a random permutation $\pi$, with collision probability equal to the Jaccard index $J(A, B) = |A \cap B| / |A \cup B|$.
Advanced families include cross-polytope LSH (optimal for angular search), multi-probe LSH (reducing the number of required tables), and bilinear or tensorized projections for high-order structured data (Verma et al., 11 Feb 2024, Kim et al., 2015).
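To make the canonical families above concrete, here is a minimal NumPy sketch of SimHash and MinHash; the dimensions, bit counts, and sets are illustrative choices, and the estimators simply invert the collision probabilities stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(x: np.ndarray, A: np.ndarray) -> np.ndarray:
    """SimHash: sign of random Gaussian projections (one bit per row of A)."""
    return (A @ x > 0).astype(np.uint8)

def minhash(s: set, perms: np.ndarray) -> np.ndarray:
    """MinHash: minimum of each random permutation restricted to the set."""
    items = np.fromiter(s, dtype=np.int64)
    return perms[:, items].min(axis=1)

# --- SimHash: estimate the angle between two vectors from bit collisions ---
d, num_bits = 64, 256
A = rng.standard_normal((num_bits, d))
x, y = rng.standard_normal(d), rng.standard_normal(d)
collisions = np.mean(simhash(x, A) == simhash(y, A))
angle_est = np.pi * (1.0 - collisions)   # invert Pr[collision] = 1 - theta/pi
angle_true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"SimHash angle estimate {angle_est:.3f} vs true {angle_true:.3f}")

# --- MinHash: estimate Jaccard similarity of two small integer sets ---
universe, num_perms = 1000, 128
perms = np.array([rng.permutation(universe) for _ in range(num_perms)])
a, b = set(range(0, 200)), set(range(100, 300))
jaccard_est = np.mean(minhash(a, perms) == minhash(b, perms))
print(f"MinHash Jaccard estimate {jaccard_est:.3f} vs true {100/300:.3f}")
```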
3. Algorithmic Frameworks and Index Structures
The classical Indyk-Motwani LSH framework constructs $L$ hash tables, each keyed by the concatenation of $k$ base hash functions to reduce false positives among distant points. Each data point is inserted into all $L$ tables using these compound hash keys, and queries probe the buckets corresponding to the query’s hash keys in every table (Christiani, 2017).
Indexing and Query Process:
- Choose $k$ and $L$ based on data size and the LSH family parameters $(p_1, p_2)$.
- Build $L$ hash tables, each keyed by the $k$-wise concatenation of base hash functions.
- Insert each database point into the corresponding bucket of each hash table.
- For a query, compute its hash keys and retrieve candidate points from all matching buckets. Exact distances are computed to select the ANN(s).
Efficient amplification is achieved via careful parameterization; for constant query success probability, $k \approx \log_{1/p_2} n$ and $L \approx n^{\rho}$ (Wang et al., 2014, Christiani, 2017).
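The following is a minimal, self-contained sketch of this indexing and query process for Euclidean data, using Gaussian ($p$-stable, $p = 2$) base hashes; the class name and all parameter values ($k$, $L$, $w$) are illustrative, not tuned or drawn from the cited works:

```python
import numpy as np
from collections import defaultdict

class EuclideanLSHIndex:
    """Classic multi-table LSH index: L tables, each keyed by k concatenated
    p-stable (Gaussian) hashes h(x) = floor((a.x + b) / w)."""

    def __init__(self, dim, k=8, L=16, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.A = rng.standard_normal((L, k, dim))   # projection vectors
        self.B = rng.uniform(0.0, w, size=(L, k))   # random offsets
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _keys(self, x):
        # Compound key per table: tuple of k quantized projections.
        proj = np.floor((self.A @ x + self.B) / self.w).astype(np.int64)
        return [tuple(row) for row in proj]

    def insert(self, x):
        idx = len(self.points)
        self.points.append(np.asarray(x, dtype=float))
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, q, top=1):
        # Gather candidates from matching buckets, then rank by exact distance.
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, ()))
        ranked = sorted(candidates,
                        key=lambda i: np.linalg.norm(self.points[i] - q))
        return ranked[:top]

# Usage: index random points and look up an approximate nearest neighbor.
rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 32))
index = EuclideanLSHIndex(dim=32, k=8, L=16)
for p in data:
    index.insert(p)
q = data[42] + 0.01 * rng.standard_normal(32)
print("approximate NNs of perturbed point 42:", index.query(q, top=3))
```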
4. Variants, Extensions, and Distributed Architectures
Scalability and Distributed LSH
In distributed settings, classical LSH suffers from high index size ($O(n^{1+\rho})$, from the $L$ hash tables) and network costs that grow with the number of buckets probed per query. Entropy LSH reduces the number of tables by additionally querying randomly offset points near the query, but increases the number of network calls per query (Bahmani et al., 2012). Layered LSH introduces a two-level hashing scheme: data are first mapped with core $k$-wise projections $G$, and the resulting $G$-hash buckets are then grouped across machines via a second-level LSH $H$. This achieves substantially fewer network calls per query and better load balance:
| Scheme | Shuffle Size (MB) | Runtime (s) | Max Load Factor |
|---|---|---|---|
| Simple LSH | 150 | 480 | 3× |
| Layered LSH | 14 | 160 | 1.5× |
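Below is a minimal sketch of the two-level idea behind Layered LSH, under the simplifying assumption that the first-level compound key $G(x)$ is an integer vector and the second level merely assigns whole buckets to partitions; the function names and parameters are illustrative, not the scheme's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

dim, k, w = 32, 8, 4.0
A = rng.standard_normal((k, dim))   # first-level p-stable projections
b = rng.uniform(0.0, w, size=k)

def first_level_key(x):
    """Core k-wise LSH key G(x): the bucket a point falls into."""
    return np.floor((A @ x + b) / w).astype(np.int64)

num_partitions, W = 8, 16.0
a2 = rng.standard_normal(k)         # second-level projection over bucket keys

def partition_of(bucket_key):
    """Second-level LSH H(G(x)): nearby buckets map to the same partition,
    so a query contacts few machines while load stays balanced."""
    return int(np.floor((a2 @ bucket_key) / W)) % num_partitions

x = rng.standard_normal(dim)
g = first_level_key(x)
print("bucket key:", g, "-> partition", partition_of(g))
```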
Space, Hash Cost, and Acceleration
FastLSH accelerates hash computation via random subsampling of dimensions, reducing per-hash time from $O(d)$ to $O(m)$ for an $m$-sized subset of the $d$ input dimensions. Theoretical LSH guarantees (monotonicity of collision probability in distance, correct exponent $\rho$) are preserved, and empirical evaluation demonstrates up to 80x speedup in hash evaluation without loss in retrieval accuracy (Tan et al., 2023).
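A minimal sketch of this dimension-subsampling idea: each hash function touches only a random subset of $m$ of the $d$ coordinates, so evaluating it costs $O(m)$ rather than $O(d)$. The sizes below are illustrative and not the settings used by FastLSH:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, w = 1024, 32, 4.0                   # full dim, sampled dim, bucket width
S = rng.choice(d, size=m, replace=False)  # random coordinate subset (fixed per hash)
a = rng.standard_normal(m)                # projection over the sampled coordinates
b = rng.uniform(0.0, w)

def fast_hash(x: np.ndarray) -> int:
    """Per-hash cost drops from O(d) to O(m): project only the sampled coords."""
    return int(np.floor((a @ x[S] + b) / w))

x = rng.standard_normal(d)
print("hash value:", fast_hash(x))
```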
Structured and Multimodal Data
LSH is generalized to accommodate matrices, tensors, or multiple feature groups through bilinear, CP-decomposition, or tensor-train random projections. These factorized hash functions preserve collision probability guarantees while reducing the memory/computation required for high-order data, making LSH feasible for large-scale problems in imaging and signal processing (Verma et al., 11 Feb 2024, Kim et al., 2015).
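As a minimal illustration of the factorized idea for matrix-shaped inputs, the bilinear sketch below produces one hash bit from a rank-1 projection $\operatorname{sign}(\mathbf{u}^{\top} X \mathbf{v})$, storing $m + n$ parameters instead of the $mn$ a flattened projection would need; the shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 64, 48                      # matrix-shaped input, e.g. an image patch
u = rng.standard_normal(m)         # left factor  (m parameters)
v = rng.standard_normal(n)         # right factor (n parameters)

def bilinear_hash_bit(X: np.ndarray) -> int:
    """One hash bit from a rank-1 bilinear projection sign(u^T X v);
    stores m + n values instead of the m * n of a flattened projection."""
    return int(u @ X @ v > 0)

X = rng.standard_normal((m, n))
print("bilinear hash bit:", bilinear_hash_bit(X))
```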
Specialized and Learned Extensions
- Data-dependent LSH: Learns hash functions adapted to input data structure for improved empirical performance, at the cost of intensive preprocessing (Wang et al., 2014).
- Neural LSH: Employs a learned neural hash function trained with an LSH-inspired loss to align collision probability with complex, task-specific metrics (Wang et al., 31 Jan 2024).
- Distance-Sensitive Hashing (DSH): Generalizes LSH to allow tunable, even non-monotonic, collision probability functions over distance, supporting output-sensitive or privacy-preserving queries (Aumüller et al., 2017).
5. Applications and Empirical Performance
LSH and its variants are foundational in approximate similarity search for high-dimensional datasets across domains:
- Vision: Near-duplicate detection and content-based image retrieval with SIFT descriptors.