
MinHash LSH: Efficient Similarity Search

Updated 22 May 2026
  • MinHash LSH is a randomized technique that estimates Jaccard similarity between sets using hash signatures to enable efficient approximate similarity search.
  • It partitions the signature matrix into bands to amplify differences between high and low similarity pairs, ensuring rapid candidate identification for near-duplicate detection.
  • Extensions like Asymmetric Minwise Hashing, HyperMinHash, and SuperMinHash reduce bias and variance, optimizing performance for large-scale, high-dimensional applications.

MinHash LSH (Locality-Sensitive Hashing) is a randomized algorithmic framework for approximate similarity search over high-dimensional binary and set-valued data. It approximates the Jaccard similarity between sets (or the equivalent binary vectors) via random hashing and enables sublinear-time retrieval of near-duplicate or highly similar pairs. The method is widely applied in deduplication, plagiarism detection, and web-scale indexing, and it underpins both theoretical work and multiple industrial-scale systems.

1. Fundamental Principles of MinHash LSH

MinHash LSH exploits the fact that the Jaccard similarity between two sets $X, Y$ is equal to the probability that a random permutation $\pi$ of the universe $U$ produces the same minimum element in both $X$ and $Y$:

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\,, \quad \Pr[ h_\pi(X) = h_\pi(Y) ] = J(X, Y)$$

where $h_\pi(X)$ denotes the smallest element of $X$ under permutation $\pi$ (Jafari et al., 2021, Shrivastava et al., 2014).

For practical deployment, instead of true random permutations, one typically uses $k$ independent hash functions $h_j: U \rightarrow [M]$, and constructs a $k$-length signature:

$$\mathrm{sig}(X) = \left( \min_{x \in X} h_1(x),\ \ldots,\ \min_{x \in X} h_k(x) \right)$$

The Jaccard similarity is then estimated as the fraction of positions where the two signatures coincide.
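The signature construction and the estimator can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the helper names (`hash_value`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and a salted BLAKE2b digest stands in for the $k$ independent hash functions, which is an assumption made here for determinism rather than speed.

```python
import hashlib

def hash_value(element, j):
    """j-th hash function h_j: a 64-bit integer from a salted BLAKE2b digest (assumed choice)."""
    salt = j.to_bytes(8, "little")
    digest = hashlib.blake2b(str(element).encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, k):
    """k-length signature: position j holds the minimum of h_j over the set."""
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    return [min(hash_value(x, j) for x in items) for j in range(k)]

def estimate_jaccard(sig_x, sig_y):
    """Fraction of signature positions on which the two sets agree."""
    if len(sig_x) != len(sig_y):
        raise ValueError("signatures must have equal length")
    return sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)

# Two sets of integer IDs that overlap on 40 of 120 distinct elements.
X = set(range(0, 80))
Y = set(range(40, 120))
exact = len(X & Y) / len(X | Y)
estimate = estimate_jaccard(minhash_signature(X, 256), minhash_signature(Y, 256))
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

With $k$ signature positions the estimator has standard deviation $\sqrt{J(1-J)/k}$, so larger $k$ tightens the estimate at proportionally higher hashing cost.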

2. Classical Banding for Locality-Sensitive Hashing

The signature matrix is partitioned into $b$ bands of $r$ rows each ($k = b \cdot r$). For each band $i$, the $r$-tuple of signature values is treated as a key; any two sets that agree in at least one band are candidate matches. The probability that two sets with Jaccard similarity $s$ become candidates is:

$$P(s) = 1 - (1 - s^r)^b$$

This sharply amplifies the gap between high- and low-similarity pairs, enabling efficient approximate nearest neighbor search (Jafari et al., 2021, Zhu et al., 2016).
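A small sketch may make the banding step concrete. It assumes signatures of length $b \cdot r$ (with $b$ bands of $r$ rows) stored in a dictionary keyed by a set identifier; the function names are hypothetical. `candidate_probability` evaluates the standard S-curve $1 - (1 - s^r)^b$, and `lsh_candidates` buckets each band's $r$-tuple and reports every pair that shares a bucket.

```python
from collections import defaultdict
from itertools import combinations

def candidate_probability(s, b, r):
    """Probability that two sets with Jaccard similarity s agree in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b

def lsh_candidates(signatures, b, r):
    """Return the set of (id_a, id_b) pairs whose signatures agree on some band."""
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        if len(sig) != b * r:
            raise ValueError("signature length must equal b * r")
        for band in range(b):
            # The band index is part of the key so that bands never mix.
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].append(name)
    pairs = set()
    for names in buckets.values():
        for u, v in combinations(sorted(names), 2):
            pairs.add((u, v))
    return pairs

# Hand-made signatures: "a" and "b" agree on band 0 only; "c" matches nothing.
signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 9, 9, 9],
    "c": [7, 7, 7, 8, 8, 8],
}
print(lsh_candidates(signatures, b=2, r=3))
print(candidate_probability(0.8, b=20, r=5), candidate_probability(0.3, b=20, r=5))
```

With $b = 20$ and $r = 5$, a pair at similarity 0.8 is almost always a candidate while a pair at 0.3 rarely is, which is the amplification effect described above.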

3. Algorithmic Variants and Extensions

3.1 Asymmetric Minwise Hashing

Standard MinHash is biased towards smaller sets when set overlap (inner product) or containment is the desired measure. Asymmetric Minwise Hashing (MH-ALSH) removes this bias by transforming each set into a longer binary vector via asymmetric padding. With $f_x = |x|$ and $M$ an upper bound on the set sizes:

$$P(x) = [x;\ \underbrace{1, \ldots, 1}_{M - f_x};\ \underbrace{0, \ldots, 0}_{f_x}]\,, \quad Q(x) = [x;\ \underbrace{0, \ldots, 0}_{M}]$$

and further through double composition, producing $P'(x) = Q(P(x))$ and $Q'(x) = P(Q(x))$. The Jaccard resemblance after transformation is

$$R(P'(x), Q'(q)) = \frac{a}{2M - a}\,, \quad a = |x \cap q|$$

making collision probability monotonic in set overlap. This yields strictly better theoretical guarantees for sublinear search in the sparse-binary regime and dominates other LSH methods for set containment (Shrivastava et al., 2014).
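The effect of the padding can be checked on a toy example. The sketch below assumes the construction attributed to Shrivastava et al. (2014): every set is padded up to a common size $M$, with stored sets and queries placing their padding in disjoint coordinate blocks, so that the resemblance of a transformed pair with overlap $a$ is $a / (2M - a)$. Sets are represented as sets of integer indices in $[0, D)$; `D`, `M`, and the function names are illustrative assumptions.

```python
def jaccard(x, y):
    return len(x & y) / len(x | y)

def transform_data(x, D, M):
    """Stored-set transform: M - |x| padding ones in the first padding block."""
    if len(x) > M:
        raise ValueError("M must bound the set size")
    return set(x) | {D + i for i in range(M - len(x))}

def transform_query(q, D, M):
    """Query transform: M - |q| padding ones in the second, disjoint padding block."""
    if len(q) > M:
        raise ValueError("M must bound the set size")
    return set(q) | {D + M + i for i in range(M - len(q))}

D, M = 20, 10
query = {0, 1, 2, 3}
small = {0, 1, 2}                 # overlap 3 with the query
large = {0, 1, 2, 3, 4, 5, 6, 7}  # overlap 4 with the query

# Plain Jaccard prefers the smaller set even though its overlap is lower.
print(jaccard(small, query), jaccard(large, query))

# After padding, the ranking follows the overlap: a / (2M - a).
tq = transform_query(query, D, M)
print(jaccard(transform_data(small, D, M), tq), jaccard(transform_data(large, D, M), tq))
```

Because both transformed vectors always contain exactly $M$ ones, the union size depends only on the overlap, which is why the ranking after transformation no longer favors small sets.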

3.2 Generalization to Probability Distributions

For positive vectors or probability distributions $x, y$, MinHash LSH has been extended to a generalized similarity measure:

$$J_P(x, y) = \sum_{i:\, x_i y_i > 0} \frac{1}{\sum_j \max\left( \frac{x_j}{x_i}, \frac{y_j}{y_i} \right)}$$

A sampled hash $h$ satisfies $\Pr[h(x) = h(y)] = J_P(x, y)$, and the definition reduces exactly to set-Jaccard in the binary case (Moulton et al., 2018). Two algorithms are provided:

  • For sparse vectors: generate exponential weights and pick argmin.
  • For dense/continuous distributions: A*-sampling over a proposal measure.

This extension is scale-invariant and more sensitive to support differences than earlier weighted MinHash schemes.
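The sparse-vector algorithm (exponential weights plus argmin) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: vectors are dictionaries from index to positive weight, a salted BLAKE2b digest supplies the shared per-index randomness, and `jp_similarity` evaluates the measure $J_P(x, y) = \sum_{i: x_i y_i > 0} 1 / \sum_j \max(x_j / x_i,\, y_j / y_i)$, which is the form of the generalized similarity assumed here. All function names are hypothetical.

```python
import hashlib
import math

def shared_uniform(seed, index):
    """Pseudo-uniform value in (0, 1] that depends only on (seed, index), not on the vector."""
    digest = hashlib.blake2b(str(index).encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return (int.from_bytes(digest, "big") + 1) / (2 ** 64 + 1)

def p_minhash(weights, seed):
    """Return argmin_i E_i / w_i, where E_i = -ln(U_i) is an Exp(1) variable shared per index."""
    best, best_val = None, math.inf
    for i, w in weights.items():
        if w <= 0:
            continue
        val = -math.log(shared_uniform(seed, i)) / w
        if val < best_val:
            best, best_val = i, val
    return best

def jp_similarity(x, y):
    """Exact value of the assumed generalized similarity J_P."""
    keys = set(x) | set(y)
    total = 0.0
    for i in keys:
        xi, yi = x.get(i, 0.0), y.get(i, 0.0)
        if xi > 0 and yi > 0:
            total += 1.0 / sum(max(x.get(j, 0.0) / xi, y.get(j, 0.0) / yi) for j in keys)
    return total

x = {0: 0.5, 1: 0.3, 2: 0.2}
y = {0: 0.2, 1: 0.3, 3: 0.5}
trials = 2000
collisions = sum(p_minhash(x, s) == p_minhash(y, s) for s in range(trials))
print(jp_similarity(x, y), collisions / trials)
```

Scaling a vector by a constant leaves every ratio $E_i / w_i$ in the same order, so the sampled index is unchanged, which is the scale invariance noted above.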

3.3 Sub-logarithmic Space: HyperMinHash

Standard MinHash requires $O(\log n)$ bits per hash. HyperMinHash reduces this to $O(\log \log n)$ by a floating-point encoding of the minimum ("exponent" plus "mantissa"), providing mergeability and enabling Jaccard estimation with $O(\epsilon^{-2}(\log \log n + \log(1/(t\epsilon))))$ space for target Jaccard $t$ and error $\epsilon$. HyperMinHash supports streaming updates and unions, handling sets of size up to $10^{19}$ with moderate memory on commodity hardware (Yu et al., 2017).
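The "exponent plus mantissa" idea can be illustrated with a simplified register encoding. This sketch is an assumption-laden illustration, not the register layout from Yu et al. (2017): a 64-bit minimum hash is compressed to a leading-zero count stored in `q` bits plus the `r` bits that follow, and two registers are merged by keeping the one that represents the smaller hash. The names `encode_register`, `order_key`, and `merge`, and the defaults `q=6`, `r=10`, are hypothetical.

```python
def encode_register(h, hash_bits=64, q=6, r=10):
    """Lossy (exponent, mantissa) encoding of a minimum hash value 0 <= h < 2**hash_bits.

    exponent: number of leading zero bits, capped at 2**q - 1 so it fits in q bits.
    mantissa: the r bits after the leading one, or after the capped zero run.
    """
    max_exp = (1 << q) - 1
    if max_exp > hash_bits:
        raise ValueError("q is too large for the hash width")
    leading_zeros = hash_bits - h.bit_length()
    ext = h << r                      # append r zero bits so a full mantissa always exists
    mask = (1 << r) - 1
    if leading_zeros < max_exp:
        exponent = leading_zeros
        mantissa = (ext >> (hash_bits - leading_zeros - 1)) & mask
    else:
        exponent = max_exp
        mantissa = (ext >> (hash_bits - max_exp)) & mask
    return exponent, mantissa

def order_key(register):
    """Smaller hashes have more leading zeros, then a smaller mantissa."""
    exponent, mantissa = register
    return (-exponent, mantissa)

def merge(reg_a, reg_b):
    """Register for the union of two sets: keep the encoding of the smaller minimum."""
    return min(reg_a, reg_b, key=order_key)

print(encode_register(1 << 63), encode_register(12345), encode_register(0))
```

Each register here occupies `q + r = 16` bits instead of 64, at the price of treating hashes that share an exponent and mantissa as equal; the encoding preserves the order of hash values, which is what keeps per-register minima mergeable.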

3.4 Variance Reduction: SuperMinHash

SuperMinHash introduces negative dependence among signature coordinates for further variance reduction. For union sizes YY2, the variance factor YY3 yields up to YY4 tighter confidence intervals for the Jaccard estimator and accelerates signature generation, especially when YY5 (Ertl, 2017).

3.5 Braun-Blanquet Similarity and Chosen Path LSH

MinHash is suboptimal on equal-size sets or for Braun-Blanquet similarity. The Chosen Path scheme achieves a lower $\rho$ parameter by enforcing collision properties tuned directly for $B(X, Y) = |X \cap Y| / \max(|X|, |Y|)$, outperforming MinHash, especially when all sets have the same size (Christiani et al., 2016).
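For reference, the two similarity measures can be compared directly. The sketch below uses the standard definition of Braun-Blanquet similarity, $|X \cap Y| / \max(|X|, |Y|)$, and assumes non-empty sets; the function names are illustrative. For two sets of equal size the measures are tied by $J = B / (2 - B)$.

```python
def jaccard(x, y):
    return len(x & y) / len(x | y)

def braun_blanquet(x, y):
    """Intersection size relative to the larger of the two sets."""
    return len(x & y) / max(len(x), len(y))

# Two sets of equal size 10 sharing 5 elements.
x = set(range(0, 10))
y = set(range(5, 15))
b_sim = braun_blanquet(x, y)
j_sim = jaccard(x, y)
print(b_sim, j_sim, b_sim / (2 - b_sim))
```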

4. Application Domains and Large-Scale Frameworks

MinHash LSH forms the basis of scalable deduplication and similarity search infrastructure:

  • Dataset Deduplication : FED accelerates MinHash LSH dramatically on GPU clusters using a rolling 32-bit hash and pipelined kernels, maintaining consistent deduplication quality (YY9) at up to J(X,Y)=XYXY,Pr[hπ(X)=hπ(Y)]=J(X,Y)J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\,, \quad \Pr[ h_\pi(X) = h_\pi(Y) ] = J(X, Y)0 speedups compared to optimized CPU baselines. Hash evaluation and band grouping become J(X,Y)=XYXY,Pr[hπ(X)=hπ(Y)]=J(X,Y)J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\,, \quad \Pr[ h_\pi(X) = h_\pi(Y) ] = J(X, Y)1 per shingle, supporting near-interactive deduplication of trillion-token corpora (Son et al., 2 Jan 2025).
  • Internet-Scale Domain Search : LSH Ensemble indexes massive sets using MinHash sketches and partitioned LSH tables, supporting set containment queries robust to domain size skew typical of web-scale data. Equi-depth partitioning approximates optimality for power-law size distributions (Zhu et al., 2016).
  • Text, Graph, and Malware Clustering : Application examples cover malware clustering and name deduplication, where MinHash (with banding) achieves order-of-magnitude speedups and high recall (Jafari et al., 2021).
  • Cardinality and Unions : HyperMinHash enables efficient join cardinality and Jaccard estimation for data streams and multi-set unions under memory constraints (Yu et al., 2017).

5. Theoretical Performance: Collision Probabilities and ρ-values

The efficiency of MinHash LSH is fundamentally characterized by the $(s_1, s_2, p_1, p_2)$-sensitivity framework, in which pairs with similarity at least $s_1$ collide with probability at least $p_1$ and pairs with similarity at most $s_2$ collide with probability at most $p_2$:

$$\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$$

For MinHash under Jaccard, $p_1 = s_1$, $p_2 = s_2$. For cosine similarity (binary vectors), crucial inequalities link the MinHash and SimHash probability curves. With $\mathcal{R}$ the Jaccard resemblance and $\mathcal{S}$ the cosine similarity of the same pair:

$$\mathcal{S}^2 \le \mathcal{R} \le \frac{\mathcal{S}}{2 - \mathcal{S}}$$

yielding, for cosine thresholds $S_0$ and $cS_0$ with $c < 1$, $\rho = \frac{\log S_0^2}{\log \frac{cS_0}{2 - cS_0}}$, which is strictly smaller than the cosine-LSH $\rho$ of SimHash for high similarity (Shrivastava et al., 2014).

For set containment and overlap, Asymmetric Minwise Hashing and Chosen Path schemes yield strictly improved $\rho$-values, particularly for sparse or equal-size input sets (Shrivastava et al., 2014, Christiani et al., 2016).
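The comparison of exponents can be made concrete numerically. The sketch below assumes the usual definition $\rho = \log(1/p_1) / \log(1/p_2)$, the standard SimHash collision probability $1 - \theta/\pi$ for angle $\theta$, and, for MinHash applied to binary vectors under cosine similarity, the bounds $\mathcal{S}^2 \le \mathcal{R} \le \mathcal{S}/(2 - \mathcal{S})$ relating resemblance $\mathcal{R}$ to cosine $\mathcal{S}$. The function names and the thresholds $S_0 = 0.9$, $c = 0.5$ are illustrative choices.

```python
import math

def rho(p1, p2):
    """LSH exponent for collision probabilities p1 (similar pairs) and p2 (dissimilar pairs)."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

def rho_minhash_jaccard(s1, s2):
    """MinHash collides with probability equal to the Jaccard similarity."""
    return rho(s1, s2)

def rho_minhash_cosine(s0, c):
    """MinHash on binary vectors, using worst-case bounds in terms of cosine similarity."""
    p1 = s0 ** 2                      # lower bound on resemblance when cosine >= s0
    p2 = c * s0 / (2.0 - c * s0)      # upper bound on resemblance when cosine <= c * s0
    return rho(p1, p2)

def rho_simhash(s0, c):
    """Sign-random-projection (SimHash) exponent for the same cosine thresholds."""
    p1 = 1.0 - math.acos(s0) / math.pi
    p2 = 1.0 - math.acos(c * s0) / math.pi
    return rho(p1, p2)

s0, c = 0.9, 0.5
print(rho_minhash_cosine(s0, c), rho_simhash(s0, c))
```

A smaller $\rho$ translates into fewer hash tables and faster queries for the same recall, so at these thresholds the MinHash bound is the more favorable of the two.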

6. Implementation, Parameter Selection, and Practical Considerations

Efficient implementation of MinHash LSH leverages the following key aspects:

  • Signature Construction: For set $X$, $k$-length signatures via minimum hash values; computational cost is $O(k|X|)$.
  • Banding Parameters: Choice of hπ(X)h_\pi(X)3 bands and hπ(X)h_\pi(X)4 rows offers a tunable trade-off between recall and false positives. Empirical settings are typically hπ(X)h_\pi(X)5, hπ(X)h_\pi(X)6, hπ(X)h_\pi(X)7 (Jafari et al., 2021).
  • GPU Optimization: Optimized GPU pipelines exploit rolling (non-cryptographic) hash functions and parallel signature computation (Son et al., 2 Jan 2025).
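One common heuristic for picking the banding parameters is to place the steep part of the candidate curve $1 - (1 - s^r)^b$ near the desired similarity threshold, using the approximation $s^* \approx (1/b)^{1/r}$. The sketch below enumerates the factorizations of a signature length $k$ into $b$ bands of $r$ rows and returns the one whose approximate threshold is closest to a target. The function names are hypothetical, and the heuristic is a textbook rule of thumb rather than a setting taken from the cited papers.

```python
def candidate_threshold(b, r):
    """Approximate similarity at which the candidate probability rises steeply."""
    return (1.0 / b) ** (1.0 / r)

def choose_bands(k, target):
    """Pick (b, r) with b * r == k whose approximate threshold is closest to target."""
    best = None
    for r in range(1, k + 1):
        if k % r:
            continue                  # only exact factorizations of the signature length
        b = k // r
        gap = abs(candidate_threshold(b, r) - target)
        if best is None or gap < best[0]:
            best = (gap, b, r)
    return best[1], best[2]

for target in (0.5, 0.7):
    b, r = choose_bands(128, target)
    print(target, b, r, round(candidate_threshold(b, r), 3))
```

More rows per band push the threshold up and reduce false positives; more bands pull it down and raise recall, which is the trade-off named in the list above.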

For probability distributions, the P-MinHash algorithm for sparse or dense data ensures hπ(X)h_\pi(X)8 or optimal expected iterations (Moulton et al., 2018). In all cases, for high-dimensional and sparse data, MinHash LSH and its variants are broadly preferred due to empirical and theoretical performance advantages.

7. Relationship to Other Similarity Measures and LSH Families

MinHash LSH specializes in Jaccard similarity and outperforms SimHash for cosine similarity on sparse/binary data. It is not optimal for all similarity metrics; for Braun-Blanquet similarity or strictly equal-size sets, dedicated schemes achieve strictly better performance. MinHash LSH remains foundational, however, and is extensible both to weighted/probability-vector inputs and compressed or mergeable sketches (Shrivastava et al., 2014, Moulton et al., 2018, Yu et al., 2017).

A plausible implication is that, unless attention is restricted to dense, real-valued data or specific alternative similarity metrics, MinHash LSH is canonical and usually optimal for large-scale approximate set similarity search, both in theory and in current system deployments.
