MinHash LSH: Efficient Similarity Search
- MinHash LSH is a randomized technique that estimates Jaccard similarity between sets using hash signatures to enable efficient approximate similarity search.
- It partitions the signature matrix into bands to amplify differences between high and low similarity pairs, ensuring rapid candidate identification for near-duplicate detection.
- Extensions such as Asymmetric Minwise Hashing, HyperMinHash, and SuperMinHash address bias toward small sets, sketch size, and estimator variance, respectively, for large-scale, high-dimensional applications.
MinHash LSH (Locality Sensitive Hashing) is a randomized algorithmic framework for approximate similarity search in high-dimensional binary and set-valued data. At its core, MinHash LSH approximates the Jaccard similarity between sets (or equivalent binary vectors) via random hashing and enables sublinear-time retrieval of near-duplicate or highly similar pairs. The method is widely applied in deduplication, plagiarism detection, and web-scale indexing, and is foundational both in theory and in multiple industrial-scale systems.
1. Fundamental Principles of MinHash LSH
MinHash LSH exploits the fact that the Jaccard similarity $J(A,B) = |A \cap B| / |A \cup B|$ between two sets is equal to the probability that a random permutation $\pi$ of the universe produces the same minimum element in both $A$ and $B$:
$$\Pr_{\pi}\left[\min \pi(A) = \min \pi(B)\right] = J(A,B),$$
where $\min \pi(A)$ denotes the smallest element of $A$ under permutation $\pi$ (Jafari et al., 2021, Shrivastava et al., 2014).
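For practical deployment, instead of true random permutations, one typically uses $k$ independent hash functions $h_1, \dots, h_k$, and constructs a $k$-length signature:
$$\mathrm{sig}(A) = \left(\min_{a \in A} h_1(a),\ \min_{a \in A} h_2(a),\ \dots,\ \min_{a \in A} h_k(a)\right)$$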
The Jaccard similarity is then estimated as the fraction of positions where the two signatures coincide.
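The signature construction and the matching-fraction estimator can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names (`minhash_signature`, `estimate_jaccard`) are hypothetical, and seeded BLAKE2b digests stand in for the independent hash functions.

```python
import hashlib


def _seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of `item` under `seed` (assumed stand-in for one hash function)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, k=128):
    """Return a k-length signature: the minimum seeded hash value per hash function."""
    items = list(items)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two signatures coincide."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

The estimate is an average of signature-position matches, so its standard error shrinks roughly as one over the square root of the signature length.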
2. Classical Banding for Locality-Sensitive Hashing
The signature matrix is partitioned into $b$ bands of $r$ rows each ($k = b \cdot r$). For each band $i$, the $r$-tuple is treated as a key; any two sets that agree in a band are candidate matches. The probability that two sets with Jaccard similarity $s$ become candidates is:
$$P(s) = 1 - \left(1 - s^{r}\right)^{b}$$
This sharply amplifies the gap between high- and low-similarity pairs, enabling efficient approximate nearest neighbor search (Jafari et al., 2021, Zhu et al., 2016).
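The banding step can be sketched as a bucket index keyed by (band index, band contents). The function names below are hypothetical, `b` is the number of bands and `r` the rows per band, and the signatures are assumed to be precomputed lists of length `b * r`.

```python
from collections import defaultdict
from itertools import combinations


def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def lsh_candidates(signatures: dict, b: int, r: int) -> set:
    """Return the pairs of keys whose signatures agree on every row of some band."""
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        if len(sig) != b * r:
            raise ValueError("signature length must equal b * r")
        for band in range(b):
            # The r-tuple of this band, tagged with the band index, is the bucket key.
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].append(name)

    pairs = set()
    for names in buckets.values():
        for x, y in combinations(sorted(names), 2):
            pairs.add((x, y))
    return pairs
```

With 20 bands of 5 rows, for instance, `candidate_probability` is close to 1 at similarity 0.8 and below 0.01 at similarity 0.2, which is the amplification the text describes.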
3. Algorithmic Variants and Extensions
3.1 Asymmetric Minwise Hashing
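Standard MinHash is biased towards smaller sets when set overlap (inner product) or containment is the desired measure. Asymmetric Minwise Hashing (MH-ALSH) removes this bias by transforming each set into longer binary vectors via asymmetric padding. With $f_x = |x|$ and $M$ an upper bound on the set sizes,
$$P(x) = [\,x;\ \underbrace{1, \dots, 1}_{M - f_x};\ \underbrace{0, \dots, 0}_{f_x}\,], \qquad Q(q) = [\,q;\ \underbrace{0, \dots, 0}_{M}\,]$$
and further through double composition, producing $P'(x)$ and $Q'(q)$, in which data and query are each padded to $M$ ones using disjoint padding coordinates. The Jaccard resemblance after transformation is
$$R\left(P'(x), Q'(q)\right) = \frac{a}{2M - a}, \qquad a = |x \cap q|,$$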
making collision probability monotonic in set overlap. This yields strictly better theoretical guarantees for sublinear search in the sparse-binary regime and dominates other LSH methods for set containment (Shrivastava et al., 2014).
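A small sketch can make the effect of the padding concrete. It assumes the following reading of the construction, which is an assumption rather than a quotation of the paper: every data set and every query is padded to exactly `M` elements, using two disjoint pools of dummy elements, so the transformed resemblance becomes `a / (2M - a)` for overlap `a`. All function names are hypothetical.

```python
def pad_data(x, M):
    """Pad a data set to M elements with dummy items from the data-side pool."""
    x = set(x)
    if len(x) > M:
        raise ValueError("M must be at least the largest set size")
    return x | {("data_pad", i) for i in range(M - len(x))}


def pad_query(q, M):
    """Pad a query set to M elements with dummy items from the query-side pool."""
    q = set(q)
    if len(q) > M:
        raise ValueError("M must be at least the largest set size")
    return q | {("query_pad", i) for i in range(M - len(q))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    M = 10
    query = {1, 2, 3}
    large = {1, 2, 3, 4, 5, 6}   # overlap 3 with the query
    small = {1, 2}               # overlap 2 with the query

    # Plain Jaccard ranks the smaller set first despite its smaller overlap.
    print(jaccard(query, large), jaccard(query, small))
    # After padding, the ranking follows the overlap.
    print(jaccard(pad_query(query, M), pad_data(large, M)),
          jaccard(pad_query(query, M), pad_data(small, M)))
```

Because the two padding pools never intersect, padding leaves the intersection unchanged and fixes the union size at `2M - a`, so ordinary MinHash applied to the padded sets collides with a probability that depends only on the overlap.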
3.2 Generalization to Probability Distributions
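For positive vectors or probability distributions $x, y$, MinHash LSH has been extended to a generalized similarity measure:
$$J_P(x, y) = \sum_{\{i \,:\, x_i y_i > 0\}} \frac{1}{\sum_{j} \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}$$
A sampled hash $h$ satisfies $\Pr[h(x) = h(y)] = J_P(x, y)$, and the definition reduces exactly to set-Jaccard in the binary case (Moulton et al., 2018). Two algorithms are provided: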
- For sparse vectors: generate exponential weights and pick argmin.
- For dense/continuous distributions: A*-sampling over a proposal measure.
This extension is scale-invariant and more sensitive to support differences than earlier weighted MinHash schemes.
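The sparse-vector variant can be sketched as follows, assuming the construction in which each coordinate draws a shared exponential variate that is divided by the coordinate's weight, and the hash is the index of the smallest result. The helper names are hypothetical, and BLAKE2b digests are used as an assumed source of shared uniforms. The function `jaccard_p` evaluates the generalized similarity directly so the collision rate can be compared against it.

```python
import hashlib
import math


def _shared_uniform(seed: int, index) -> float:
    """Pseudo-random uniform in (0, 1], shared by all vectors for a given (seed, index)."""
    digest = hashlib.blake2b(f"{seed}:{index}".encode("utf-8"), digest_size=8).digest()
    return (int.from_bytes(digest, "little") + 1) / (2 ** 64 + 2)


def p_minhash(x: dict, seed: int):
    """Return the index i minimizing -ln(U_i) / x_i over the positive entries of x."""
    support = [i for i, w in x.items() if w > 0]
    if not support:
        raise ValueError("vector has no positive entries")
    return min(support, key=lambda i: -math.log(_shared_uniform(seed, i)) / x[i])


def jaccard_p(x: dict, y: dict) -> float:
    """Generalized Jaccard: sum over shared support of 1 / sum_j max(x_j/x_i, y_j/y_i)."""
    keys = set(x) | set(y)
    total = 0.0
    for i in keys:
        xi, yi = x.get(i, 0.0), y.get(i, 0.0)
        if xi > 0 and yi > 0:
            denom = sum(max(x.get(j, 0.0) / xi, y.get(j, 0.0) / yi) for j in keys)
            total += 1.0 / denom
    return total


def collision_rate(x: dict, y: dict, trials: int = 2000) -> float:
    """Empirical fraction of seeds on which the two vectors receive the same hash."""
    hits = sum(p_minhash(x, s) == p_minhash(y, s) for s in range(trials))
    return hits / trials
```

For 0/1 indicator vectors `jaccard_p` coincides with the ordinary set Jaccard similarity, and rescaling either argument leaves both the similarity and the sampled hash unchanged.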
3.3 Sub-logarithmic Space: HyperMinHash
Standard MinHash requires $O(\log n)$ bits per hash. HyperMinHash reduces this to $O(\log \log n)$ by a floating-point encoding of the minimum ("exponent" plus "mantissa"), providing mergeability and enabling Jaccard estimation with $O\left(\epsilon^{-2}\left(\log \log n + \log \frac{1}{t\epsilon}\right)\right)$ space for target Jaccard $t$ and relative error $\epsilon$. HyperMinHash supports streaming updates and unions, handling sets of size up to $10^{19}$ with moderate memory on commodity hardware (Yu et al., 2017).
3.4 Variance Reduction: SuperMinHash
SuperMinHash introduces negative dependence among signature coordinates for further variance reduction. For union sizes $u = |A \cup B| > 1$, the estimator variance is scaled by a factor $\alpha(k, u) < 1$ relative to standard MinHash, which yields tighter confidence intervals for the Jaccard estimator; the scheme also accelerates signature generation (Ertl, 2017).
3.5 Braun-Blanquet Similarity and Chosen Path LSH
MinHash is suboptimal on equal-size sets or for Braun-Blanquet similarity, $B(A, B) = |A \cap B| / \max(|A|, |B|)$. The Chosen Path scheme achieves a lower $\rho$ parameter by enforcing collision properties tuned directly for $B$, outperforming MinHash, especially when all sets have the same size (Christiani et al., 2016).
4. Application Domains and Large-Scale Frameworks
MinHash LSH forms the basis of scalable deduplication and similarity search infrastructure:
- Dataset Deduplication: FED accelerates MinHash LSH on GPU clusters using a rolling 32-bit hash and pipelined kernels, maintaining deduplication quality consistent with CPU pipelines while delivering large speedups over optimized CPU baselines. Hash evaluation and band grouping become $O(1)$ per shingle, supporting near-interactive deduplication of trillion-token corpora (Son et al., 2 Jan 2025).
- Internet-Scale Domain Search: LSH Ensemble indexes massive sets using MinHash sketches and partitioned LSH tables, supporting set containment queries robust to domain size skew typical of web-scale data. Equi-depth partitioning approximates optimality for power-law size distributions (Zhu et al., 2016).
- Text, Graph, and Malware Clustering: Application examples cover malware clustering and name deduplication, where MinHash (with banding) achieves order-of-magnitude speedups and high recall (Jafari et al., 2021).
- Cardinality and Unions: HyperMinHash enables efficient join cardinality and Jaccard estimation for data streams and multi-set unions under memory constraints (Yu et al., 2017).
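The rolling-hash idea mentioned for deduplication pipelines can be illustrated with a generic polynomial rolling hash over byte shingles. This is not FED's hash function: the modulus, base, and function names below are assumptions chosen for illustration. The point is only that each new shingle hash is derived from the previous one in constant time.

```python
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)
BASE = 257            # assumed base, larger than any byte value


def direct_hash(chunk: bytes) -> int:
    """Polynomial hash of a single shingle, computed from scratch."""
    h = 0
    for byte in chunk:
        h = (h * BASE + byte) % MOD
    return h


def rolling_shingle_hashes(text: str, width: int) -> list:
    """Hash every contiguous byte shingle of length `width`, updating in O(1) per step."""
    data = text.encode("utf-8")
    if width <= 0 or len(data) < width:
        return []
    top = pow(BASE, width - 1, MOD)      # weight of the byte that leaves the window
    h = direct_hash(data[:width])
    hashes = [h]
    for i in range(width, len(data)):
        # Remove the outgoing byte, shift the window, and append the incoming byte.
        h = ((h - data[i - width] * top) * BASE + data[i]) % MOD
        hashes.append(h)
    return hashes
```

The resulting shingle hashes can then be fed to a MinHash signature routine in place of the raw shingles.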
5. Theoretical Performance: Collision Probabilities and ρ-values
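The efficiency of MinHash LSH is fundamentally characterized by the $(S_0, cS_0, p_1, p_2)$-sensitivity framework, in which a hash family satisfies
$$\mathrm{sim}(x, y) \ge S_0 \Rightarrow \Pr[h(x) = h(y)] \ge p_1, \qquad \mathrm{sim}(x, y) \le cS_0 \Rightarrow \Pr[h(x) = h(y)] \le p_2,$$
with query-time exponent $\rho = \log p_1 / \log p_2$.
For MinHash under Jaccard, $p_1 = S_0$, $p_2 = cS_0$. For cosine similarity $S$ (binary vectors), crucial inequalities relate $S$ to the resemblance $R$ and thereby link the MinHash and SimHash probability curves:
$$S^2 \le R \le \frac{S}{2 - S},$$
yielding $\rho = \log\left(S_0^2\right) / \log\left(\frac{cS_0}{2 - cS_0}\right)$, which is strictly smaller than the cosine-LSH $\rho$ of SimHash for high similarity (Shrivastava et al., 2014).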
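A short numerical comparison can make the claim about exponents concrete. The sketch below assumes two formulas: the worst-case MinHash bound for cosine similarity, obtained from $S^2 \le R \le S/(2-S)$, and the standard SimHash collision probability $1 - \arccos(S)/\pi$. The function names and the example thresholds are hypothetical.

```python
import math


def rho_minhash_cosine(s0: float, c: float) -> float:
    """Worst-case MinHash exponent for cosine similarity: p1 = S0^2, p2 = cS0 / (2 - cS0)."""
    p1 = s0 ** 2
    p2 = (c * s0) / (2.0 - c * s0)
    return math.log(p1) / math.log(p2)


def rho_simhash(s0: float, c: float) -> float:
    """SimHash exponent, using collision probability 1 - arccos(S) / pi."""
    p1 = 1.0 - math.acos(s0) / math.pi
    p2 = 1.0 - math.acos(c * s0) / math.pi
    return math.log(p1) / math.log(p2)


if __name__ == "__main__":
    for s0 in (0.9, 0.7, 0.5, 0.3):
        print(s0, rho_minhash_cosine(s0, 0.5), rho_simhash(s0, 0.5))
```

Under these assumptions the MinHash exponent is smaller at a high threshold such as 0.9, while at a low threshold such as 0.3 the worst-case bound no longer favors MinHash, which matches the "for high similarity" qualifier in the text.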
For set containment and overlap, Asymmetric Minwise Hashing and Chosen Path schemes yield strictly improved $\rho$-values, particularly for sparse or equal-size input sets (Shrivastava et al., 2014, Christiani et al., 2016).
6. Implementation, Parameter Selection, and Practical Considerations
Efficient implementation of MinHash LSH leverages the following key aspects:
- Signature Construction: For a set $A$, a $k$-length signature is built from minimum hash values; the computational cost is $O(k\,|A|)$.
- Banding Parameters: The choice of $b$ bands and $r$ rows offers a tunable trade-off between recall and false positives. Typical empirical values of these parameters are reported in (Jafari et al., 2021).
- GPU Optimization: Optimized GPU pipelines exploit rolling (non-cryptographic) hash functions and parallel signature computation (Son et al., 2 Jan 2025).
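One common heuristic for picking banding parameters, offered here as an assumption rather than as the procedure used in the cited work, is to place the steepest part of the candidate-probability curve near the desired similarity threshold. That point is approximately `(1/b) ** (1/r)` for `b` bands of `r` rows. The function name `choose_bands` is hypothetical.

```python
def choose_bands(k: int, threshold: float):
    """Pick (b, r) with b * r == k whose approximate threshold (1/b)**(1/r) is closest."""
    if k <= 0 or not 0.0 < threshold < 1.0:
        raise ValueError("need k > 0 and 0 < threshold < 1")
    best = None
    for r in range(1, k + 1):
        if k % r:
            continue                      # only exact factorizations of k
        b = k // r
        approx = (1.0 / b) ** (1.0 / r)   # similarity where the S-curve rises fastest
        err = abs(approx - threshold)
        if best is None or err < best[0]:
            best = (err, b, r)
    return best[1], best[2]


if __name__ == "__main__":
    for t in (0.5, 0.8):
        print(t, choose_bands(128, t))
```

More rows per band push the threshold up and cut false positives; more bands pull it down and raise recall, so the search simply scans the factorizations of the signature length.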
For probability distributions, the P-MinHash algorithms for sparse and dense data come with guarantees on running time or on the expected number of sampling iterations (Moulton et al., 2018). For high-dimensional, sparse data, MinHash LSH and its variants are broadly preferred because of their empirical and theoretical performance advantages.
7. Relationship to Other Similarity Measures and LSH Families
MinHash LSH specializes in Jaccard similarity and outperforms SimHash for cosine similarity on sparse/binary data. It is not optimal for all similarity metrics; for Braun-Blanquet similarity or strictly equal-size sets, dedicated schemes achieve strictly better performance. MinHash LSH remains foundational, however, and is extensible both to weighted/probability-vector inputs and compressed or mergeable sketches (Shrivastava et al., 2014, Moulton et al., 2018, Yu et al., 2017).
A plausible implication is that, unless attention is restricted to dense, real-valued data or specific alternative similarity metrics, MinHash LSH is canonical and usually optimal for large-scale approximate set similarity search, both in theory and in current system deployments.