MinHash Locality Sensitive Hashing (LSH)
- MinHash LSH is a randomized algorithm enabling approximate similarity search in large datasets through hash-based signatures and the Jaccard index.
- This technique utilizes banding and hashing strategies to efficiently search and retrieve similar data points, important for deduplication and clustering.
- Applications of MinHash LSH span document retrieval, clustering, and more, facilitating rapid data processing and analysis in high-dimensional spaces.
MinHash Locality Sensitive Hashing (LSH) is a family of randomized algorithms foundational for approximate similarity search in high-dimensional discrete and numerical domains. The MinHash LSH framework maps input objects—traditionally sets, and more generally weighted or probabilistic objects—into lower-dimensional signatures such that the probability of hash collisions reflects a specific similarity measure (Jaccard or its extensions). This property enables sublinear-time candidate retrieval in large databases, making MinHash LSH a critical tool in deduplication, clustering, large-scale information retrieval, and modern scientific domains such as topological data analysis.
1. Foundations: Jaccard Similarity, MinHash, and LSH Structure
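MinHash LSH originates from the relationship between Jaccard similarity and hash-based estimation. For sets $A, B \subseteq U$, Jaccard similarity is
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
The canonical MinHash algorithm selects a random permutation $\pi$ of $U$, defining $h_\pi(A) = \min_{a \in A} \pi(a)$. The central property is
$$\Pr[h_\pi(A) = h_\pi(B)] = J(A, B),$$
since the minimum of $A \cup B$ under a random order lies in $A \cap B$ with probability $|A \cap B| / |A \cup B|$ (Jafari et al., 2021, Wang et al., 2014).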
For practical scale, $k$ independent MinHash functions, each generated by an independent random permutation (or universal hash function), produce signature vectors of length $k$. The fraction of agreeing coordinates across two signatures is an unbiased estimator of $J(A, B)$.
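The signature-and-estimate step can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes set elements are non-negative integers below 2^64 and replaces true random permutations with a seeded splitmix64-style mixing function. The names `mix64`, `make_seeds`, `minhash_signature`, and `estimate_jaccard` are hypothetical helpers introduced here.

```python
import random

MASK64 = (1 << 64) - 1


def mix64(z):
    """splitmix64-style finalizer: a bijective mixing function on 64-bit ints."""
    z = (z + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def make_seeds(k, seed=0):
    """Draw k 64-bit seeds; each seed stands in for one random permutation."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(k)]


def minhash_signature(items, seeds):
    """Length-k signature: the minimum mixed value of the set under each seed."""
    return [min(mix64(x ^ s) for x in items) for s in seeds]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature coordinates on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


A = set(range(0, 80))
B = set(range(40, 120))
seeds = make_seeds(k=512, seed=7)
estimate = estimate_jaccard(minhash_signature(A, seeds), minhash_signature(B, seeds))
print(f"exact={jaccard(A, B):.3f} estimate={estimate:.3f}")
```

With $k$ coordinates the estimator's standard deviation is roughly $\sqrt{J(1-J)/k}$, so longer signatures trade storage for accuracy.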
To enable sublinear candidate retrieval, banding LSH organizes the $k$-row signatures into $b$ bands of $r$ rows each, with $k = br$. Within each band, the $r$-tuple serves as a bucket key, so two objects become a candidate pair with probability
$$\Pr[\text{candidate}] = 1 - (1 - s^r)^b,$$
where $s$ is the true Jaccard similarity. This forms an S-curve, sharply discriminating object pairs above or below a selected threshold (Wang et al., 2014).
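A small sketch of banding, assuming signatures are plain Python lists whose length equals the number of bands times the rows per band. The function names are hypothetical, and the all-pairs loop inside each bucket is only suitable for illustration.

```python
from collections import defaultdict


def candidate_probability(s, b, r):
    """Probability that a pair with similarity s shares at least one band bucket."""
    return 1.0 - (1.0 - s ** r) ** b


def band_keys(signature, b, r):
    """Split a length b*r signature into b bucket keys of r rows each."""
    assert len(signature) == b * r
    return [(i, tuple(signature[i * r:(i + 1) * r])) for i in range(b)]


def candidate_pairs(signatures, b, r):
    """Pairs of ids whose signatures agree on every row of at least one band."""
    buckets = defaultdict(list)
    for item_id, sig in signatures.items():
        for key in band_keys(sig, b, r):
            buckets[key].append(item_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


# Toy signatures: "x" and "y" agree on band 0; "z" shares no full band.
toy = {"x": [1, 2, 3, 4], "y": [1, 2, 9, 9], "z": [7, 7, 3, 5]}
print(candidate_pairs(toy, b=2, r=2))

# The S-curve: similar pairs are almost always candidates, dissimilar ones rarely.
for s in (0.3, 0.5, 0.8):
    print(s, round(candidate_probability(s, b=20, r=5), 4))
```

Including the band index in the bucket key keeps identical tuples from different bands in separate buckets.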
2. MinHash for Probability Distributions: P-MinHash and Jaccard Generalization
Classic MinHash and LSH operate on sets (indicator vectors). Extensions to positive-weighted or probabilistic data require analogues of the Jaccard index. A frequently cited extension is the weighted Jaccard index
$$J_W(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$
for $x, y \in \mathbb{R}_{\ge 0}^n$. However, $J_W$ loses scale invariance and fails to recover set-Jaccard under normalization (Moulton et al., 2018).
P-MinHash provides a scale-invariant, Pareto-optimal sampling-based extension. For nonnegative $x \in \mathbb{R}_{\ge 0}^n$, we define
$$h(x) = \arg\min_{i:\, x_i > 0} \frac{-\log u_i}{x_i},$$
where $u_i$ are independent uniform $(0, 1)$ hash values. The resulting “exponential race” samples index $i$ with probability $x_i / \sum_j x_j$. The collision probability is
$$J_P(x, y) = \sum_{i:\, x_i y_i > 0} \frac{1}{\sum_j \max(x_j / x_i,\ y_j / y_i)},$$
which is scale-invariant and reduces to classic Jaccard for indicator vectors. This is the unique Pareto-optimal collision kernel for sampling-based LSH over positive vectors (Moulton et al., 2018).
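The properties above can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it takes the probability Jaccard index to be the sum, over indices i in the shared support, of 1 / Σ_j max(x_j/x_i, y_j/y_i), takes the weighted Jaccard index to be Σ min / Σ max, and simulates the exponential race with uniforms shared between the two inputs. All function names are hypothetical.

```python
import math
import random


def weighted_jaccard(x, y):
    """Weighted Jaccard: sum of coordinate minima over sum of coordinate maxima."""
    return sum(map(min, x, y)) / sum(map(max, x, y))


def jaccard_p(x, y):
    """Probability Jaccard index: sum over shared support of 1 / sum_j max(...)."""
    total = 0.0
    for xi, yi in zip(x, y):
        if xi > 0 and yi > 0:
            total += 1.0 / sum(max(xj / xi, yj / yi) for xj, yj in zip(x, y))
    return total


def p_minhash(x, uniforms):
    """Exponential race: index minimizing -log(u_i) / x_i over the support of x."""
    support = (i for i, xi in enumerate(x) if xi > 0)
    return min(support, key=lambda i: -math.log(uniforms[i]) / x[i])


def collision_rate(x, y, trials=20000, seed=0):
    """Empirical collision frequency when both inputs share the same uniforms."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        u = [1.0 - rng.random() for _ in x]  # values in (0, 1], so log is defined
        hits += p_minhash(x, u) == p_minhash(y, u)
    return hits / trials


x = [0.5, 0.3, 0.2, 0.0]
y = [0.1, 0.3, 0.2, 0.4]
x_scaled = [10 * v for v in x]

print("J_P      :", jaccard_p(x, y), jaccard_p(x_scaled, y))                # equal up to rounding
print("J_W      :", weighted_jaccard(x, y), weighted_jaccard(x_scaled, y))  # changes with scale
print("empirical:", collision_rate(x, y))
print("sets     :", jaccard_p([1, 1, 1, 0], [0, 1, 1, 1]))                  # plain Jaccard of the supports
```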
3. Algorithmic Implementations: Sparse, Dense, and Set Data
Set/Binary Data
Set-based MinHash can be efficiently implemented using $k$ hash functions, each applied to set members. State-of-the-art optimizations include one-permutation MinHash and densified hashing, reducing permutation or hash function costs (Wang et al., 2014).
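As an illustration of the one-permutation idea, the sketch below hashes every element once, uses the hash to pick one of a fixed number of bins, and keeps the minimum per bin. It omits densification, so empty bins are kept as `None` and excluded from the estimate when both sketches are empty there. The BLAKE2b-based `hash64` helper and all function names are assumptions made for this example, and the inputs are assumed to be non-empty sets.

```python
import hashlib


def hash64(x, seed=0):
    """64-bit hash of an element, derived from a BLAKE2b digest."""
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def one_permutation_sketch(items, k, seed=0):
    """Hash each element once; keep the minimum value falling into each of k bins."""
    sketch = [None] * k
    for x in items:
        h = hash64(x, seed)
        bin_index, value = h % k, h // k
        if sketch[bin_index] is None or value < sketch[bin_index]:
            sketch[bin_index] = value
    return sketch


def estimate_jaccard_oph(sk_a, sk_b):
    """Matching bins divided by the bins that are non-empty in at least one sketch."""
    matches = 0
    jointly_empty = 0
    for u, v in zip(sk_a, sk_b):
        if u is None and v is None:
            jointly_empty += 1
        elif u == v:
            matches += 1
    return matches / (len(sk_a) - jointly_empty)


A = set(range(0, 2000))
B = set(range(1000, 3000))
sk_a = one_permutation_sketch(A, k=256)
sk_b = one_permutation_sketch(B, k=256)
print("exact:", len(A & B) / len(A | B), "estimate:", estimate_jaccard_oph(sk_a, sk_b))
```

Each element is hashed once regardless of the sketch length, which is the cost saving relative to applying every hash function to every element.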
Sparse and Weighted Data
For sparse (many $x_i = 0$) or explicit distributions, P-MinHash computes, for each nonzero $x_i$,
$$t_i = \frac{-\log u_i}{x_i},$$
returning $\arg\min_i t_i$. This takes $O(\mathrm{nnz}(x))$ time, where $\mathrm{nnz}(x)$ is the number of nonzero entries, and is streamable. For dense/continuous data, a global A*-like search uses proposal measures and bounds to reduce the expected number of steps for finite supports (Moulton et al., 2018).
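A streaming version for sparse inputs might look like the sketch below. It assumes the input is a mapping from string keys to nonnegative weights, and it derives the per-key uniform value from a BLAKE2b hash of the key and a seed so that every input sees the same value for the same key. `uniform_hash`, `p_minhash_sparse`, and `p_minhash_signature` are hypothetical names.

```python
import hashlib
import math


def uniform_hash(key, seed):
    """Map (key, seed) to a float in (0, 1] via a 64-bit BLAKE2b digest."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2 ** 64 + 1)


def p_minhash_sparse(pairs, seed):
    """One sample from an iterable of (key, weight) pairs in a single pass."""
    best_key, best_t = None, math.inf
    for key, weight in pairs:
        if weight <= 0:
            continue  # zero-weight entries never win the race
        t = -math.log(uniform_hash(key, seed)) / weight
        if t < best_t:
            best_key, best_t = key, t
    return best_key


def p_minhash_signature(weights, k):
    """k independent samples, one per seed 0..k-1."""
    items = list(weights.items())
    return [p_minhash_sparse(items, seed) for seed in range(k)]


doc = {"cat": 3.0, "dog": 1.0, "emu": 0.0}
doubled = {key: 2 * w for key, w in doc.items()}

sig = p_minhash_signature(doc, k=64)
print(sig[:8])
print(sig == p_minhash_signature(doubled, k=64))  # rescaling leaves the samples unchanged
```

Only the running minimum is kept, so the pass needs constant extra memory and touches each nonzero entry once per hash.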
Extensions: Structured and Hierarchical Data
Recent work has adapted MinHash LSH to data structures such as merge trees. For example, subpath-based and recursive MinHash signatures on rooted trees produce LSHable signatures supporting scalable comparative analysis in topological data analysis. Hash-based sketches are multiset-valued and processed via q-MinHash or recursive aggregation (Lyu et al., 2024).
4. Theoretical Guarantees: Collision Probability and Optimality
Unbiasedness
The collision probability of MinHash is exactly the target similarity: $\Pr[h(A) = h(B)] = J(A, B)$ for sets, and analogously $\Pr[h(x) = h(y)] = J_P(x, y)$ for probability distributions.
Pareto-Optimality
P-MinHash’s $J_P$ is Pareto-optimal: no other sampling-based LSH can strictly increase collision probability for some pair without decreasing it for another pair with higher $J_P$. The proof constructs auxiliary distributions and applies a pigeonhole argument across exclusive collision events (Moulton et al., 2018).
Embeddability
Banding schemes using MinHash produce S-shaped candidate curves sharply focusing on pairs above the chosen threshold. Both set-based and generalized MinHash LSH admit a formal metric embedding, with the complement of the similarity ($1 - J$ for sets, $1 - J_P$ for the generalized case) as a metric (Lyu et al., 2024).
5. Parameterization and Practical Guidance
The signature length $k$ is divided into $b$ bands of $r$ rows, with $k = br$, and the split is chosen to tune the retrieval threshold. For set data, recent implementations employ b-bit MinHash, densified sketches, and single-permutation hashing to reduce storage and preprocessing.
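One way to pick the band split is to scan the factorizations of the signature length and keep the one whose S-curve threshold is closest to the target similarity. The sketch below uses the common rule of thumb that the steep part of the curve 1 - (1 - s^r)^b sits near (1/b)^(1/r), where b is the number of bands and r the rows per band. The signature length 128 and target 0.5 are illustrative values chosen here, not recommendations from the source, and the function names are hypothetical.

```python
def approx_threshold(b, r):
    """Rule-of-thumb similarity at which the banding S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


def choose_bands(k, target):
    """Among factorizations k = b * r, return the (b, r) with threshold nearest target."""
    best = None
    for r in range(1, k + 1):
        if k % r:
            continue  # r must divide the signature length
        b = k // r
        gap = abs(approx_threshold(b, r) - target)
        if best is None or gap < best[0]:
            best = (gap, b, r)
    return best[1], best[2]


b, r = choose_bands(k=128, target=0.5)
print(b, r, round(approx_threshold(b, r), 3))
```

More rows per band push the threshold up and cut false positives, while more bands pull it down and improve recall.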
For P-MinHash on web-scale sparse data, it is typical to concatenate several independent hashes, using 64-bit hash functions (e.g., splitmix64, xxHash). The number of output hash keys controls the trade-off between collision rates and recall (Moulton et al., 2018).
6. Empirical Performance and Applications
Extensive empirical studies confirm MinHash LSH's effectiveness for duplicate and near-duplicate detection, large-scale document retrieval, and clustering (Jafari et al., 2021, Shrivastava et al., 2014, Wang et al., 2014, Moulton et al., 2018).
When compared to SimHash, MinHash offers superior candidate reduction and retrieval precision, especially for high similarity search tasks on sparse binary data. MinHash maintains a lower gap constant $\rho$ in the LSH time bound $O(n^\rho)$ for approximate neighbor search, ensuring more efficient search (Shrivastava et al., 2014).
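For orientation, the gap exponent of an LSH family is conventionally defined as ln(1/p1) / ln(1/p2), where p1 is the collision probability for pairs that count as near and p2 the collision probability for pairs that count as far. The sketch below evaluates it for MinHash, whose collision probability equals the Jaccard similarity. The near threshold `s0` and approximation factor `c` are illustrative parameters, and this is the textbook definition rather than a formula quoted from the cited comparison.

```python
import math


def lsh_rho(p1, p2):
    """Gap exponent ln(1/p1) / ln(1/p2) for collision probabilities 1 > p1 > p2 > 0."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)


def minhash_rho(s0, c):
    """Exponent for MinHash when near pairs have J >= s0 and far pairs have J <= c * s0."""
    return lsh_rho(s0, c * s0)


for s0 in (0.5, 0.8, 0.95):
    print(s0, round(minhash_rho(s0, c=0.5), 3))
```

A smaller exponent means fewer hash tables and probes are needed for the same success probability, which is the sense in which a lower gap constant yields more efficient search.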
In web-scale tasks, P-MinHash yields higher collision rates on important data pairs, leading to notably reduced database accesses (e.g., achieving target recall with half the OR-lookups needed by weighted MinHash). The extension to merge trees and other structured data enables orders-of-magnitude speedups over edit distance or geometric methods, achieving near-linear scalability and interactive responsiveness for scientific workflows (Moulton et al., 2018, Lyu et al., 2024).
7. Extensions and Open Directions
Recent deployments extend MinHash LSH beyond sets and vectors, including rooted trees (merge trees, subpath signatures), multisets, and arbitrary positive functions subject to measure-theoretic constraints (Lyu et al., 2024). The general principle remains: define a collision kernel based on an interpretable similarity measure, derive (when possible) a maximally consistent or optimal sampling-based hash, and tune hash concatenation and banding for application-dependent thresholds.
While MinHash LSH remains dominant for set and weighted set similarity search, possibilities include further compression (b-bit or hashed sketches), deeper structure-adaptive hierarchies, and adaptation to different data-dependent similarity metrics within the LSH framework, preserving unbiasedness and collision correspondence.
References:
- (Moulton et al., 2018) "Maximally Consistent Sampling and the Jaccard Index of Probability Distributions"
- (Jafari et al., 2021) "A Survey on Locality Sensitive Hashing Algorithms and their Applications"
- (Wang et al., 2014) "Hashing for Similarity Search: A Survey"
- (Shrivastava et al., 2014) "In Defense of MinHash Over SimHash"
- (Lyu et al., 2024) "Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing"