
MinHash Locality Sensitive Hashing (LSH)

Updated 7 May 2026
  • MinHash LSH is a randomized algorithm enabling approximate similarity search in large datasets through hash-based signatures and Jaccard index.
  • This technique utilizes banding and hashing strategies to efficiently search and retrieve similar data points, important for deduplication and clustering.
  • Applications of MinHash LSH span document retrieval, clustering, and more, facilitating rapid data processing and analysis in high-dimensional spaces.

MinHash Locality Sensitive Hashing (LSH) is a family of randomized algorithms foundational for approximate similarity search in high-dimensional discrete and numerical domains. The MinHash LSH framework maps input objects—traditionally sets, and more generally weighted or probabilistic objects—into lower-dimensional signatures such that the probability of hash collisions reflects a specific similarity measure (Jaccard or its extensions). This property enables sublinear-time candidate retrieval in large databases, making MinHash LSH a critical tool in deduplication, clustering, large-scale information retrieval, and modern scientific domains such as topological data analysis.

1. Foundations: Jaccard Similarity, MinHash, and LSH Structure

MinHash LSH originates from the relationship between Jaccard similarity and hash-based estimation. For sets $A, B \subseteq U$, Jaccard similarity is

$$J(A,B) = \frac{|A\cap B|}{|A\cup B|}.$$

The canonical MinHash algorithm selects a random permutation $\pi$ of $U$, defining $h_\pi(A) = \min\{\pi(x) : x \in A\}$. The central property is

$$\Pr[h_\pi(A) = h_\pi(B)] = J(A,B),$$

since the minimum of $A \cup B$ under a random order lies in $A\cap B$ with probability $|A\cap B|/|A\cup B|$ (Jafari et al., 2021, Wang et al., 2014).
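The collision property can be checked numerically. The following is a minimal sketch, assuming a small integer universe and explicit random permutations; the function names and the example sets are illustrative and do not come from the cited papers.

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash(s, rank):
    """MinHash of set s under the permutation encoded by rank (element -> position)."""
    return min(rank[x] for x in s)

def collision_rate(a, b, universe, trials=20000, seed=0):
    """Fraction of random permutations of universe under which a and b share a MinHash."""
    rng = random.Random(seed)
    order = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(order)                      # draw a fresh random permutation
        rank = {x: i for i, x in enumerate(order)}
        hits += minhash(a, rank) == minhash(b, rank)
    return hits / trials

A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8}
print(jaccard(A, B))                       # exact value: 2 / 8 = 0.25
print(collision_rate(A, B, range(1, 11)))  # Monte Carlo estimate, close to 0.25
```

Explicit permutations are only practical for small universes; they are used here because they match the definition directly.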

For practical scale, $k$ independent MinHash functions, each generated by independent random permutations (or universal hash functions), produce signature vectors. The fraction of agreeing coordinates across signatures is an unbiased estimator of $J(A,B)$.
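A signature and its estimator can be sketched as follows. This assumes integer element ids and substitutes linear hashes modulo a Mersenne prime for true random permutations, so the estimate is only approximately unbiased; `make_hash_params`, `signature`, and `estimate_jaccard` are hypothetical names chosen for illustration.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any element id used here

def make_hash_params(k, seed=0):
    """Draw k pairs (a, b) defining the hashes x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]

def signature(items, params):
    """k-coordinate MinHash signature: the minimum hash value under each function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of coordinates on which two signatures agree."""
    matches = sum(u == v for u, v in zip(sig_a, sig_b))
    return matches / len(sig_a)

ids = random.Random(1).sample(range(10**9), 150)
A, B = set(ids[:100]), set(ids[50:])      # overlap of 50 ids, union of 150
params = make_hash_params(k=512)
estimate = estimate_jaccard(signature(A, params), signature(B, params))
print(estimate)  # the true Jaccard similarity is 50 / 150
```

The standard deviation of the estimate shrinks roughly as $1/\sqrt{k}$, which is the usual reason for choosing longer signatures when finer similarity resolution is needed.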

To enable sublinear candidate retrieval, banding LSH organizes the $k$-row signatures into $b$ bands of $r$ rows. Within each band, the $r$-tuple serves as a bucket key, and two objects become a candidate pair if they share a bucket in at least one band: $\Pr[\text{candidate}] = 1 - (1 - s^r)^b$, where $s$ is the true Jaccard similarity. This forms an S-curve, sharply discriminating object pairs above or below a selected threshold (Wang et al., 2014).
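The banding step and its S-curve can be sketched in a few lines. This is an illustrative sketch, assuming signatures whose length is a multiple of the band height `r`; the helper names and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_probability(s, r, b):
    """Probability that a pair with Jaccard similarity s shares at least one band."""
    return 1.0 - (1.0 - s ** r) ** b

def band_keys(sig, r):
    """Split a signature into (band index, r-tuple) bucket keys."""
    return [(i // r, tuple(sig[i:i + r])) for i in range(0, len(sig), r)]

def candidate_pairs(signatures, r):
    """Return all pairs of names that collide in at least one band."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for key in band_keys(sig, r):
            buckets[key].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
print(candidate_pairs(sigs, r=2))            # "a" and "b" agree on band 0
print(candidate_probability(0.5, r=2, b=2))  # 1 - (1 - 0.25)**2 = 0.4375
```

Including the band index in the key keeps identical tuples from different bands in separate buckets.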

2. MinHash for Probability Distributions: P-MinHash and Jaccard Generalization

Classic MinHash and LSH operate on sets (indicator vectors). Extensions to positive-weighted or probabilistic data require analogues of the Jaccard index. A frequently cited extension is

$$J_W(x,y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$

for $x, y \in \mathbb{R}_{\ge 0}^n$. However, $J_W$ loses scale invariance and fails to recover set-Jaccard under normalization (Moulton et al., 2018).
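Both shortcomings are easy to see on small vectors. The sketch below assumes the min/max form of weighted Jaccard (sum of coordinate minima over sum of coordinate maxima) and uses exact fractions; the example vectors are illustrative.

```python
from fractions import Fraction

def weighted_jaccard(x, y):
    """Sum of coordinate-wise minima divided by sum of coordinate-wise maxima."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return Fraction(num) / Fraction(den)

def normalize(x):
    """Rescale a nonnegative integer vector so that its entries sum to one."""
    total = sum(x)
    return [Fraction(v, total) for v in x]

x = [1, 2, 0]
print(weighted_jaccard(x, [2 * v for v in x]))  # 1/2, although both describe the same distribution

a = [1, 1, 0, 0]   # indicator vector of the set {0, 1}
b = [0, 1, 1, 1]   # indicator vector of the set {1, 2, 3}
print(weighted_jaccard(a, b))                        # 1/4, the set Jaccard
print(weighted_jaccard(normalize(a), normalize(b)))  # 1/5 after normalization
```

Rescaling one argument changes the value, and normalizing two indicator vectors no longer returns the Jaccard similarity of the underlying sets.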

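P-MinHash provides a scale-invariant, Pareto-optimal sampling-based extension. For nonnegative $x$, we define

$$h(x) = \arg\min_i \frac{-\log u_i}{x_i},$$

where $u_i$ are independent uniform $(0,1)$ hash values. The resulting “exponential race” samples index $i$ with probability $x_i / \sum_j x_j$. The collision probability is

$$J_P(x,y) = \sum_{i \,:\, x_i y_i > 0} \frac{1}{\sum_j \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)},$$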

which is scale-invariant and reduces to classic Jaccard for indicator vectors. This is the unique Pareto-optimal collision kernel for sampling-based LSH over positive vectors (Moulton et al., 2018).
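The two stated properties can be checked directly. This sketch assumes the collision probability has the form $J_P(x,y) = \sum_{i: x_i y_i > 0} 1 / \sum_j \max(x_j/x_i, y_j/y_i)$ and evaluates it in quadratic time for clarity; the function name and test vectors are illustrative.

```python
def jaccard_p(x, y):
    """Evaluate J_P for two nonnegative vectors of equal length (O(n^2) for clarity)."""
    total = 0.0
    for xi, yi in zip(x, y):
        if xi > 0 and yi > 0:
            total += 1.0 / sum(max(xj / xi, yj / yi) for xj, yj in zip(x, y))
    return total

a = [1, 1, 0, 0]   # indicator vector of the set {0, 1}
b = [0, 1, 1, 1]   # indicator vector of the set {1, 2, 3}
print(jaccard_p(a, b))                        # 0.25, the set Jaccard

x = [1.0, 2.0, 0.0]
print(jaccard_p(x, [3.0 * v for v in x]))     # close to 1.0: rescaling does not matter
print(jaccard_p([0.5, 0.5, 0, 0], [0, 1/3, 1/3, 1/3]))  # close to 0.25 after normalization
```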

3. Algorithmic Implementations: Sparse, Dense, and Set Data

Set/Binary Data

Set-based MinHash can be efficiently implemented using $k$ hash functions, each applied to set members. State-of-the-art optimizations include one-permutation MinHash and densified hashing, reducing permutation or hash function costs (Wang et al., 2014).

Sparse and Weighted Data

For sparse (many $x_i = 0$) or explicit distributions, P-MinHash computes, for each nonzero $x_i$ with uniform hash value $u_i$,

$$t_i = \frac{-\log u_i}{x_i},$$

returning $\arg\min_i t_i$. This takes time linear in the number of nonzero entries and is streamable. For dense/continuous data, a global A*-like search uses proposal measures and bounds to reduce computation; its expected number of steps for finite supports is analyzed in (Moulton et al., 2018).
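The sparse procedure can be sketched as a single pass over the nonzero entries. This is a minimal sketch, assuming the race times are $-\log u_i / x_i$ with the smallest one winning; `uniform_hash` is a hypothetical helper, and `hashlib.blake2b` stands in for a fast 64-bit hash purely because it ships with the standard library.

```python
import hashlib
import math

def uniform_hash(index, seed):
    """Map (seed, index) deterministically to a float strictly inside (0, 1)."""
    digest = hashlib.blake2b(f"{seed}:{index}".encode(), digest_size=8).digest()
    value = int.from_bytes(digest, "big")
    return ((value >> 12) + 0.5) / float(1 << 52)

def p_minhash(weights, seed):
    """Return the index winning the exponential race over a sparse vector.

    weights maps index -> positive weight; zero entries are simply absent.
    """
    best_index, best_time = None, math.inf
    for index, w in weights.items():
        if w <= 0:
            continue
        t = -math.log(uniform_hash(index, seed)) / w   # race time of this index
        if t < best_time:
            best_index, best_time = index, t
    return best_index

a = {i: 1.0 for i in range(4)}       # indicator of {0, 1, 2, 3}
b = {i: 1.0 for i in range(2, 8)}    # indicator of {2, ..., 7}
rate = sum(p_minhash(a, s) == p_minhash(b, s) for s in range(4000)) / 4000
print(rate)  # close to the set Jaccard 2 / 8
```

Because every vector uses the same hash value for a given index and seed, two vectors collide exactly when the same index wins both races. Each seed yields one coordinate of a signature.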

Extensions: Structured and Hierarchical Data

Recent work has adapted MinHash LSH to data structures such as merge trees. For example, subpath-based and recursive MinHash signatures on rooted trees produce LSHable signatures supporting scalable comparative analysis in topological data analysis. Hash-based sketches are multiset-valued and processed via q-MinHash or recursive aggregation (Lyu et al., 2024).

4. Theoretical Guarantees: Collision Probability and Optimality

Unbiasedness

The collision probability of MinHash is exactly the target similarity: $\Pr[h_\pi(A) = h_\pi(B)] = J(A,B)$ for sets, and analogously the generalized index $J_P$ for probability distributions.

Pareto-Optimality

P-MinHash’s collision probability $J_P$ is Pareto-optimal: no other sampling-based LSH can strictly increase collision probability for some pair without decreasing it for another pair with higher $J_P$. The proof constructs auxiliary distributions and applies a pigeonhole argument across exclusive collision events (Moulton et al., 2018).

Embeddability

Banding schemes using MinHash produce S-shaped candidate curves sharply focusing on pairs above the chosen threshold. Both set-based and generalized MinHash LSH admit a formal metric embedding, with the associated Jaccard-type distance serving as a metric (Lyu et al., 2024).

5. Parameterization and Practical Guidance

A signature of length $k$ is divided into $b$ bands of $r$ rows (with $k = b \cdot r$) to tune the retrieval threshold. For set data, recent implementations employ b-bit MinHash, densified sketches, and single-permutation hashing to reduce storage and preprocessing.
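One common way to pick the band layout is the rule of thumb that the S-curve rises most steeply near $s \approx (1/b)^{1/r}$. The sketch below enumerates the factorizations of a signature length and keeps the one whose approximate threshold is closest to a target; the value `k = 128` and the target `0.5` are illustrative assumptions, not recommendations taken from the source.

```python
def approx_threshold(b, r):
    """Approximate similarity at which the banding S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)

def choose_bands(k, target):
    """Return (b, r) with b * r == k whose approximate threshold is nearest to target."""
    best = None
    for r in range(1, k + 1):
        if k % r:
            continue                      # r must divide the signature length
        b = k // r
        gap = abs(approx_threshold(b, r) - target)
        if best is None or gap < best[0]:
            best = (gap, b, r)
    return best[1], best[2]

b, r = choose_bands(k=128, target=0.5)
print(b, r, approx_threshold(b, r))  # 32 bands of 4 rows, threshold near 0.42
```

Larger `r` pushes the threshold up and reduces false candidates; larger `b` pulls it down and improves recall.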

For P-MinHash on web-scale sparse data, it is typical to concatenate multiple independent hashes, using 64-bit hash functions (e.g., splitmix64, xxHash). The number of output hash keys controls the trade-off between collision rates and recall (Moulton et al., 2018).

6. Empirical Performance and Applications

Extensive empirical studies confirm MinHash LSH's effectiveness for duplicate and near-duplicate detection, large-scale document retrieval, and clustering (Jafari et al., 2021, Shrivastava et al., 2014, Wang et al., 2014, Moulton et al., 2018).

When compared to SimHash, MinHash offers superior candidate reduction and retrieval precision, especially for high similarity search tasks on sparse binary data. MinHash maintains a lower gap constant $\rho$ in the LSH time bound $O(n^{\rho})$ for approximate neighbor search, ensuring more efficient search (Shrivastava et al., 2014).
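For orientation, the exponent can be computed from the standard LSH definition $\rho = \ln(1/p_1) / \ln(1/p_2)$, where $p_1$ and $p_2$ are the collision probabilities at the near and far similarity thresholds. The sketch below applies it to MinHash only, whose collision probability equals the Jaccard similarity; the thresholds are illustrative, and the code does not reproduce the comparison with SimHash.

```python
import math

def lsh_rho(p1, p2):
    """Exponent rho = ln(1/p1) / ln(1/p2) for collision probabilities p1 > p2."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)

def minhash_rho(s_near, s_far):
    """For MinHash the collision probability is the Jaccard similarity itself."""
    return lsh_rho(s_near, s_far)

print(minhash_rho(0.8, 0.4))  # about 0.24: query cost grows far more slowly than n
print(minhash_rho(0.6, 0.4))  # a narrower gap gives a larger exponent
```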

In web-scale tasks, P-MinHash yields higher collision rates on important data pairs, leading to notably reduced database accesses (e.g., achieving target recall with half the OR-lookups needed by weighted MinHash). The extension to merge trees and other structured data enables orders-of-magnitude speedups over edit distance or geometric methods, achieving near-linear scalability and interactive responsiveness for scientific workflows (Moulton et al., 2018, Lyu et al., 2024).

7. Extensions and Open Directions

Recent deployments extend MinHash LSH beyond sets and vectors, including rooted trees (merge trees, subpath signatures), multisets, and arbitrary positive functions subject to measure-theoretic constraints (Lyu et al., 2024). The general principle remains: define a collision kernel based on an interpretable similarity measure, derive (when possible) a maximally consistent or optimal sampling-based hash, and tune hash concatenation and banding for application-dependent thresholds.

While MinHash LSH remains dominant for set and weighted set similarity search, open directions include further compression (b-bit or hashed sketches), deeper structure-adaptive hierarchies, and adaptation to data-dependent similarity metrics within the LSH framework while preserving unbiasedness and collision correspondence.


References:

(Moulton et al., 2018) "Maximally Consistent Sampling and the Jaccard Index of Probability Distributions"
(Jafari et al., 2021) "A Survey on Locality Sensitive Hashing Algorithms and their Applications"
(Wang et al., 2014) "Hashing for Similarity Search: A Survey"
(Shrivastava et al., 2014) "In Defense of MinHash Over SimHash"
(Lyu et al., 2024) "Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing"
