MinHash–Jaccard Criterion
- MinHash–Jaccard criterion is a method that equates the collision probability of hash signatures with Jaccard similarity, applicable to sets, multisets, and spatial objects.
- It provides unbiased estimators using independent hash functions and incorporates variance reduction techniques for improved accuracy in similarity estimation.
- Widely used in data mining and information retrieval, it supports approximate nearest neighbor search, duplicate detection, and efficient sketching in large-scale systems.
The MinHash–Jaccard criterion underpins a widely used family of techniques for efficient set-similarity search in large-scale data mining and information retrieval. It centers on the equivalence between the probability that two sets, polygons, or weighted objects produce the same MinHash value and their Jaccard (resemblance) similarity. This criterion immediately yields unbiased estimators for Jaccard similarity and enables locality-sensitive hashing (LSH) constructions, approximate nearest neighbor (ANN) search, and a variety of advanced sketching and sampling methodologies. The criterion has been extended from unweighted sets to weighted multisets, continuous measures, and even domain-specific objects such as polygons in spatial databases (Subedi et al., 20 Nov 2025, Wu et al., 2018, Ertl, 2017, Li et al., 2021, Dahlgaard et al., 2017, Zadeh et al., 2012, Moulton et al., 2018).
1. Formal Definition of the MinHash–Jaccard Criterion
Let $A$ and $B$ be two objects (typically finite sets, multisets, or measurable regions) in a universe $U$. Their Jaccard similarity is defined as
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}.$$
For polygons ($P$, $Q$), we have
$$J(P,Q) = \frac{\mathrm{Area}(P \cap Q)}{\mathrm{Area}(P \cup Q)}.$$
A MinHash function $h$ maps sets to signatures such that
$$\Pr[h(A) = h(B)] = J(A,B).$$
For binary sets, a classical MinHash applies a random permutation $\pi$ to $U$ and sets $h(A) = \min_{a \in A} \pi(a)$. For polygons, MinHash is instantiated by sampling uniform random points in a bounding rectangle and counting the number of samples until the first falls within the object (Subedi et al., 20 Nov 2025).
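The permutation-based construction can be sketched directly. The following is a minimal illustrative implementation (one explicit random permutation per signature coordinate; function names are my own, and this is the textbook scheme rather than any of the optimized variants discussed later):

```python
import random

def minhash_signature(s, k, universe, seed=0):
    """Classical MinHash: one random permutation of the universe per coordinate;
    the signature stores the minimum permutation rank over the set's elements."""
    rng = random.Random(seed)
    sig = []
    for _ in range(k):
        perm = list(universe)
        rng.shuffle(perm)                       # random permutation pi_i
        rank = {x: i for i, x in enumerate(perm)}
        sig.append(min(rank[x] for x in s))     # h_i(s) = min_{x in s} pi_i(x)
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching coordinates is an unbiased Jaccard estimate."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

universe = range(100)
A = set(range(0, 60))        # |A ∩ B| = 30, |A ∪ B| = 90, so J = 1/3
B = set(range(30, 90))
sa = minhash_signature(A, 512, universe, seed=42)
sb = minhash_signature(B, 512, universe, seed=42)   # same seed: shared permutations
est = estimate_jaccard(sa, sb)
```

Both signatures must be built from the same seeded permutations; with $k = 512$ coordinates the estimate concentrates tightly around the true value of $1/3$.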
2. Collision Probability and Estimation Properties
The core theoretical result is that for independent hash functions (or sampling seeds), the collision probability for any single MinHash coordinate equals the Jaccard similarity:
$$\Pr[h_i(A) = h_i(B)] = J(A,B)$$
for all $i = 1, \dots, k$ (Subedi et al., 20 Nov 2025, Li et al., 2021, Wu et al., 2018).
Given $k$ independent MinHash components, define $X_i = \mathbf{1}[h_i(A) = h_i(B)]$ as the indicator that the $i$th coordinate matches, so
$$\hat{J} = \frac{1}{k} \sum_{i=1}^{k} X_i$$
provides an unbiased estimator: $\mathbb{E}[\hat{J}] = J(A,B)$.
Concentration bounds follow from Hoeffding's or Chernoff's inequality. For any $\varepsilon > 0$,
$$\Pr\left[\,|\hat{J} - J(A,B)| \ge \varepsilon\,\right] \le 2\exp(-2k\varepsilon^2)$$
(Subedi et al., 20 Nov 2025, Dahlgaard et al., 2017, Wu et al., 2018).
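The Hoeffding bound $2\exp(-2k\varepsilon^2)$ makes it easy to size a sketch for a target accuracy; a small sketch of that calculation (function names are illustrative):

```python
import math

def hoeffding_bound(k, eps):
    """Upper bound on Pr[|J_hat - J| >= eps] for a k-coordinate MinHash sketch."""
    return 2.0 * math.exp(-2.0 * k * eps * eps)

def signatures_needed(eps, delta):
    """Smallest k guaranteeing the deviation bound above is at most delta:
    k >= ln(2/delta) / (2 eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

k = signatures_needed(0.05, 0.01)   # ±0.05 error at 99% confidence
```

For $\varepsilon = 0.05$ and $\delta = 0.01$ this gives $k = 1060$ coordinates, at which point the bound evaluates to just under $0.01$.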
This property extends to generalized criteria, e.g., generalized Jaccard for weighted sets, polygonal intersection for spatial objects, and probability distribution analogues (Wu et al., 2018, Moulton et al., 2018).
3. Algorithmic Realizations: Classical, PolyMinHash, and Advanced Schemes
Standard MinHash
For $k$ hash functions, sketch each set $A$ as $\mathrm{sig}(A) = (h_1(A), \dots, h_k(A))$ with $h_i(A) = \min_{a \in A} \pi_i(a)$. Jaccard estimation reduces to the fraction of coordinate-wise collisions (Wu et al., 2018).
PolyMinHash for Area-based Similarity
For a polygon $P$ within bounding rectangle $R$:
- For each of $k$ seeds, repeatedly sample a point uniformly at random from $R$
- Count attempts until a sampled point lands in $P$
- The signature value $h_i(P)$ is the attempt number for the $i$th seed

Signatures are compared via collision counts; the collision probability matches area-based Jaccard (Subedi et al., 20 Nov 2025; pseudocode appears in the original paper).
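The steps above can be sketched as follows. This is an illustrative implementation, not the paper's code: each seed drives a shared point stream, so two polygons record the same attempt number exactly when the first sample landing in their union lands in their intersection, which yields the area-based Jaccard as collision probability. Axis-aligned rectangles stand in for general polygons, and the inclusion test is a standard ray-casting routine:

```python
import random

def point_in_polygon(pt, poly):
    """Standard ray-casting point-in-polygon test (poly = list of (x, y) vertices)."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def poly_minhash(poly, bbox, k, seed=0):
    """One signature value per seed: the attempt count of the first uniform
    sample in bbox that lands inside poly (point stream shared via the seed)."""
    xmin, ymin, xmax, ymax = bbox
    sig = []
    for i in range(k):
        rng = random.Random(seed * 100003 + i)   # same stream for all polygons
        attempts = 1
        while True:
            pt = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
            if point_in_polygon(pt, poly):
                break
            attempts += 1
        sig.append(attempts)
    return sig

# Two overlapping rectangles in a shared bounding box:
# area(P ∩ Q) = 12, area(P ∪ Q) = 36, so area-Jaccard = 1/3.
P = [(0, 0), (6, 0), (6, 4), (0, 4)]
Q = [(3, 0), (9, 0), (9, 4), (3, 4)]
bbox = (0, 0, 9, 4)
k = 512
sp = poly_minhash(P, bbox, k, seed=7)
sq = poly_minhash(Q, bbox, k, seed=7)
est = sum(a == b for a, b in zip(sp, sq)) / k
```

Since the stream is shared, equal attempt counts imply the two polygons accepted the same point, which is why the collision fraction estimates the area-based Jaccard.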
Weighted and Probabilistic Extensions
Weighted MinHash via Consistent Weighted Sampling (CWS) and variants produces unbiased estimators of the weighted (generalized) Jaccard similarity
$$J_W(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}.$$
Implementations such as ICWS, PCWS, 0-bit CWS, and I$^2$CWS provide efficiency/accuracy trade-offs in large-scale settings (Wu et al., 2018). Probability-distribution MinHash generalizes the collision probability using the “maximally consistent sampling” criterion (Moulton et al., 2018).
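A compact sketch of Ioffe-style ICWS under this criterion (the per-hash Gamma/uniform draws are shared across inputs through a common seed; the seeding scheme and names are illustrative, not a reference implementation):

```python
import numpy as np

def icws_signature(w, k, seed=0):
    """Improved Consistent Weighted Sampling (Ioffe-style): one (index, t) pair
    per hash; collision probability equals the generalized (weighted) Jaccard."""
    w = np.asarray(w, dtype=float)
    d = len(w)
    active = w > 0
    sig = []
    for i in range(k):
        # Same seed for every input vector => shared r, c, beta per feature.
        rng = np.random.default_rng(seed * 100003 + i)
        r = rng.gamma(2.0, 1.0, size=d)
        c = rng.gamma(2.0, 1.0, size=d)
        beta = rng.uniform(0.0, 1.0, size=d)
        t = np.floor(np.log(w[active]) / r[active] + beta[active])
        ln_y = r[active] * (t - beta[active])
        ln_a = np.log(c[active]) - ln_y - r[active]
        j = np.argmin(ln_a)                       # consistent weighted sample
        sig.append((np.flatnonzero(active)[j], t[j]))
    return sig

x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 2.0, 0.0])
true_jw = np.minimum(x, y).sum() / np.maximum(x, y).sum()   # 3/8
k = 1000
sx = icws_signature(x, k, seed=3)
sy = icws_signature(y, k, seed=3)
est = sum(a == b for a, b in zip(sx, sy)) / k
```

Matching the full $(j^*, t^*)$ pair, not just the index, is what makes the collision probability equal $J_W$ rather than a coarser quantity.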
Variance-Reduced and Efficient Schemes
- SuperMinHash: Uses permutation-based value shifts for variance reduction (it can roughly halve the variance relative to classical MinHash in favorable regimes) and improves amortized insertion time for large sets (Ertl, 2017).
- Circulant MinHash (C-MinHash): Employs one or two permutations with circulant shifts so that the estimator is unbiased (two-permutation variant) or has negligible bias (one-permutation variant), while provably achieving lower variance than MinHash with $k$ independent permutations (Li et al., 2021).
- Dimension-Independent/Distributed: Strategies such as DISCO (MinHashSampleMap) yield unbiased estimators while reducing communication in MapReduce settings from O(N·L·k) to O(D·k·log(Dk)) without loss in accuracy (Zadeh et al., 2012).
- Fast Sketching: “Mixture” sketches construct length-$k$ signatures in time near-linear in the input size (rather than the naive $O(nk)$), while guaranteeing the same collision and concentration properties as classical MinHash (Dahlgaard et al., 2017).
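The circulant idea behind C-MinHash can be illustrated in a few lines: relabel elements with one permutation $\sigma$, then derive all $k$ coordinates from circulant shifts of a single second permutation $\pi$, instead of drawing $k$ independent permutations. This is an illustrative sketch, not the authors' code:

```python
import random

def c_minhash(s, k, D, seed=0):
    """C-MinHash-(sigma, pi) sketch: relabel with sigma, then take
    h_i(S) = min over elements of pi circulantly shifted by i."""
    rng = random.Random(seed)
    sigma = list(range(D)); rng.shuffle(sigma)   # initial relabeling
    pi = list(range(D)); rng.shuffle(pi)         # single shared permutation
    relabeled = [sigma[x] for x in s]
    return [min(pi[(x + i) % D] for x in relabeled) for i in range(k)]

D = 100
A = set(range(0, 60))
B = set(range(30, 90))          # J(A, B) = 30 / 90 = 1/3
k = 64
trials = 200
total = 0.0
for t in range(trials):
    ha = c_minhash(A, k, D, seed=t)   # same seed => shared sigma, pi
    hb = c_minhash(B, k, D, seed=t)
    total += sum(a == b for a, b in zip(ha, hb)) / k
mean_est = total / trials             # averages out per-permutation noise
```

Only two permutations are generated per sketch regardless of $k$, yet averaging over many random $(\sigma, \pi)$ pairs recovers the Jaccard similarity, consistent with the unbiasedness of the two-permutation variant.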
4. Extensions: Weighted, Continuous, and Spatial Objects
The MinHash–Jaccard criterion generalizes beyond finite sets:
- Weighted Sets (Generalized Jaccard): Estimators and collision probability retain the same min/max form for weighted objects via CWS and its improved variants. All key approaches, including quantization, active index sampling, and CWS, yield unbiased (or asymptotically unbiased) estimators, with per-coordinate complexities that vary with the method (Wu et al., 2018).
- Probability Distributions: The generalized collision formula for nonnegative vectors $x, y$ (including probability distributions) is
$$\Pr[h(x) = h(y)] = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)},$$
which reduces to classical Jaccard when $x, y$ are binary. This is Pareto-optimal among LSHs based on sampling (Moulton et al., 2018).
- Spatial Domains: PolyMinHash adapts MinHash to polygonal objects by replacing set membership with geometric inclusion and the uniform permutation with uniform random point sampling in a bounding rectangle (Subedi et al., 20 Nov 2025). The collision probability still exactly recovers area-based Jaccard.
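The generalized min/max form of Jaccard similarity used for weighted sets and distributions can be computed directly; a minimal sketch showing that binary indicator vectors recover the classical set Jaccard:

```python
import numpy as np

def generalized_jaccard(x, y):
    """J(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i) for nonnegative vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

# Binary indicator vectors recover the classical set Jaccard:
a = np.array([1, 1, 1, 0, 0])   # encodes {0, 1, 2}
b = np.array([0, 1, 1, 1, 1])   # encodes {1, 2, 3, 4}
jw = generalized_jaccard(a, b)  # |∩| = 2, |∪| = 5 → 0.4

# Probability distributions fit the same formula unchanged:
p = np.array([0.50, 0.50, 0.00])
q = np.array([0.25, 0.25, 0.50])
jpq = generalized_jaccard(p, q)  # 0.5 / 1.5 → 1/3
```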
5. Empirical Analysis and Search Trade-offs
Precision and runtime in MinHash–Jaccard-based systems hinge on the signature length $k$:
- Increasing $k$ reduces estimator variance and hence false-positive rates, but increases computation and space costs (Subedi et al., 20 Nov 2025, Ertl, 2017, Wu et al., 2018).
- In PolyMinHash, longer signatures prune roughly $70\%$ or more of the data at high recall (at least a $2\times$ speedup); intermediate signature lengths achieve higher pruning and at least $4\times$ speedup at recall $0.88$–$0.93$; the shortest signatures push pruning further still, with speedups of $6\times$ or more, but recall drops to $0.60$–$0.71$ (Subedi et al., 20 Nov 2025).
- SuperMinHash and C-MinHash consistently yield strictly lower estimator variance than classical MinHash: SuperMinHash's gains are largest for short signatures and small sets, while C-MinHash provably reduces variance across all signature lengths $k$ (Ertl, 2017, Li et al., 2021).
- Fast sketching schemes and distributed implementations preserve concentration while reducing runtime or communication (Zadeh et al., 2012, Dahlgaard et al., 2017).
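The variance/length trade-off above follows from the Bernoulli structure of the estimator: each coordinate collides independently with probability $J$, so the estimator's standard deviation is $\sqrt{J(1-J)/k}$. A quick numeric check (illustrative):

```python
import math
import random

def estimator_std(J, k):
    """Std. dev. of the k-coordinate collision-fraction estimator: sqrt(J(1-J)/k)."""
    return math.sqrt(J * (1.0 - J) / k)

J = 1.0 / 3.0
# Quadrupling k halves the standard deviation (variance ∝ 1/k):
ratio = estimator_std(J, 32) / estimator_std(J, 128)

# Monte Carlo check: simulate k independent Bernoulli(J) collision indicators.
rng = random.Random(0)
k, trials = 128, 2000
samples = [sum(rng.random() < J for _ in range(k)) / k for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
emp_std = math.sqrt(var)   # should track estimator_std(J, 128) ≈ 0.042
```

This is why precision improvements from longer signatures come with diminishing returns: halving the error always costs a fourfold increase in $k$.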
6. Applications and Theoretical Significance
The MinHash–Jaccard criterion is foundational for:
- Approximate nearest neighbor (ANN) search (text, spatial databases, trajectory matching): locality-sensitive signatures enable sublinear filtering and massive data pruning (Subedi et al., 20 Nov 2025, Dahlgaard et al., 2017).
- Large-scale duplicate detection in document or entity collections (Zadeh et al., 2012).
- Distributed and streaming systems: Efficiently sketched similarity enables real-time analytics in big-data settings (Zadeh et al., 2012).
- Geometric/joint estimators: SetSketch and related approaches further improve joint-quantity estimation (intersection, union size, inclusion, and cosine similarity) with negligible loss of accuracy or bias, leveraging the underlying MinHash–Jaccard structure (Ertl, 2021).
- Information-theoretic optimality: Extensions such as the “supermajority” approach attain provably optimal time-space exponents for similarity search, improving on MinHash's exponents in both time and space in the random instance regime (Ahle et al., 2019).
7. Limitations, Variants, and Open Questions
- For small sets (of size comparable to or smaller than the signature length $k$), densification and mixture methods may outperform one-bin-per-hash techniques (Dahlgaard et al., 2017).
- Weighted schemes such as Chum’s exponential estimator introduce bias; CWS and variants provide unbiasedness but at sometimes higher algorithmic or storage costs (Wu et al., 2018).
- Extensions to spatial, continuous, or structured domains (e.g., PolyMinHash) preserve the central collision probability principle but introduce domain-specific sampling and computational challenges (Subedi et al., 20 Nov 2025, Moulton et al., 2018).
- Open questions remain in further reducing sketch size for strict streaming constraints and in fully characterizing hypercontractive-optimal filters in the LSH framework (Ahle et al., 2019, Wu et al., 2018).
The MinHash–Jaccard criterion thus provides a robust theoretical and algorithmic backbone for efficient, accurate, and scalable similarity computation across diverse domains and datatypes, and ongoing research continues to broaden its mathematical reach, computational efficiency, and practical impact (Subedi et al., 20 Nov 2025, Ertl, 2017, Li et al., 2021, Wu et al., 2018, Dahlgaard et al., 2017, Moulton et al., 2018, Zadeh et al., 2012, Ertl, 2021, Ahle et al., 2019).