
MinHash–Jaccard Criterion

Updated 8 January 2026
  • MinHash–Jaccard criterion is a method that equates the collision probability of hash signatures with Jaccard similarity, applicable to sets, multisets, and spatial objects.
  • It provides unbiased estimators using independent hash functions and incorporates variance reduction techniques for improved accuracy in similarity estimation.
  • Widely used in data mining and information retrieval, it supports approximate nearest neighbor search, duplicate detection, and efficient sketching in large-scale systems.

The MinHash–Jaccard criterion underpins a widely used family of techniques for efficient set similarity estimation and similarity search in large-scale data mining and information retrieval. It rests on the equivalence between the probability that two sets, polygons, or weighted objects produce the same MinHash value and their Jaccard (resemblance) similarity. This criterion immediately yields unbiased estimators for Jaccard similarity and enables locality-sensitive hashing (LSH) constructions, approximate nearest neighbor (ANN) search, and a variety of advanced sketching and sampling methodologies. The criterion has been extended from unweighted sets to weighted multisets, continuous measures, and even domain-specific objects such as polygons in spatial databases (Subedi et al., 20 Nov 2025, Wu et al., 2018, Ertl, 2017, Li et al., 2021, Dahlgaard et al., 2017, Zadeh et al., 2012, Moulton et al., 2018).

1. Formal Definition of the MinHash–Jaccard Criterion

Let A and B be two objects (typically finite sets, multisets, or measurable regions) in a universe U. Their Jaccard similarity is defined as

J(A, B) = \frac{|A \cap B|}{|A \cup B|}.

For polygons (A, B \subseteq \mathbb{R}^2), we have

J(A, B) = \frac{\operatorname{Area}(A \cap B)}{\operatorname{Area}(A \cup B)}.

A MinHash function h maps sets to signatures such that

\Pr[h(A) = h(B)] = J(A, B).

For binary sets, a classical MinHash applies a random permutation \pi to U and sets h_\pi(S) = \arg\min_{x \in S} \pi(x). For polygons, MinHash is instantiated by sampling uniform random points in a bounding rectangle and counting the number of samples until the first falls within the object (Subedi et al., 20 Nov 2025).
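As a sanity check on the permutation construction, a small simulation (a sketch with toy sets; all names here are illustrative, not from the cited works) reproduces the collision probability:

```python
import random

def minhash_signature(s, perms):
    """One coordinate per permutation: the minimal rank of any element of s."""
    return [min(p[x] for x in s) for p in perms]

# Toy universe and two overlapping sets: |A ∩ B| = 6, |A ∪ B| = 20.
U = list(range(20))
A = set(range(0, 12))
B = set(range(6, 20))
true_j = len(A & B) / len(A | B)  # = 0.3

random.seed(0)
k = 2000
perms = []
for _ in range(k):
    order = U[:]
    random.shuffle(order)  # one random permutation of the universe per hash
    perms.append({x: rank for rank, x in enumerate(order)})

sig_a = minhash_signature(A, perms)
sig_b = minhash_signature(B, perms)
est = sum(a == b for a, b in zip(sig_a, sig_b)) / k  # ≈ 0.3
```

The fraction of matching coordinates concentrates around J(A, B) as k grows, as formalized in the next section.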

2. Collision Probability and Estimation Properties

The core theoretical result is that for independent hash functions (or sampling seeds), the collision probability for any single MinHash coordinate equals the Jaccard similarity:

\Pr[h_i(A) = h_i(B)] = J(A, B),

for all i (Subedi et al., 20 Nov 2025, Li et al., 2021, Wu et al., 2018).

Given k independent MinHash components, define X_i as the indicator that the i-th coordinate matches, so

\widehat{J} = \frac{1}{k}\sum_{i=1}^k X_i

provides an unbiased estimator:

\mathbb{E}[\widehat{J}] = J(A, B), \quad \operatorname{Var}(\widehat{J}) = \frac{J(1-J)}{k}.

Concentration bounds follow from Hoeffding's or Chernoff's inequality. For any \varepsilon > 0,

\Pr\left[|\widehat{J} - J| \geq \varepsilon\right] \leq 2 \exp(-2k\varepsilon^2)

(Subedi et al., 20 Nov 2025, Dahlgaard et al., 2017, Wu et al., 2018).
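The Hoeffding bound can be inverted to size a signature for a target accuracy (a minimal sketch; the helper name is invented here):

```python
import math

def signatures_needed(eps, delta):
    """Smallest k with 2*exp(-2*k*eps^2) <= delta, inverting the
    Hoeffding bound Pr[|J_hat - J| >= eps] <= 2*exp(-2*k*eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

k = signatures_needed(0.05, 0.01)             # hashes for ±0.05 at 99% confidence
failure_bound = 2.0 * math.exp(-2.0 * k * 0.05 ** 2)  # <= 0.01 by construction
```

Note the required k depends only on the error tolerance, not on the set sizes.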

This property extends to generalized criteria, e.g., generalized Jaccard for weighted sets, polygonal intersection for spatial objects, and probability distribution analogues (Wu et al., 2018, Moulton et al., 2018).

3. Algorithmic Realizations: Classical, PolyMinHash, and Advanced Schemes

Standard MinHash

For k hash functions, sketch each set S as (h_1(S), \dots, h_k(S)) with h_j(S) = \min_{x \in S} h_j(x). Jaccard estimation reduces to the fraction of coordinate-wise collisions (Wu et al., 2018).

PolyMinHash for Area-based Similarity

For a polygon P within bounding rectangle B:

  • For each of k seeds, repeatedly sample (x, y) \sim \mathrm{Uniform}(B)
  • Count attempts until a sampled point lands in P
  • h_i(P) is the attempt number for the i-th seed

Signatures are compared via collision counts; the collision probability matches area-based Jaccard (Subedi et al., 20 Nov 2025). See the pseudocode in the original paper for details.
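The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: rectangles stand in for general polygons, and the function names are invented; a real system would plug in a point-in-polygon test.

```python
import random

def poly_minhash(contains, bbox, k, base_seed=0):
    """For each of k seeds, count uniform samples in the bounding box until
    one lands inside the shape; that count is the coordinate's hash value.
    Seeds must be shared across objects so sample sequences are aligned."""
    x0, y0, x1, y1 = bbox
    sig = []
    for i in range(k):
        rng = random.Random(base_seed * 1_000_003 + i)
        attempts = 1
        while not contains(rng.uniform(x0, x1), rng.uniform(y0, y1)):
            attempts += 1
        sig.append(attempts)
    return sig

# Two overlapping axis-aligned rectangles inside the unit square:
# Area(A) = 0.6, Area(B) = 0.7, Area(A ∩ B) = 0.3, Area(A ∪ B) = 1.0.
in_a = lambda x, y: x <= 0.6
in_b = lambda x, y: x >= 0.3
true_j = 0.3  # Area(A ∩ B) / Area(A ∪ B)

k = 3000
sig_a = poly_minhash(in_a, (0.0, 0.0, 1.0, 1.0), k)
sig_b = poly_minhash(in_b, (0.0, 0.0, 1.0, 1.0), k)
est = sum(a == b for a, b in zip(sig_a, sig_b)) / k  # ≈ 0.3
```

A match at coordinate i occurs exactly when the first shared sample landing in A ∪ B lies in A ∩ B, which is the area-based Jaccard event.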

Weighted and Probabilistic Extensions

Weighted MinHash via Consistent Weighted Sampling (CWS) and variants produces unbiased estimators of weighted Jaccard similarity:

J_w(w, w') = \frac{\sum_i \min(w_i, w'_i)}{\sum_i \max(w_i, w'_i)}

Implementations such as ICWS, PCWS, 0-bit CWS, and I²CWS provide efficiency/accuracy trade-offs in large-scale settings (Wu et al., 2018). Probability distribution MinHash generalizes the collision probability using the “maximally consistent sampling” criterion (Moulton et al., 2018).
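The weighted Jaccard similarity itself is straightforward to compute exactly for small sparse vectors (a minimal sketch; the dict representation and function name are choices made here, not from the cited works):

```python
def weighted_jaccard(w, wp):
    """Generalized Jaccard: sum of coordinate-wise minima over maxima.
    Missing keys are treated as weight 0."""
    keys = set(w) | set(wp)
    num = sum(min(w.get(i, 0.0), wp.get(i, 0.0)) for i in keys)
    den = sum(max(w.get(i, 0.0), wp.get(i, 0.0)) for i in keys)
    return num / den if den else 1.0

jw = weighted_jaccard({"a": 2.0, "b": 1.0}, {"a": 1.0, "c": 3.0})
# min-sum = 1 + 0 + 0 = 1; max-sum = 2 + 1 + 3 = 6, so jw = 1/6
```

With 0/1 weights this reduces to the classical set Jaccard, which is what the CWS family estimates via hashing.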

Variance-Reduced and Efficient Schemes

  • SuperMinHash: Uses permutation-based value shifts for variance reduction (factor \alpha(m, u) < 1, which can halve the variance when |A \cup B| \ll m) and achieves O(1) amortized insertion time for large sets (Ertl, 2017).
  • Circulant MinHash (C-MinHash): Employs one or two permutations with circulant shifts so that the estimator is unbiased (two-permutation) or has negligible bias (one-permutation), while provably achieving lower variance than independent MinHash (Li et al., 2021).
  • Dimension-Independent/Distributed: Strategies such as DISCO (MinHashSampleMap) yield unbiased estimators while reducing communication in MapReduce settings from O(NLk) to O(Dk \log(Dk)) without loss in accuracy (Zadeh et al., 2012).
  • Fast Sketching: “Mixture” sketches construct t-length signatures in O(t \log t + |A|) time but guarantee the same collision and concentration properties as classical MinHash (Dahlgaard et al., 2017).

4. Extensions: Weighted, Continuous, and Spatial Objects

The MinHash–Jaccard criterion generalizes beyond finite sets:

  • Weighted Sets (Generalized Jaccard): Estimators and the collision probability retain the form \Pr[h(A) = h(B)] = J_w(w, w') for weighted objects via CWS and its improved variants. All key approaches, including quantization, active index sampling, and CWS, yield unbiased (or asymptotically unbiased) estimators, with per-coordinate complexities between O(n) and O(\sum_i \log w_i) depending on the method (Wu et al., 2018).
  • Probability Distributions: The generalized collision formula for nonnegative vectors (including probability distributions) is

J(x, y) = \sum_{i:\,x_i>0,\,y_i>0} \frac{1}{\sum_j \max(x_j/x_i,\,y_j/y_i)}

which reduces to classical Jaccard when x, y are binary. This is Pareto-optimal among LSHs based on sampling (Moulton et al., 2018).

  • Spatial Domains: PolyMinHash adapts MinHash to polygonal objects by replacing set membership with geometric inclusion and uniform permutation with uniform random sampling in \mathbb{R}^2 (Subedi et al., 20 Nov 2025). The collision probability still exactly recovers area-based Jaccard.
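The generalized collision formula for nonnegative vectors can be checked directly on small examples (a sketch; the function name is invented here). For binary vectors it reduces to the classical ratio |A ∩ B| / |A ∪ B|:

```python
def generalized_jaccard(x, y):
    """Collision probability of maximally consistent sampling: sum over
    shared-support coordinates i of 1 / sum_j max(x_j/x_i, y_j/y_i)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        if x[i] > 0 and y[i] > 0:
            total += 1.0 / sum(max(x[j] / x[i], y[j] / y[i]) for j in range(n))
    return total

j_bin = generalized_jaccard([1, 1, 0], [0, 1, 1])    # binary case: 1/3
j_same = generalized_jaccard([0.5, 0.5], [0.5, 0.5])  # identical distributions: 1
```

In the binary case each shared-support term contributes 1/|A ∪ B|, and summing over the |A ∩ B| such coordinates recovers the set Jaccard.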

5. Empirical Analysis and Search Trade-offs

Precision and runtime in MinHash–Jaccard-based systems hinge on the signature length k:

  • Increasing k reduces estimator variance and hence false positive rates, but increases computation and space costs (Subedi et al., 20 Nov 2025, Ertl, 2017, Wu et al., 2018).
  • In PolyMinHash, with m=1, 70–75% data pruning is achieved at high recall (2–3× speedup); with m=3, up to 89% pruning and 4–5× speedup at recall 0.88–0.93; with m=5, 98% pruning and up to 6–7× speedup, but recall drops to 0.60–0.71 (Subedi et al., 20 Nov 2025).
  • SuperMinHash and C-MinHash consistently yield strictly lower estimator variance than classical MinHash. SuperMinHash achieves up to 2× lower variance for |A \cup B| < m, and C-MinHash achieves uniform variance reduction for all J (Ertl, 2017, Li et al., 2021).
  • Fast sketching schemes and distributed implementations preserve concentration while reducing runtime or communication (Zadeh et al., 2012, Dahlgaard et al., 2017).

6. Applications and Theoretical Significance

The MinHash–Jaccard criterion is foundational for:

  • Approximate nearest neighbor (ANN) search (text, spatial databases, trajectory matching): Locality-sensitive signatures enable sublinear filtering and massive data pruning (Subedi et al., 20 Nov 2025, Dahlgaard et al., 2017).
  • Large-scale duplicate detection in document or entity collections (Zadeh et al., 2012).
  • Distributed and streaming systems: Efficiently sketched similarity enables real-time analytics in big-data settings (Zadeh et al., 2012).
  • Geometric/joint estimators: SetSketch and related approaches further improve joint-quantity estimation (intersection size, union size, inclusion, and cosine similarity) with negligible loss of accuracy or bias, leveraging the underlying MinHash–Jaccard structure (Ertl, 2021).
  • Information-theoretic optimality: Extensions such as the “supermajority” approach attain provably optimal time-space exponents for similarity search (improving on MinHash by up to n^{0.14} in both time and space) in the random instance regime (Ahle et al., 2019).

7. Limitations, Variants, and Open Questions

  • For small sets (|A| \ll k), densification and mixture methods may outperform one-bin-per-hash techniques (Dahlgaard et al., 2017).
  • Weighted schemes such as Chum’s exponential estimator introduce bias; CWS and variants provide unbiasedness but at sometimes higher algorithmic or storage costs (Wu et al., 2018).
  • Extensions to spatial, continuous, or structured domains (e.g., PolyMinHash) preserve the central collision probability principle but introduce domain-specific sampling and computational challenges (Subedi et al., 20 Nov 2025, Moulton et al., 2018).
  • Open questions remain in further reducing sketch size for strict streaming constraints and in fully characterizing hypercontractive-optimal filters in the LSH framework (Ahle et al., 2019, Wu et al., 2018).

The MinHash–Jaccard criterion thus provides a robust theoretical and algorithmic backbone for efficient, accurate, and scalable similarity computation across diverse domains and datatypes, and ongoing research continues to broaden its mathematical reach, computational efficiency, and practical impact (Subedi et al., 20 Nov 2025, Ertl, 2017, Li et al., 2021, Wu et al., 2018, Dahlgaard et al., 2017, Moulton et al., 2018, Zadeh et al., 2012, Ertl, 2021, Ahle et al., 2019).
