
MinHash Deduplication Overview

Updated 10 January 2026
  • MinHash Deduplication is a randomized, locality-sensitive hashing approach that estimates Jaccard similarity via MinHash sketches for scalable near-duplicate detection.
  • It employs shingling, efficient signature computation, and banding techniques to sharply threshold candidate pairs in sublinear time and space.
  • Advanced variants and optimizations, including one-/two-permutation methods, GPU acceleration, and privacy-preserving extensions, support real-time, large-scale deduplication.

MinHash Deduplication is a class of randomized algorithms and data structures for identifying near-duplicate objects in massive datasets, based fundamentally on the Minwise Hashing (MinHash) locality-sensitive hashing (LSH) scheme. The central property of MinHash is that, for sets or binary vectors, the collision probability of their MinHash sketches equals their resemblance (Jaccard similarity), enabling scalable approximate deduplication and duplicate clustering with sublinear time and space overhead. Continued advances—including streaming variants, asymmetric and weighted schemes, memory-compressed sketches, and extreme-scale distributed pipelines—make MinHash a canonical tool not only for set deduplication but also for streaming, privacy-preserving, and real-time deduplication tasks.

1. Theoretical Foundations of MinHash Deduplication

MinHash operates on sets $W \subseteq \{1,\ldots,D\}$ or induced binary vectors. For a fixed random permutation $\pi$ over $\{1,\ldots,D\}$, the MinHash value is $h_\pi(W) = \min \pi(W)$. Extending to $k$ independent permutations ($k$ “hashes”), the MinHash signature is $[h_1(W), \ldots, h_k(W)]$.

The defining property, Broder’s theorem, is $\Pr[h_\pi(W_1) = h_\pi(W_2)] = \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|} = \mathcal{R}$, where $\mathcal{R}$ is the Jaccard/resemblance similarity. The unbiased estimator $\hat{\mathcal{R}}$ (the fraction of matching signature positions) converges as $k$ grows, with variance $\mathcal{R}(1-\mathcal{R})/k$ (Shrivastava et al., 2014).
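To make this concrete, here is a minimal sketch (assuming $k$ random affine hash functions over a Mersenne-prime field, the usual practical stand-in for true random permutations) that computes signatures and the match-fraction estimator $\hat{\mathcal{R}}$:

```python
import hashlib
import random

_PRIME = (1 << 61) - 1  # Mersenne prime serving as the hash range

def _token_id(token: str) -> int:
    # Stable 64-bit token fingerprint (Python's built-in hash() is
    # randomized per process, so a keyed digest is used instead).
    return int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")

def minhash_signature(tokens, k=128, seed=1):
    """k-position MinHash signature: k random affine hash functions
    stand in for k independent permutations."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, _PRIME), rng.randrange(_PRIME)) for _ in range(k)]
    return [min((a * _token_id(t) + b) % _PRIME for t in tokens) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Unbiased estimator: fraction of matching signature positions."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = set("minhash deduplication estimates jaccard similarity fast".split())
B = set("minhash deduplication approximates jaccard similarity fast".split())
print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))  # near 5/7
```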

MinHash is thus a locality-sensitive hash (LSH) for the Jaccard similarity and can also serve as an $(S_0, cS_0, p_1, p_2)$-sensitive LSH for cosine similarity via the bounds $S^2 \leq \mathcal{R} \leq S/(2-S)$, where $S$ is cosine similarity and $\mathcal{R}$ the resemblance (Shrivastava et al., 2014). This allows adjustment of banding parameters in deduplication LSH for non-binary (cosine) tasks as well.

2. MinHash Deduplication Pipelines and Scalability

A canonical deduplication pipeline using MinHash involves:

  1. Shingling: Represent documents/items via $n$-gram shingles, typically yielding sparse sets.
  2. MinHash Signature Computation: For each item, compute a $k$-dimensional MinHash sketch.
  3. LSH Banding: Partition the $k$-vector into $b$ bands of $r$ rows ($k = b \cdot r$). Within each band, items with identical sub-sketches become “candidate” pairs.
  4. Candidate Filtering: Compute the full Jaccard (or a refined) similarity only for candidate pairs and apply the threshold.
  5. Clustering: Disjoint-set forest clustering (union-find) may be used to aggregate duplicate clusters efficiently (Shenoy et al., 2017).

The collision probability after banding is $P_\mathrm{candidate} = 1 - (1 - s^r)^b$ for true Jaccard $s$ (Shenoy et al., 2017, Khan et al., 2024). This yields an “S-curve” in candidate probability, allowing sharp thresholding.
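A minimal sketch of the banding step (building on the `minhash_signature` helper above; bucket layout and pair emission are simplified for illustration) groups items whose sub-signature agrees in at least one band:

```python
from collections import defaultdict

def lsh_candidates(signatures, b=16, r=8):
    """signatures: dict mapping item id -> k-length MinHash signature,
    with k = b * r. Two items become a candidate pair if any of the
    b bands of r consecutive positions matches exactly."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for item_id, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(item_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

def candidate_probability(s, b=16, r=8):
    """The S-curve: probability that a pair with true Jaccard s
    collides in at least one band."""
    return 1 - (1 - s ** r) ** b
```

For $b = 16$, $r = 8$, a pair at $s = 0.9$ is surfaced with probability roughly 0.9999, while a pair at $s = 0.5$ collides with probability only about 0.06, which is the sharp thresholding behavior described above.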

Highly scalable implementations use wide-column distributed stores (e.g., Cassandra), SIMD vectorization for signature operations, and GPU-accelerated pipelines for both MinHash computation and post-banding candidate comparison (Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025, Shenoy et al., 2017). LSH band tables may be stored as flat on-disk or in-memory arrays for high throughput.

3. Memory and Computation-Efficient MinHash Variants

Classical MinHash uses 32–64 bits per hash value, with $k$ in the 100–1024 range. Advanced variants address memory and computational bottlenecks:

  • One-/Two-Permutation MinHash: C-MinHash reduces the required random permutations from $K$ to 2, or even to 1 (with negligible bias), via circulant shifting, yielding unbiased or nearly unbiased signatures with strictly lower variance than classical MinHash (Li et al., 2021, Li et al., 2021). This reduces storage from $O(KD)$ to $O(D)$ and per-vector computation from $O(Kz)$ to $O(z+K)$ for sparsity $z$.
  • b-bit MinHash & HyperMinHash: Storage is compressed by retaining only the lowest $b$ bits of each hash, or by encoding the minimal hash in floating-point (LogLog) style; a sketch of the b-bit variant follows this list. HyperMinHash attains $O(\log\log n)$ space per bucket, remains mergeable/streamable, and allows Jaccard estimation for sets up to size $n \approx 10^{19}$ within tight error budgets (Yu et al., 2017).
  • MaxLogHash: Designed for streaming/high-similarity cases, MaxLogHash stores in each register the integer part of $-\log_2 h_j(i)$ across elements $i$; it uses just 6–7 bits per register and is unbiased for high Jaccard similarity (Wang et al., 2019). MaxLogHash supports one-pass streaming updates and distributed merging.
  • DotHash: For weighted or IDF-based deduplication, DotHash sketches sets as sum-vectors of random sign-embeddings weighted by $\sqrt{f(x)}$ for token $x$; the dot product of two sketches yields an unbiased estimator for the weighted intersection and, by extension, the Jaccard index. DotHash is especially useful for hybrid set/weighted deduplication, outperforming MinHash and SimHash in accuracy for IDF-based retrieval (Nunes et al., 2023).
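As promised above, a minimal b-bit compression sketch. The estimator inverts the first-order relation $m \approx \mathcal{R} + (1-\mathcal{R})\,2^{-b}$ between the observed match fraction $m$ and the true resemblance; this assumes sets are small relative to the hash range (the exact correction terms appear in the b-bit minwise hashing literature):

```python
def bbit_signature(sig, b=4):
    """Keep only the lowest b bits of each MinHash value,
    shrinking a 64-bit-per-position sketch by a factor of 64/b."""
    mask = (1 << b) - 1
    return [v & mask for v in sig]

def estimate_jaccard_bbit(bsig_a, bsig_b, b=4):
    """Truncated values also collide by chance with probability ~2^-b,
    so the raw match fraction m overestimates R; invert
    m = R + (1 - R) * 2^-b to debias."""
    m = sum(x == y for x, y in zip(bsig_a, bsig_b)) / len(bsig_a)
    c = 2.0 ** -b
    return max(0.0, (m - c) / (1 - c))
```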

A summary of memory and computational characteristics:

| Method | Per-sketch space | Per-item update time | Union/Merge | Streaming support |
|---|---|---|---|---|
| MinHash | $O(k \log n)$ | $O(k)$ | $\min$ | Yes |
| b-bit MinHash | $O(kb)$ | $O(k)$ | No | Yes |
| HyperMinHash | $O(k(\log\log n + r))$ | $O(k)$ | $\min_\mathrm{fp}$ | Yes |
| MaxLogHash | $O(7k)$ | $O(k)$ | $\max$ | Yes |
| DotHash | $O(d)$ (float) | $O(d\lvert A\rvert)$ | sum | Yes |

4. Privacy-Preserving and Encrypted Deduplication

In encrypted deduplication, deterministic encryption exposes chunk frequency distributions, leading to inference attacks via frequency analysis. MinHash-based encryption, applied to segments (groups of chunks), breaks this link: each segment’s ciphertext depends on the minimum fingerprint among its plaintext chunks (per MinHash), so the mapping from plaintext to ciphertext-fingerprint is many-to-one (Li et al., 2019). This “frequency coarsening” sharply reduces information leakage: inference rates drop from up to 33.6% to 0.2–0.3% (with scrambling) at the cost of a few percent higher storage overhead. When combined with segment-wise random scrambling, the scheme further thwarts attacks that exploit chunk locality.
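A minimal sketch of the key-derivation idea (helper names hypothetical; the actual scheme in Li et al., 2019 builds on server-aided message-locked encryption) derives each segment's key from the minimum chunk fingerprint, so distinct segments sharing that minimum encrypt under the same key and their frequencies blur together:

```python
import hashlib

def segment_key(chunks: list) -> bytes:
    """Derive a segment-level encryption key from the minimum chunk
    fingerprint, i.e., the MinHash of the segment's chunk set. The
    plaintext-to-key mapping is many-to-one, coarsening the frequency
    distribution visible to a frequency-analysis adversary."""
    fingerprints = [hashlib.sha256(c).digest() for c in chunks]
    return hashlib.sha256(b"segment-key|" + min(fingerprints)).digest()
```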

5. Large-Scale, Real-Time, and Hardware-Accelerated MinHash Deduplication

At massive scale (hundreds of millions to billions of items), both CPU-based and traditional LSH-index-based MinHash deduplication become IO- and memory-bound. Two key developments have enabled extreme scalability:

  • LSHBloom (Khan et al., 2024): Replaces expensive per-band LSH indexes with per-band Bloom filters while preserving candidate-generation guarantees and extremely low false-positive rates (as low as $10^{-5}$); a sketch follows this list. This reduces index storage by up to $54\times$ with a $2.5$–$3\times$ runtime speedup for multi-billion-document deduplication.
  • GPU-accelerated frameworks (FED) (Son et al., 2 Jan 2025): Use rolling polynomial hashes to generate MinHash signatures highly efficiently on the GPU, with signature generation and post-LSH bucketed comparison fully parallelized. FED achieves up to $107\times$ the CPU baseline speed, deduplicating 1.2 trillion tokens in 6 hours on 16 GPUs, while identifying duplicate sets nearly identical to those of classical MinHash LSH (agreement at Jaccard $> 0.96$).
  • SIMD vectorization (Muniyappa et al., 20 Feb 2025): In software, AVX2/AVX-512 vectorizes MinHash sketch operations and set union/intersection, yielding a $4\times$ speedup in the intersection step and reducing end-to-end deduplication/aggregation latency from hours to seconds.
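A minimal sketch of the LSHBloom idea (filter sizing illustrative only; the real system tunes bit counts to corpus size and target false-positive rate) replaces each band's bucket table with a Bloom filter and flags an item when any band's sub-signature has been seen before:

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter over byte-string keys."""
    def __init__(self, n_bits=1 << 24, n_hashes=7):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, key: bytes):
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(key, digest_size=8,
                                     salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def seen_before(sig, band_filters, r=8):
    """Insert each band's sub-signature into its own Bloom filter;
    a prior hit in any band marks the item as a duplicate candidate,
    so no per-band bucket index needs to be stored."""
    hit = False
    for band, bf in enumerate(band_filters):
        key = repr(sig[band * r:(band + 1) * r]).encode()
        if key in bf:
            hit = True
        else:
            bf.add(key)
    return hit
```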

6. Extensions: Weighted, Asymmetric, and Streaming MinHash Deduplication

Standard MinHash is optimal for Jaccard similarity with unbiased estimates and collision probabilities monotonic in set resemblance. Recent advances extend MinHash deduplication in several directions:

  • Asymmetric Minwise Hashing (MH-ALSH) (Shrivastava et al., 2014): For applications requiring monotonicity in set overlap (inner product), asymmetric zero/one padding of sets pre- and post-hashing ensures the collision probability is exactly $a/(2M-a)$ (where $a = |A \cap B|$ and $M$ is the padding bound on set size), removing size bias and yielding improved recall/precision for set-containment queries.
  • Generalized/Weighted MinHash: Consistent sampling algorithms, such as the Exponential Race, extend MinHash to nonnegative or probability-weighted vectors, with a collision probability $J_P(x,y)$ that is provably Pareto-optimal among sampling-based LSH schemes (Moulton et al., 2018); a sampler sketch follows this list.
  • Dynamic/Adaptive Filters and Early Termination: Dynamic threshold filters, derived from binomial tails, allow early pruning or acceptance of pairs as candidates before all $k$ hash comparisons have been computed, maintaining error probability below $e$ (e.g., $10^{-5}$) and reducing total compute time by $2$–$3\times$ without recall or precision degradation (Long et al., 2018).
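As referenced above, a minimal exponential-race sampler (helper names hypothetical). Sharing the per-token uniform draws across sets is what makes the sampling consistent and turns the winner-collision probability into a similarity measure:

```python
import hashlib
import math

def _uniform(token: str, seed: int) -> float:
    """Pseudo-random uniform in (0, 1) keyed by (seed, token); every
    set sees the identical draw for the same token, which makes the
    sampling consistent across sets."""
    h = hashlib.blake2b(f"{seed}|{token}".encode(), digest_size=8).digest()
    return (int.from_bytes(h, "big") + 1) / (2 ** 64 + 2)

def exp_race_sample(weights: dict, seed: int) -> str:
    """Exponential race: -log(u)/w is Exponential(rate=w), so the token
    with the minimum arrival time is a sample proportional to weight."""
    return min(weights, key=lambda t: -math.log(_uniform(t, seed)) / weights[t])

def weighted_signature(weights: dict, k: int = 128) -> list:
    """One race per hash seed yields a k-position weighted signature."""
    return [exp_race_sample(weights, seed) for seed in range(k)]
```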

7. Practical Guidelines and Comparative Performance

Key parameter guidelines for high-recall, scalable MinHash deduplication:

  • Sketch length $k$: For typical document/image/text deduplication, $k = 100$–$200$ suffices for recall $\geq 0.9$ in sparse data (Shrivastava et al., 2014, Shenoy et al., 2017).
  • (Banding) LSH parameters: Choose band count $b$ and rows per band $r$ to place the “S-curve” at the desired Jaccard threshold (a selection sketch follows this list); $b = 50$, $r = 2$, $k = 100$ for clinical text; $b = 16$, $r = 8$, $k = 128$ for LLM-scale deduplication (Shenoy et al., 2017, Son et al., 2 Jan 2025, Khan et al., 2024).
  • b-bit/HyperMinHash: For memory-constrained scenarios, $b = 4$ or $8$, or $(q+r) \approx 16$ bits per bucket via HyperMinHash, provides near-identical performance with $4$–$8\times$ smaller sketches (Yu et al., 2017).
  • Streaming/Real-time: MaxLogHash and HyperMinHash are drop-in replacements for MinHash in both streaming and distributed deployments, and support distributed mergeability (Wang et al., 2019, Yu et al., 2017).
  • Hardware acceleration: Use SIMD, rolling hashes, and GPU kernels to achieve real-time throughput at billion scale (Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025).
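A small search over factorizations of $k$ (an illustrative heuristic, similar in spirit to what common MinHash LSH libraries do; the equal weighting of the two error masses is an assumption) picks $(b, r)$ so the S-curve transitions at the target threshold:

```python
def s_curve(s, b, r):
    """Probability that a pair with true Jaccard s becomes a candidate."""
    return 1 - (1 - s ** r) ** b

def choose_bands(k, threshold, steps=1000):
    """Among (b, r) with b * r = k, minimize the S-curve mass accepted
    below the threshold (false positives) plus the mass rejected
    above it (false negatives), both approximated by Riemann sums."""
    best, best_err = None, float("inf")
    for r in range(1, k + 1):
        if k % r:
            continue
        b = k // r
        cut = int(threshold * steps)
        fp = sum(s_curve(i / steps, b, r) for i in range(cut))
        fn = sum(1 - s_curve(i / steps, b, r) for i in range(cut, steps))
        if (fp + fn) / steps < best_err:
            best, best_err = (b, r), (fp + fn) / steps
    return best

print(choose_bands(128, 0.8))  # a (b, r) pair whose S-curve pivots near 0.8
```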

Empirically, MinHash deduplication scans $5$–$10\times$ fewer candidates for the same recall compared to SimHash, can be accelerated $>100\times$ on GPU, and offers sublinear computational complexity even for trillion-scale deduplication (Shrivastava et al., 2014, Son et al., 2 Jan 2025).


MinHash deduplication remains the gold standard for scalable, high-recall set, document, and chunk-level deduplication, with a large corpus of theoretical analysis, practical scaling recipes, hardware optimizations, and empirical validations across domains (Shrivastava et al., 2014, Shenoy et al., 2017, Li et al., 2019, Khan et al., 2024, Yu et al., 2017, Li et al., 2021, Li et al., 2021, Wang et al., 2019, Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025, Shrivastava et al., 2014, Nunes et al., 2023, Moulton et al., 2018, Long et al., 2018).
