
MinHash Deduplication Overview

Updated 10 January 2026
  • MinHash Deduplication is a randomized, locality-sensitive hashing approach that estimates Jaccard similarity via MinHash sketches for scalable near-duplicate detection.
  • It employs shingling, efficient signature computation, and banding techniques to sharply threshold candidate pairs in sublinear time and space.
  • Advanced variants and hardware optimizations like one-/two-permutation methods, GPU acceleration, and privacy-preserving extensions enhance real-time, large-scale deduplication.

MinHash Deduplication is a class of randomized algorithms and data structures for identifying near-duplicate objects in massive datasets, based fundamentally on the Minwise Hashing (MinHash) locality-sensitive hashing (LSH) scheme. The central property of MinHash is that, for sets or binary vectors, the collision probability of their MinHash sketches equals their resemblance (Jaccard similarity), enabling scalable approximate deduplication and duplicate clustering with sublinear time and space overhead. Continued advances—including streaming variants, asymmetric and weighted schemes, memory-compressed sketches, and extreme-scale distributed pipelines—make MinHash a canonical tool not only for set deduplication but also for streaming, privacy-preserving, and real-time deduplication tasks.

1. Theoretical Foundations of MinHash Deduplication

MinHash operates on sets $W \subseteq \{1,\ldots,D\}$ or the binary vectors they induce. For a fixed random permutation $\pi$ over $\{1,\ldots,D\}$, the MinHash value is $h_\pi(W) = \min \pi(W)$. Extending to $k$ independent permutations ($k$ "hashes"), the MinHash signature is $[h_1(W), \ldots, h_k(W)]$.

The defining property, Broder's theorem, is
$$\Pr[h_\pi(W_1) = h_\pi(W_2)] = \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|} = \mathcal{R},$$
where $\mathcal{R}$ is the Jaccard (resemblance) similarity. The unbiased estimator $\hat{\mathcal{R}}$ (the fraction of matching signature positions) concentrates as $k$ grows, with variance $\mathcal{R}(1-\mathcal{R})/k$ (Shrivastava et al., 2014).

MinHash is thus a locality-sensitive hash (LSH) family for Jaccard similarity, and for binary data it can also serve as a provable LSH for cosine similarity via two-sided bounds relating the resemblance $\mathcal{R}$ to the cosine similarity $S$ (Shrivastava et al., 2014). This allows banding parameters in deduplication LSH to be tuned for non-binary (cosine) tasks as well.
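Broder's property can be checked empirically. The sketch below is illustrative only: it substitutes random linear hash functions $(ax + b) \bmod p$ for true random permutations, a standard practical approximation.

```python
import random

def minhash_signature(items, k, seed=0):
    """k-hash MinHash sketch using random linear hash functions
    (a*x + b) mod p as a stand-in for true random permutations."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large Mersenne prime
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in coeffs]

def estimate_jaccard(sig1, sig2):
    """Fraction of matching positions: unbiased estimator of resemblance."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

A = set(range(0, 100))
B = set(range(20, 120))           # true Jaccard = 80 / 120 ≈ 0.667
sA = minhash_signature(A, k=256)
sB = minhash_signature(B, k=256)
print(estimate_jaccard(sA, sB))   # close to 0.667
```

With $k = 256$ the estimator's standard deviation is about $\sqrt{0.667 \cdot 0.333 / 256} \approx 0.03$, so the printed value lands near the true resemblance.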

2. MinHash Deduplication Pipelines and Scalability

A canonical deduplication pipeline using MinHash involves:

  1. Shingling: Represent documents/items as sets of $w$-gram shingles, typically yielding sparse sets.
  2. MinHash Signature Computation: For each item, compute a $k$-dimensional MinHash sketch.
  3. LSH Banding: Partition the $k$-vector into $b$ bands of $r$ rows ($k = b \cdot r$). Within each band, items with identical sub-sketches are "candidate" pairs.
  4. Candidate Filtering: Only candidate pairs have their full Jaccard (or refined) similarity thresholded.
  5. Clustering: Disjoint-set forest clustering (union-find) may be used to aggregate duplicate clusters efficiently (Shenoy et al., 2017).

The collision probability after banding is $1 - (1 - \mathcal{R}^r)^b$ for true Jaccard similarity $\mathcal{R}$ (Shenoy et al., 2017, Khan et al., 2024). This yields an "S-curve" in candidate probability, allowing sharp thresholding.
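The five pipeline steps can be sketched end to end. The snippet below is a minimal illustration (toy character shingling and hash functions, not a production pipeline); document names and parameter values are examples only.

```python
import random
from collections import defaultdict

P = (1 << 61) - 1

def minhash(items, k=128, seed=0):
    """k-hash MinHash sketch via random linear hash functions."""
    rng = random.Random(seed)
    hs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    return [min((a * hash(x) + b) % P for x in items) for a, b in hs]

def shingles(text, w=5):
    """Character w-gram shingle set."""
    return {text[i:i + w] for i in range(len(text) - w + 1)}

def dedup_clusters(docs, k=128, b=32, r=4, threshold=0.8):
    """Shingle -> sketch -> band -> verify candidates -> union-find clusters."""
    sigs = {d: minhash(shingles(t), k) for d, t in docs.items()}
    parent = {d: d for d in docs}
    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    buckets = defaultdict(list)       # band-wise hash tables
    for d, sig in sigs.items():
        for band in range(b):
            buckets[(band, tuple(sig[band * r:(band + 1) * r]))].append(d)
    for ids in buckets.values():      # only candidate pairs are verified
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a, c = ids[i], ids[j]
                sim = sum(x == y for x, y in zip(sigs[a], sigs[c])) / k
                if sim >= threshold:
                    parent[find(a)] = find(c)
    return {d: find(d) for d in docs}

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog !",
    "c": "minhash sketches estimate jaccard similarity fast",
}
clusters = dedup_clusters(docs)
print(clusters["a"] == clusters["b"])   # near-duplicates share a cluster
```

Note that banding only generates candidates; the final similarity check prevents unrelated documents that happen to collide in a band from being merged.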

Highly scalable implementations use wide-column distributed stores (e.g., Cassandra), SIMD vectorization for signature operations, and GPU-accelerated pipelines for both MinHash computation and post-banding candidate comparison (Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025, Shenoy et al., 2017). LSH band tables may be stored as flat disk or in-memory arrays for high throughput.

3. Memory and Computation-Efficient MinHash Variants

Classical MinHash uses 32–64 bits per hash value, with $k$ typically in the 100–1024 range. Advanced variants address memory and computational bottlenecks:

  • One-/Two-Permutation MinHash: C-MinHash reduces the required number of random permutations from $k$ to 2, or even to 1 (with negligible bias), via circulant shifting, yielding unbiased or nearly unbiased signatures with strictly lower variance than classical MinHash (Li et al., 2021, Li et al., 2021). This removes the cost of generating and applying $k$ independent permutations per vector, leaving one permutation pass plus $k$ circulant re-uses for a vector with $f$ nonzeros.
  • b-bit MinHash & HyperMinHash: Storage is compressed by retaining only the lowest $b$ bits of each hash, or by encoding the minimal hash in floating-point (LogLog) style. HyperMinHash attains $O(\log\log n)$ space per bucket, remains mergeable/streamable, and supports Jaccard estimation for sets of cardinality up to $n$ within tight error budgets (Yu et al., 2017).
  • MaxLogHash: Designed for streaming/high-similarity cases, MaxLogHash stores in each register only the integer part of a log-scale transform of the extreme hash value over stream elements $e$; it uses just 6–7 bits per register and is unbiased for high Jaccard similarity (Wang et al., 2019). MaxLogHash supports one-pass streaming updates and distributed merging.
  • DotHash: For weighted or IDF-based deduplication, DotHash sketches sets as sum-vectors of random sign-embeddings weighted by $\mathrm{idf}(t)$ for token $t$; the dot product of two sketches is an unbiased estimator of the weighted intersection and, by extension, the Jaccard index. DotHash is especially useful for hybrid set/weighted deduplication, outperforming MinHash and SimHash in accuracy for IDF-based retrieval (Nunes et al., 2023).
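The b-bit compression in the list above can be illustrated directly. This sketch is simplified (it assumes a large universe, so accidental low-bit collisions between non-matching positions occur with probability about $2^{-b}$) and is not the papers' optimized implementation:

```python
import random

P = (1 << 61) - 1

def minhash(items, k=512, seed=1):
    """Plain k-hash MinHash sketch (random linear hash functions)."""
    rng = random.Random(seed)
    hs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    return [min((a * hash(x) + b) % P for x in items) for a, b in hs]

def bbit(sig, b=2):
    """Keep only the lowest b bits of each hash: k*b bits total."""
    return [v & ((1 << b) - 1) for v in sig]

def estimate_bbit(s1, s2, b=2):
    """Correct the raw match rate for accidental low-bit collisions,
    which inflate it by roughly (1 - R) * 2^-b (simplified correction)."""
    m = sum(x == y for x, y in zip(s1, s2)) / len(s1)
    c = 2.0 ** -b
    return (m - c) / (1 - c)

A, B = set(range(1000)), set(range(500, 1500))   # true Jaccard = 1/3
sA, sB = bbit(minhash(A), 2), bbit(minhash(B), 2)
print(estimate_bbit(sA, sB, 2))   # roughly 1/3, from 2 bits per hash
```

Truncation trades a modest variance increase for a 16–32x storage reduction relative to 32–64-bit hash values, which is why larger $k$ is typically paired with smaller $b$.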

A summary of memory and computational characteristics:

| Method | Per-sketch space | Per-item update time | Union/Merge | Streaming support |
|---|---|---|---|---|
| MinHash | $k$ × 32–64 bits | $O(k)$ | coordinate-wise min | Yes |
| b-bit MinHash | $kb$ bits | $O(k)$ | No | Yes |
| HyperMinHash | $O(k \log\log n)$ bits | $O(k)$ | Yes | Yes |
| MaxLogHash | 6–7 bits per register | $O(k)$ | Yes | Yes |
| DotHash | $O(d)$ floats | vector sum | sum (vector add) | Yes |

4. Privacy-Preserving and Encrypted Deduplication

In encrypted deduplication, deterministic encryption exposes chunk frequency distributions, leading to inference attacks via frequency analysis. MinHash-based encryption, applied to segments (groups of chunks), breaks this link: each segment’s ciphertext depends on the minimum fingerprint among its plaintext chunks (per MinHash), so the mapping from plaintext to ciphertext-fingerprint is many-to-one (Li et al., 2019). This “frequency coarsening” sharply reduces information leakage: inference rates drop from up to 33.6% to 0.2–0.3% (with scrambling) at the cost of a few percent higher storage overhead. Combined with segment-wise random scrambling, attacks exploiting chunk locality are thwarted even further.
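The segment-key idea can be made concrete with a schematic sketch (a hypothetical key derivation for illustration, not the exact protocol of Li et al., 2019): deriving the segment key from the minimum chunk fingerprint makes the plaintext-to-key mapping many-to-one.

```python
import hashlib

def fingerprint(chunk: bytes) -> bytes:
    """Chunk fingerprint (SHA-256 here, for illustration)."""
    return hashlib.sha256(chunk).digest()

def segment_key(chunks) -> bytes:
    """Schematic MinHash-based key derivation: the segment key depends only
    on the minimum chunk fingerprint, so many distinct segments share one
    key and per-chunk frequency information is coarsened."""
    min_fp = min(fingerprint(c) for c in chunks)
    return hashlib.sha256(b"segment-key|" + min_fp).digest()

seg = [b"chunk-a", b"chunk-b", b"chunk-c"]
variant = [b"chunk-a", b"chunk-b", b"chunk-x"]  # differs in one chunk
# Keys coincide whenever the minimal-fingerprint chunk is shared:
print(segment_key(seg) == segment_key(variant))
```

Because the key ignores every chunk except the minimum-fingerprint one, an adversary observing ciphertexts cannot recover per-chunk frequencies, which is the coarsening effect described above.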

5. Large-Scale, Real-Time, and Hardware-Accelerated MinHash Deduplication

At massive scale (hundreds of millions to billions of items), both CPU-based and traditional LSH-index-based MinHash deduplication become IO- and memory-bound. Two key developments have enabled extreme scalability:

  • LSHBloom (Khan et al., 2024): Replaces expensive per-band LSH indexes with per-band Bloom filters, while preserving candidate-generation guarantees at extremely low false-positive rates. This substantially reduces index storage and delivers a 2.5–3× runtime speedup for multi-billion-document deduplication.
  • GPU-accelerated frameworks (FED) (Son et al., 2 Jan 2025): Use rolling polynomial hashes to generate MinHash signatures efficiently on the GPU, with signature generation and post-LSH bucketed comparison fully parallelized. FED achieves large speedups over CPU baselines, deduplicating 1.2 trillion tokens in 6 hours on 16 GPUs while identifying duplicate sets nearly identical to those of classical MinHash LSH.
  • SIMD vectorization (Muniyappa et al., 20 Feb 2025): In software, AVX2/AVX-512 vectorizes MinHash sketch operations and set union/intersection, substantially speeding up the intersection step and reducing end-to-end deduplication/aggregation latency from hours to seconds.
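The rolling polynomial hashing used for shingle fingerprints can be sketched generically; this is the textbook recurrence, not FED's GPU kernels. Each n-gram hash is updated from the previous one in O(1), so all windows cost O(len(tokens)) total.

```python
def rolling_hashes(tokens, n, base=1_000_003, mod=(1 << 61) - 1):
    """Polynomial hashes of every n-gram window in O(len(tokens)) total:
    h(i+1) = (h(i) - t[i]*base^(n-1)) * base + t[i+n]   (mod mod)."""
    if len(tokens) < n:
        return []
    top = pow(base, n - 1, mod)      # weight of the outgoing token
    h = 0
    for t in tokens[:n]:             # hash of the first window
        h = (h * base + t) % mod
    out = [h]
    for i in range(len(tokens) - n):
        # drop tokens[i], shift, append tokens[i+n]
        h = ((h - tokens[i] * top) * base + tokens[i + n]) % mod
        out.append(h)
    return out

tokens = [5, 1, 4, 1, 5, 9, 2, 6]
print(rolling_hashes(tokens, 3))
```

Feeding these window hashes into the MinHash signature step avoids rehashing each n-gram from scratch, which is what makes massively parallel signature generation cheap.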

6. Extensions: Weighted, Asymmetric, and Streaming MinHash Deduplication

Standard MinHash is optimal for Jaccard similarity with unbiased estimates and collision probabilities monotonic in set resemblance. Recent advances extend MinHash deduplication in several directions:

  • Asymmetric Minwise Hashing (MH-ALSH) (Shrivastava et al., 2014): For applications requiring monotonicity in set overlap (inner product), asymmetric zero-padding of sets before and after hashing makes the collision probability monotone in the intersection size $a = |W_1 \cap W_2|$ alone, removing size bias and yielding improved recall/precision for set-containment queries.
  • Generalized/Weighted MinHash: Consistent sampling algorithms, such as the Exponential Race, extend MinHash to nonnegative or probability-weighted vectors, with a collision probability equal to the probability Jaccard similarity $J_\mathcal{P}$, provably Pareto-optimal among sampling-based LSH schemes (Moulton et al., 2018).
  • Dynamic/Adaptive Filters and Early Termination: Dynamic threshold filters, derived from binomial tails, allow a pair to be accepted or pruned as a candidate before all $k$ hash-matches have been computed, keeping the error probability below a user-specified $\epsilon$ and reducing total compute time by 2–3× without recall or precision degradation (Long et al., 2018).
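The early-termination idea can be sketched as a sequential test over signature positions. This is an illustrative filter built from the binomial tail, not Long et al.'s exact construction; the stopping rule and its parameters are assumptions for the example.

```python
from math import comb

def binom_tail_ge(n, p, m):
    """P[Binomial(n, p) >= m]; zero when m > n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(m, n + 1))

def early_decision(sig1, sig2, t, eps=1e-5):
    """Scan positions one by one. Accept as soon as the match count already
    reaches t*k; reject as soon as reaching t*k is (near) impossible even
    if remaining positions match at the boundary rate t."""
    k = len(sig1)
    need = int(t * k)                 # matches required to accept
    matches = 0
    for i, (a, b) in enumerate(zip(sig1, sig2), start=1):
        matches += (a == b)
        remaining = k - i
        if matches >= need:
            return True, i            # accept early: enough matches already
        if binom_tail_ge(remaining, t, need - matches) < eps:
            return False, i           # reject early: cannot catch up
    return matches >= need, k

# A clearly dissimilar pair is rejected long before all 128 positions:
print(early_decision([0] * 128, [1] * 128, t=0.8))
```

Identical signatures are accepted after exactly $\lfloor t k \rfloor$ positions, and dissimilar ones are pruned once the binomial tail drops below $\epsilon$, which is where the 2–3× savings come from.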

7. Practical Guidelines and Comparative Performance

Key parameter guidelines for high-recall, scalable MinHash deduplication:

  • Sketch length $k$: For typical document/image/text deduplication, a few hundred hashes suffice for high recall in sparse data (Shrivastava et al., 2014, Shenoy et al., 2017).
  • (Banding) LSH parameters: Choose band count $b$ and rows per band $r$ (with $k = b \cdot r$) so the "S-curve" turns on at the desired Jaccard threshold, roughly $t \approx (1/b)^{1/r}$; published pipelines report distinct settings for clinical text and for LLM-scale deduplication (Shenoy et al., 2017, Son et al., 2 Jan 2025, Khan et al., 2024).
  • b-bit/HyperMinHash: For memory-constrained scenarios, retaining only a few bits per hash, or using $O(\log\log n)$ bits per bucket via HyperMinHash, provides near-identical performance with several-fold smaller sketches (Yu et al., 2017).
  • Streaming/Real-time: MaxLogHash and HyperMinHash are drop-in replacements for MinHash in both streaming and distributed deployments, and support distributed merge-ability (Wang et al., 2019, Yu et al., 2017).
  • Hardware acceleration: Use SIMD, rolling hashes, and GPU kernels to achieve real-time throughput at billion-scale (Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025).
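The banding guideline can be checked numerically with a small helper; the $b = 32$, $r = 4$ values below are examples only, not recommendations from any of the cited papers.

```python
def candidate_prob(jaccard, b, r):
    """Probability that a pair with the given Jaccard similarity becomes a
    candidate under b bands of r rows: 1 - (1 - J^r)^b."""
    return 1 - (1 - jaccard ** r) ** b

def approx_threshold(b, r):
    """Jaccard value where the S-curve is steepest, roughly (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

# e.g. k = 128 hashes split as b = 32 bands of r = 4 rows:
print(round(approx_threshold(32, 4), 3))        # → 0.42
for j in (0.3, 0.5, 0.7, 0.9):
    print(j, round(candidate_prob(j, 32, 4), 3))
```

Sweeping a few $(b, r)$ splits of the same $k$ with this helper makes the recall/precision trade-off at a target threshold explicit before committing to an index layout.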

Empirically, MinHash deduplication scans 5–10× fewer candidates than SimHash at the same recall, can be greatly accelerated on GPUs, and retains sublinear computational complexity even at trillion-token scale (Shrivastava et al., 2014, Son et al., 2 Jan 2025).


MinHash deduplication remains the gold standard for scalable, high-recall set, document, and chunk-level deduplication, with a large corpus of theoretical analysis, practical scaling recipes, hardware optimizations, and empirical validations across domains (Shrivastava et al., 2014, Shenoy et al., 2017, Li et al., 2019, Khan et al., 2024, Yu et al., 2017, Li et al., 2021, Li et al., 2021, Wang et al., 2019, Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025, Shrivastava et al., 2014, Nunes et al., 2023, Moulton et al., 2018, Long et al., 2018).
