MinHash Deduplication Overview
- MinHash Deduplication is a randomized, locality-sensitive hashing approach that estimates Jaccard similarity via MinHash sketches for scalable near-duplicate detection.
- It employs shingling, efficient signature computation, and banding techniques to sharply threshold candidate pairs in sublinear time and space.
- Advanced variants and hardware optimizations like one-/two-permutation methods, GPU acceleration, and privacy-preserving extensions enhance real-time, large-scale deduplication.
MinHash Deduplication is a class of randomized algorithms and data structures for identifying near-duplicate objects in massive datasets, based fundamentally on the Minwise Hashing (MinHash) locality-sensitive hashing (LSH) scheme. The central property of MinHash is that, for sets or binary vectors, the collision probability of their MinHash sketches equals their resemblance (Jaccard similarity), enabling scalable approximate deduplication and duplicate clustering with sublinear time and space overhead. Continued advances—including streaming variants, asymmetric and weighted schemes, memory-compressed sketches, and extreme-scale distributed pipelines—make MinHash a canonical tool not only for set deduplication but also for streaming, privacy-preserving, and real-time deduplication tasks.
1. Theoretical Foundations of MinHash Deduplication
MinHash operates on sets or induced binary vectors. For a set $S \subseteq \Omega$ and a fixed random permutation $\pi$ over $\Omega$, the MinHash value is $h_\pi(S) = \min_{x \in S} \pi(x)$. Extending to $K$ independent permutations ($K$ "hashes"), the MinHash signature is $\mathrm{sig}(S) = (h_{\pi_1}(S), \dots, h_{\pi_K}(S))$.
The defining property—Broder's Theorem—is $\Pr[h_\pi(S_1) = h_\pi(S_2)] = J(S_1, S_2)$, where $J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$ is the Jaccard/resemblance similarity. The unbiased estimator $\hat{J} = \frac{1}{K}\sum_{k=1}^{K} \mathbf{1}\{h_{\pi_k}(S_1) = h_{\pi_k}(S_2)\}$ (fraction of matching signature positions) converges as $K$ grows, with variance $J(1-J)/K$ (Shrivastava et al., 2014).
MinHash is thus a locality-sensitive hash (LSH) for the Jaccard similarity and can also serve as a sensitive LSH for cosine similarity via the bounds $\frac{S^2}{2 - S^2} \le R \le \frac{S}{2 - S}$, where $S$ is cosine similarity and $R$ the resemblance (Shrivastava et al., 2014). This allows adjustment of banding parameters in deduplication LSH for non-binary (cosine) tasks as well.
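Broder's collision property and the signature estimator can be illustrated directly (a toy sketch with explicit permutations over a small universe; practical systems replace the permutations with universal hash functions):

```python
import random

def minhash_signature(s, perms):
    """K-position signature: for each permutation, the minimum rank over the set."""
    return [min(p[x] for x in s) for p in perms]

def estimate_jaccard(sig1, sig2):
    """Unbiased estimator: fraction of matching signature positions."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

rng = random.Random(0)
universe = list(range(1000))
K = 256
perms = []
for _ in range(K):
    order = universe[:]
    rng.shuffle(order)
    perms.append({x: rank for rank, x in enumerate(order)})

A = set(range(0, 600))
B = set(range(200, 800))
true_j = len(A & B) / len(A | B)          # 400 / 800 = 0.5
est = estimate_jaccard(minhash_signature(A, perms),
                       minhash_signature(B, perms))
print(f"true J = {true_j}, estimate = {est:.3f}")
```

With $K = 256$ the estimator's standard deviation is $\sqrt{J(1-J)/K} \approx 0.03$, so the estimate lands close to the true Jaccard.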
2. MinHash Deduplication Pipelines and Scalability
A canonical deduplication pipeline using MinHash involves:
- Shingling: Represent documents/items via $k$-gram shingles, typically yielding sparse sets.
- MinHash Signature Computation: For each item, compute a $K$-dimensional MinHash sketch.
- LSH Banding: Partition the $K$-vector into $b$ bands of $r$ rows ($K = b \cdot r$). Within each band, items with identical sub-sketches are "candidate" pairs.
- Candidate Filtering: Only candidate pairs have their full Jaccard (or refined) similarity thresholded.
- Clustering: Disjoint-set forest clustering (union-find) may be used to aggregate duplicate clusters efficiently (Shenoy et al., 2017).
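The five stages above can be sketched end to end (a toy illustration; salted `blake2b` hashes stand in for independent permutations, and the documents are hypothetical):

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, k=4):
    """Character k-gram shingle set (word n-grams are equally common)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def signature(shingle_set, K=32):
    """K-dimensional MinHash sketch via K independently salted hash functions."""
    sig = []
    for seed in range(K):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8,
                                           salt=salt).digest(), "big")
            for s in shingle_set))
    return sig

def dedup_clusters(docs, K=32, b=16, threshold=0.7):
    r = K // b                              # rows per band
    sets = [shingles(d) for d in docs]
    sigs = [signature(s, K) for s in sets]
    # LSH banding: bucket items by each band's sub-sketch.
    buckets = defaultdict(list)
    for i, sig in enumerate(sigs):
        for band in range(b):
            buckets[(band, tuple(sig[band * r:(band + 1) * r]))].append(i)
    # Candidate filtering (verify true Jaccard) + union-find clustering.
    parent = list(range(len(docs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for members in buckets.values():
        for i, j in combinations(members, 2):
            inter = len(sets[i] & sets[j])
            union = len(sets[i] | sets[j])
            if union and inter / union >= threshold:
                parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for i in range(len(docs)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())

docs = ["the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumped over the lazy dog",
        "an entirely unrelated piece of text"]
print(dedup_clusters(docs))
```

The first two documents share most of their 4-gram shingles, so they collide in at least one band with near certainty and survive the verification step; the third remains a singleton cluster.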
The collision probability after banding is $P(\text{candidate}) = 1 - (1 - J^r)^b$ for true Jaccard $J$ (Shenoy et al., 2017, Khan et al., 2024). This yields an "S-curve" in candidate probability, allowing sharp thresholding.
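The S-curve is easy to inspect numerically (a short sketch; the band parameters are illustrative):

```python
def candidate_prob(j, b, r):
    """Probability that a pair with true Jaccard j collides in at least
    one of b bands of r rows: 1 - (1 - j**r)**b."""
    return 1 - (1 - j ** r) ** b

# 20 bands of 5 rows place the transition near (1/20)**(1/5) ~= 0.55:
# pairs well below that threshold almost never become candidates,
# pairs well above it almost always do.
for j in (0.3, 0.5, 0.7, 0.9):
    print(f"J={j}: P(candidate)={candidate_prob(j, 20, 5):.4f}")
```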
Highly scalable implementations use wide-column distributed stores (e.g., Cassandra), SIMD vectorization for signature operations, and GPU-accelerated pipelines for both MinHash computation and post-banding candidate comparison (Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025, Shenoy et al., 2017). LSH band tables may be stored as flat on-disk or in-memory arrays for high throughput.
3. Memory and Computation-Efficient MinHash Variants
Classical MinHash uses 32–64 bits per hash value, with $K$ in the 100–1024 range. Advanced variants address memory and computational bottlenecks:
- One-/Two-Permutation MinHash: C-MinHash reduces the required number of random permutations from $K$ to 2, or even to 1 (with negligible bias), via circulant shifting, yielding unbiased or nearly unbiased estimates with strictly smaller variance than classical MinHash (Li et al., 2021, Li et al., 2021). This shrinks permutation storage from $K$ stored permutations to one or two, with a corresponding reduction in per-vector hashing cost for sparse inputs.
- b-bit MinHash & HyperMinHash: Storage is compressed by retaining only the lowest $b$ bits of each hash value, or by encoding the minimal hash in floating-point (LogLog) style. HyperMinHash attains $O(\log\log n)$ space per bucket, remains mergeable/streamable, and allows Jaccard estimation for sets up to cardinality $n$ within tight error budgets (Yu et al., 2017).
- MaxLogHash: Designed for streaming/high-similarity cases, MaxLogHash stores in each register the maximum of $\lfloor -\log_2 h(e) \rfloor$ across stream elements $e$; it uses just 6–7 bits per register and is unbiased for high similarity (Wang et al., 2019). MaxLogHash supports one-pass streaming updates and distributed merging.
- DotHash: For weighted or IDF-based deduplication, DotHash sketches sets as sums of random sign-embedding vectors, each scaled by the IDF weight of its token $t$; the dot product of two sketches yields an unbiased estimator for the weighted intersection and, by extension, the Jaccard index. DotHash is especially useful for hybrid set/weighted deduplication, outperforming MinHash and SimHash in accuracy for IDF-based retrieval (Nunes et al., 2023).
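The circulant idea behind C-MinHash can be sketched as follows (a simplified illustration of the two-permutation scheme C-MinHash$(\sigma,\pi)$: an initial permutation $\sigma$ breaks structure, then all $K$ hashes reuse circular shifts of one stored permutation $\pi$):

```python
import random

def c_minhash(s, sigma, pi, K):
    """K hash values from ONE stored permutation pi: the k-th hash reads pi
    through a circular shift of the sigma-permuted element indices."""
    D = len(pi)
    return [min(pi[(sigma[x] + k) % D] for x in s) for k in range(K)]

def estimate(sig1, sig2):
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

rng = random.Random(1)
D, K = 2000, 256
sigma = list(range(D)); rng.shuffle(sigma)
pi = list(range(D)); rng.shuffle(pi)

A = set(range(0, 300))
B = set(range(100, 400))
true_j = len(A & B) / len(A | B)          # 200 / 400 = 0.5
est = estimate(c_minhash(A, sigma, pi, K), c_minhash(B, sigma, pi, K))
print(true_j, round(est, 3))
```

Only two permutations are stored regardless of $K$, yet each signature position still has collision probability exactly $J$, so the usual matching-fraction estimator applies unchanged.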
A summary of memory and computational characteristics:
| Method | Per-sketch space | Per-item update time | Union/Merge | Streaming support |
|---|---|---|---|---|
| MinHash | $K \times$ 32–64 bits | $O(K)$ | element-wise min | Yes |
| b-bit MinHash | $K \cdot b$ bits | $O(K)$ | No | Yes |
| HyperMinHash | $O(K \log\log n)$ bits | $O(K)$ | register-wise max | Yes |
| MaxLogHash | $K \times$ 6–7 bits | $O(K)$ | register-wise max | Yes |
| DotHash | $d$ floats | $O(d)$ | vector sum | Yes |
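Of the variants above, DotHash is the simplest to illustrate: each token gets a fixed random sign vector, a set's sketch is the sum of its tokens' vectors, and the normalized dot product of two sketches recovers the intersection size. A minimal unweighted sketch (dimension `d` and token names are illustrative; the weighted version additionally scales each embedding by the token's IDF):

```python
import random

def embedding(token, d):
    """Deterministic random +/-1 vector for a token, seeded by the token itself."""
    rng = random.Random(token)
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]

def dothash_sketch(tokens, d=4096):
    sketch = [0.0] * d
    for t in tokens:
        for i, v in enumerate(embedding(t, d)):
            sketch[i] += v
    return sketch

def est_intersection(s1, s2, d=4096):
    """E[<s1, s2>] = d * |A intersect B| for +/-1 embeddings, so divide by d."""
    return sum(a * b for a, b in zip(s1, s2)) / d

A = {f"tok{i}" for i in range(100)}
B = {f"tok{i}" for i in range(50, 150)}
a_hat = est_intersection(dothash_sketch(A), dothash_sketch(B))
j_hat = a_hat / (len(A) + len(B) - a_hat)   # Jaccard via inclusion-exclusion
print(round(a_hat, 1), round(j_hat, 3))     # true values: a = 50, J = 1/3
```

Because sketches merge by plain vector addition, unions are trivial, and the same machinery extends to weighted intersections by scaling each token's embedding.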
4. Privacy-Preserving and Encrypted Deduplication
In encrypted deduplication, deterministic encryption exposes chunk frequency distributions, leading to inference attacks via frequency analysis. MinHash-based encryption, applied to segments (groups of chunks), breaks this link: each segment’s ciphertext depends on the minimum fingerprint among its plaintext chunks (per MinHash), so the mapping from plaintext to ciphertext-fingerprint is many-to-one (Li et al., 2019). This “frequency coarsening” sharply reduces information leakage: inference rates drop from up to 33.6% to 0.2–0.3% (with scrambling) at the cost of a few percent higher storage overhead. Combined with segment-wise random scrambling, attacks exploiting chunk locality are thwarted even further.
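The key-derivation step can be sketched as follows (a minimal illustration with assumed primitives: SHA-256 chunk fingerprints and a keyed BLAKE2b derivation; the cited design additionally uses a key server and segment-wise scrambling):

```python
import hashlib

def fingerprint(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def segment_key(chunks, global_secret: bytes) -> bytes:
    """Key the segment by its MINIMUM chunk fingerprint (its MinHash):
    many distinct segments share a minimum, so the plaintext-to-key
    mapping is many-to-one, coarsening the frequency distribution an
    adversary could otherwise exploit."""
    min_fp = min(fingerprint(c) for c in chunks)
    return hashlib.blake2b(min_fp, key=global_secret, digest_size=32).digest()

secret = b"server-side secret"
seg = [b"chunk-a", b"chunk-b", b"chunk-c"]
# The key depends only on the minimum fingerprint, not on chunk order,
# so similar segments encrypt under the same key and still deduplicate.
print(segment_key(seg, secret).hex())
```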
5. Large-Scale, Real-Time, and Hardware-Accelerated MinHash Deduplication
At massive scale (hundreds of millions to billions of items), both CPU-based and traditional LSH-index-based MinHash deduplication become IO- and memory-bound. Two key developments have enabled extreme scalability:
- LSHBloom (Khan et al., 2024): Replaces expensive per-band LSH indexes with per-band Bloom filters, while preserving candidate-generation guarantees and keeping false-positive rates extremely low. This yields large index-storage savings and a 2.5–3× runtime speedup for multi-billion-document deduplication.
- GPU-accelerated frameworks (FED) (Son et al., 2 Jan 2025): Use rolling polynomial hashes to generate MinHash signatures highly efficiently on the GPU, with signature generation and post-LSH bucketed comparison fully parallelized. FED achieves large speedups over CPU baselines, deduplicating 1.2 trillion tokens in 6 hours on 16 GPUs while identifying nearly the same duplicate sets as classical MinHash LSH.
- SIMD vectorization (Muniyappa et al., 20 Feb 2025): In software, AVX2/AVX-512 vectorizes MinHash sketch operations and set union/intersection, yielding substantial speedups in the intersection step and reducing end-to-end deduplication/aggregation latency from hours to seconds.
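The rolling-hash trick referenced for FED can be sketched generically (a standard polynomial rolling hash over token IDs; the base and modulus here are illustrative choices, not the paper's):

```python
def rolling_ngram_hashes(tokens, n, base=1_000_003, mod=(1 << 61) - 1):
    """Polynomial hashes of every token n-gram in O(len(tokens)) total work:
    each step drops the outgoing token's term and appends the incoming one,
    instead of rehashing each n-gram from scratch."""
    if len(tokens) < n:
        return []
    top = pow(base, n - 1, mod)      # weight of the outgoing token
    h = 0
    for t in tokens[:n]:
        h = (h * base + t) % mod
    out = [h]
    for i in range(n, len(tokens)):
        h = ((h - tokens[i - n] * top) * base + tokens[i]) % mod
        out.append(h)
    return out

toks = [3, 1, 4, 1, 5, 9, 2, 6]
print(rolling_ngram_hashes(toks, 3))
```

Each windowed hash then feeds the per-position min-reductions of the MinHash signature, which is what makes the sliding-window formulation so GPU-friendly.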
6. Extensions: Weighted, Asymmetric, and Streaming MinHash Deduplication
Standard MinHash is optimal for Jaccard similarity with unbiased estimates and collision probabilities monotonic in set resemblance. Recent advances extend MinHash deduplication in several directions:
- Asymmetric Minwise Hashing (MH-ALSH) (Shrivastava et al., 2014): For applications requiring monotonicity in set overlap (inner product), asymmetric zero-padding of sets before and after hashing makes the collision probability exactly proportional to the intersection size $a = |S_1 \cap S_2|$ rather than the resemblance, removing size bias and yielding improved recall/precision for set-containment queries.
- Generalized/Weighted MinHash: Applying consistent sampling algorithms, such as the Exponential Race, extends MinHash to nonnegative or probability-weighted vectors, with a collision probability provably Pareto-optimal among sampling-based LSH schemes (Moulton et al., 2018).
- Dynamic/Adaptive Filters and Early Termination: Dynamic threshold filters, derived from binomial tail bounds, allow early acceptance or pruning of pairs as candidates before all hash matches have been computed, maintaining the error probability below a user-specified bound and reducing total compute time by 2–3× without recall or precision degradation (Long et al., 2018).
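The early-termination idea can be sketched in a deterministic simplification (stopping only when the verdict is arithmetically decided, rather than using the paper's probabilistic binomial-tail filters):

```python
import math

def early_decision(sig1, sig2, t):
    """Scan signature positions in order and stop as soon as the final
    outcome (match fraction >= t) can no longer change.
    Returns (decision, positions_examined)."""
    K = len(sig1)
    need = math.ceil(t * K)              # matches required to pass
    matches = 0
    for m, (a, b) in enumerate(zip(sig1, sig2), start=1):
        matches += (a == b)
        if matches >= need:              # enough matches already: accept
            return True, m
        if matches + (K - m) < need:     # even a perfect tail cannot reach: reject
            return False, m
    return False, K

sig = list(range(100))
print(early_decision(sig, sig, 0.5))                      # accepts halfway
print(early_decision(sig, [x + 1000 for x in sig], 0.5))  # rejects just past halfway
```

On near-duplicates and clear non-duplicates alike, roughly half the signature positions are skipped here; the binomial-tail version of Long et al. prunes even earlier by tolerating a small, controlled error probability.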
7. Practical Guidelines and Comparative Performance
Key parameter guidelines for high-recall, scalable MinHash deduplication:
- Sketch length $K$: For typical document/image/text deduplication, $K = 100$–$200$ suffices for high recall in sparse data (Shrivastava et al., 2014, Shenoy et al., 2017).
- (Banding) LSH parameters: Choose band count $b$ and rows per band $r$ to place the "S-curve" transition at the desired Jaccard threshold $t \approx (1/b)^{1/r}$; reported deployments range from modest signatures for clinical text to larger ones for LLM-scale deduplication (Shenoy et al., 2017, Son et al., 2 Jan 2025, Khan et al., 2024).
- b-bit/HyperMinHash: For memory-constrained scenarios, retaining $b = 4$ or $8$ bits per hash, or using $O(\log\log n)$ bits per bucket via HyperMinHash, provides near-identical performance with $4\times$ or greater reductions in sketch size (Yu et al., 2017).
- Streaming/Real-time: MaxLogHash and HyperMinHash are drop-in replacements for MinHash in both streaming and distributed deployments, and support distributed merge-ability (Wang et al., 2019, Yu et al., 2017).
- Hardware acceleration: Use SIMD, rolling hashes, and GPU kernels to achieve real-time throughput at billion-scale (Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025).
Empirically, MinHash deduplication scans 5–10× fewer candidates for the same recall compared to SimHash, can be accelerated on GPUs, and offers sublinear computational complexity even for trillion-scale deduplication (Shrivastava et al., 2014, Son et al., 2 Jan 2025).
MinHash deduplication remains the gold standard for scalable, high-recall set, document, and chunk-level deduplication, with a large corpus of theoretical analysis, practical scaling recipes, hardware optimizations, and empirical validations across domains (Shrivastava et al., 2014, Shenoy et al., 2017, Li et al., 2019, Khan et al., 2024, Yu et al., 2017, Li et al., 2021, Li et al., 2021, Wang et al., 2019, Son et al., 2 Jan 2025, Muniyappa et al., 20 Feb 2025, Shrivastava et al., 2014, Nunes et al., 2023, Moulton et al., 2018, Long et al., 2018).