Deduplication Lemma in Source Coding
- The Deduplication Lemma rigorously bounds the expected code length of dictionary-based deduplication, a lossless compression technique.
- It analyzes the encoding process where unseen chunks incur an n-bit cost and repeated chunks use logarithmic pointer bits, establishing both finite-sample and asymptotic performance guarantees.
- The lemma highlights the convergence behavior of deduplication schemes in an i.i.d. uniform setting, showing compression within 1–2 bits of source entropy per chunk.
Deduplication Lemma
Deduplication, in the context of source coding, is a lossless compression technique for the efficient removal of long-range data duplicates. The Deduplication Lemma provides precise probabilistic and information-theoretic bounds on the expected code length incurred by classic dictionary-based deduplication. It is central to analyzing the compression efficiency and convergence behavior of deduplication schemes in the regime where data is modeled as a sequence of i.i.d. uniform chunks drawn from a finite subset of binary strings. This lemma, formalized as Theorem 2 in (Vestergaard et al., 2019), captures both finite-sample and asymptotic efficiency of deduplication protocols by relating the expected output length to the underlying source entropy and dictionary growth dynamics.
1. Source Model and Deduplication Scheme
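Deduplication operates on an alphabet of binary vectors of length $n$, denoted $\{0,1\}^n$. The chunk source $\mathcal{X} \subseteq \{0,1\}^n$, with cardinality $|\mathcal{X}|$, is modeled as a uniform i.i.d. source: a sequence of chunks $x_1, x_2, \ldots$ is drawn uniformly and independently from $\mathcal{X}$. Classic deduplication maintains a dictionary $\mathcal{D}_i$ of the distinct chunks seen up to the $i$-th position. Encoding of chunk $x_i$ proceeds as:
- Emit a 1-bit 'new-chunk' flag and transmit the $n$-bit chunk if $x_i \notin \mathcal{D}_{i-1}$, then add $x_i$ to the dictionary.
- Emit a 1-bit 'repeat' flag and a dictionary pointer of logarithmic length in bits if $x_i \in \mathcal{D}_{i-1}$.
By design, the set of bases coincides with the source set and the deviation set is trivial (it contains only the all-zero vector), so each chunk constitutes its own "base," and the chunk-to-base mapping is the identity.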
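The encoding steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the scheme as specified in the cited paper: the function names are hypothetical, chunks are represented as hashable values such as bit strings, and the pointer length `pointer_bits` is assumed to be a fixed number of bits supplied by the caller.

```python
def dedup_encode(chunks):
    """Encode chunks as (flag, payload) tokens.

    flag 0: new chunk, payload is the chunk itself.
    flag 1: repeat, payload is the index of the chunk in the dictionary.
    """
    index_of = {}  # dictionary: chunk -> position of first appearance
    tokens = []
    for chunk in chunks:
        if chunk in index_of:
            tokens.append((1, index_of[chunk]))
        else:
            index_of[chunk] = len(index_of)
            tokens.append((0, chunk))
    return tokens


def dedup_decode(tokens):
    """Rebuild the chunk sequence; the dictionary is relearned on the fly."""
    dictionary = []
    chunks = []
    for flag, payload in tokens:
        if flag == 0:
            dictionary.append(payload)
            chunks.append(payload)
        else:
            chunks.append(dictionary[payload])
    return chunks


def code_length_bits(tokens, n, pointer_bits):
    """Total bits: 1 flag bit per chunk, plus n bits (new) or pointer_bits (repeat)."""
    return sum(1 + (pointer_bits if flag else n) for flag, _ in tokens)


if __name__ == "__main__":
    data = ["0000000", "1111111", "0000000"]
    encoded = dedup_encode(data)
    print(encoded)
    print(code_length_bits(encoded, n=7, pointer_bits=3))
```

The decoder never receives the dictionary explicitly. It rebuilds it from the new-chunk payloads in order of arrival, which is why the scheme stays lossless.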
2. Statement of the Deduplication Lemma
Let denote the expected coded length after processing chunks and . Under the above model:
Lower Bound:
Explicitly,
Upper Bound:
This lemma quantifies the cost per chunk (including flag bits, new chunk payloads, and dictionary pointer bits) as a function of chunk index, alphabet size, and dictionary growth.
3. Probabilistic and Information-Theoretic Analysis
Expectation over the one-step coding contribution is computed as: where . Probabilities are given by: These combine to yield tight bounds via sandwiching appropriately and summing over all .
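The per-chunk expectation can be made concrete with a short worked derivation. The notation here is an assumption introduced for illustration and is not taken from the cited paper: $M$ is the source cardinality, $q = 1 - 1/M$, $b_i$ is the number of bits spent on the $i$-th chunk, and $\ell$ is a constant pointer length in bits. The constant $\ell$ is a simplification; a pointer whose length grows with the dictionary would change the repeat term.

```latex
% Chunk i is new exactly when none of the previous i-1 uniform draws equals it.
\Pr[\text{chunk } i \text{ is new}] = q^{\,i-1}, \qquad q = 1 - \tfrac{1}{M}

% One flag bit, plus n bits if new or \ell bits if repeated.
\mathbb{E}[b_i] = 1 + n\,q^{\,i-1} + \ell\,\bigl(1 - q^{\,i-1}\bigr)

% Summing the geometric series over i = 1, \dots, C,
% with \sum_{i=1}^{C} q^{\,i-1} = M\,(1 - q^{C}):
\mathbb{E}\Bigl[\sum_{i=1}^{C} b_i\Bigr] = C\,(1 + \ell) + (n - \ell)\,M\,\bigl(1 - q^{C}\bigr)
```

The factor $M(1 - q^{C})$ is the expected number of distinct chunks after $C$ draws, so the second term is the one-time cost of learning the dictionary.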
4. Asymptotic Properties and Convergence
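As the number of processed chunks $C \to \infty$, the expected cost per chunk lies between $\log_2 |\mathcal{X}| + 1$ and $\log_2 |\mathcal{X}| + 2$ bits, where $\log_2 |\mathcal{X}|$ is the entropy of the uniform source with cardinality $|\mathcal{X}|$. Thus, classic deduplication is asymptotically within 2 bits per chunk of the source entropy, with a minimum gap of 1 bit. The convergence is geometric in the number of chunks, reflecting the exponential decay of unseen-chunk probabilities in the i.i.d. regime.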
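To get a feel for the geometric convergence, the sketch below computes the expected cost of the i-th chunk and counts how many chunks pass before that cost is within a tolerance of its limit. It rests on assumptions not stated in the text: a fixed pointer length `pointer_bits`, a source cardinality `M`, and hypothetical function names.

```python
def expected_chunk_bits(i, n, M, pointer_bits):
    """Expected bits for the i-th chunk (1-indexed) under a uniform i.i.d. source."""
    q = 1.0 - 1.0 / M
    p_new = q ** (i - 1)  # probability that chunk i has not been seen before
    return 1 + p_new * n + (1 - p_new) * pointer_bits


def chunks_until_within(eps, n, M, pointer_bits):
    """Smallest i whose expected cost exceeds the limit 1 + pointer_bits by at most eps."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    q = 1.0 - 1.0 / M
    excess = float(n - pointer_bits)  # excess over the limit at i = 1
    i = 1
    while excess > eps:
        excess *= q  # the excess shrinks by a factor q with every chunk
        i += 1
    return i


if __name__ == "__main__":
    # Illustrative values: 16 source chunks of 7 bits, 4-bit pointers.
    n, M, pointer_bits = 7, 16, 4
    for i in (1, 10, 50, 100):
        print(i, expected_chunk_bits(i, n, M, pointer_bits))
    print(chunks_until_within(0.01, n, M, pointer_bits))
```

With these illustrative values the source entropy is 4 bits and the limit is 5 bits per chunk, so the asymptotic gap is exactly the 1-bit flag.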
5. Numerical Illustration and Empirical Validation
A concrete example uses the (7,4) Hamming code:
- Chunk length $n = 7$ bits.
- $2^4 = 16$ Hamming codewords.
- Select codewords, and set binary vectors of weight , so .
- Simulations for chunks confirm that and quickly drop from the naïve bits/chunk toward the asymptotic bits/chunk.
- The derived bounds closely track observed behavior throughout.
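A simulation in the same spirit can be sketched as follows. It does not reproduce the experiment from the cited paper: the source cardinality, pointer length, trial count, and function names are all illustrative assumptions, and chunks are represented by integer labels `0..M-1` instead of actual bit vectors. The closed form used for comparison assumes a fixed pointer length.

```python
import random


def dedup_total_bits(chunks, n, pointer_bits):
    """Bits spent by classic deduplication on one chunk sequence."""
    seen = set()
    total = 0
    for chunk in chunks:
        total += 1  # flag bit
        if chunk in seen:
            total += pointer_bits
        else:
            total += n
            seen.add(chunk)
    return total


def expected_total_bits(C, n, M, pointer_bits):
    """Closed-form expectation for C uniform i.i.d. chunks (fixed pointer length)."""
    q = 1.0 - 1.0 / M
    expected_distinct = M * (1.0 - q ** C)
    return C * (1 + pointer_bits) + (n - pointer_bits) * expected_distinct


def simulate_mean_bits(C, n, M, pointer_bits, trials, seed=0):
    """Monte Carlo estimate of the expected total code length."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        chunks = [rng.randrange(M) for _ in range(C)]
        total += dedup_total_bits(chunks, n, pointer_bits)
    return total / trials


if __name__ == "__main__":
    n, M, pointer_bits = 7, 32, 5  # illustrative values only
    for C in (10, 50, 200, 1000):
        sim = simulate_mean_bits(C, n, M, pointer_bits, trials=500)
        exact = expected_total_bits(C, n, M, pointer_bits)
        print(C, sim / C, exact / C)
```

Dividing by `C` gives bits per chunk, which starts near `n + 1` and drifts toward `1 + pointer_bits` as the dictionary fills.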
6. Significance and Extensions
The Deduplication Lemma rigorously substantiates the compression efficiency and dictionary learning dynamics of classic deduplication. Restricting the deviation set to the all-zero vector yields standard deduplication, and the lemma's bounds apply tightly to its finite-chunk regime. Extensions in (Vestergaard et al., 2019) generalize the analysis to cases where deviations are permitted (i.e., the deviation set is nontrivial and several chunks share one base), allowing deduplication at the granularity of "bases" with deviation coding. This yields comparable bounds with significantly accelerated convergence due to the smaller effective base dictionary, a key property for practical deduplication performance.
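The idea of splitting a chunk into a base and a deviation can be illustrated with the (7,4) Hamming code. The sketch below assumes that bases are Hamming codewords and deviations are 7-bit vectors of weight at most one. That pairing is suggested by the numerical illustration, but the exact construction in the cited paper may differ, and the function names are hypothetical. Chunks are 7-bit integers, with code position `p` (1 to 7) stored in bit `p - 1`.

```python
def split_chunk(chunk):
    """Split a 7-bit chunk into (base, deviation) by syndrome decoding.

    With parity-check columns equal to the binary representations of 1..7,
    the syndrome is the XOR of the positions of all set bits. A nonzero
    syndrome names the single position to flip to reach a codeword.
    """
    syndrome = 0
    for pos in range(1, 8):
        if (chunk >> (pos - 1)) & 1:
            syndrome ^= pos
    deviation = 0 if syndrome == 0 else 1 << (syndrome - 1)
    base = chunk ^ deviation
    return base, deviation


def join_chunk(base, deviation):
    """Inverse of split_chunk: reapply the deviation to the base."""
    return base ^ deviation


if __name__ == "__main__":
    bases = {split_chunk(c)[0] for c in range(128)}
    print(len(bases))  # number of distinct bases over all 7-bit chunks
    print(split_chunk(0b0000011))
```

Because the Hamming code is perfect, every 7-bit chunk lies within distance one of exactly one codeword, so a dictionary of bases needs at most 16 entries instead of 128. This is the mechanism behind the faster dictionary convergence described above.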
7. Related Models and Limitations
This analysis assumes truly i.i.d. chunk sources and uniformity over the source set; departures from these assumptions (correlated sources, variable-length chunks, or non-uniform distributions) are not directly addressed. The extension to generalized deduplication with a nontrivial deviation set enables efficient handling of similar (not strictly identical) data, which is crucial for high-compression scenarios in enterprise and cloud storage settings (Vestergaard et al., 2019). The theoretical framework bridges information theory and practical coding, providing order-optimality guarantees for well-specified source models.