Deduplication Lemma in Source Coding

Updated 24 March 2026

The Deduplication Lemma rigorously bounds the expected code length of classic dictionary-based deduplication, a lossless compression technique.
It analyzes the encoding process where unseen chunks incur an n-bit cost and repeated chunks use logarithmic pointer bits, establishing both finite-sample and asymptotic performance guarantees.
The lemma highlights the convergence behavior of deduplication schemes in an i.i.d. uniform setting, showing compression within 1–2 bits of source entropy per chunk.

Deduplication Lemma

Deduplication, in the context of source coding, is a lossless compression technique for the efficient removal of long-range data duplicates. The Deduplication Lemma provides precise probabilistic and information-theoretic bounds on the expected code length incurred by classic dictionary-based deduplication. It is central to analyzing the compression efficiency and convergence behavior of deduplication schemes in the regime where data is modeled as a sequence of i.i.d. uniform chunks drawn from a finite subset of binary strings. This lemma, formalized as Theorem 2 in (Vestergaard et al., 2019), captures both finite-sample and asymptotic efficiency of deduplication protocols by relating the expected output length to the underlying source entropy and dictionary growth dynamics.

1. Source Model and Deduplication Scheme

Deduplication operates on an alphabet of binary vectors of length $n$, denoted $\mathbb{Z}_2^n$. The chunk source $\mathcal{Z} \subset \mathbb{Z}_2^n$, with cardinality $|\mathcal{Z}|$, is modeled as a uniform i.i.d. source: a sequence of $C$ chunks $z_1, \dots, z_C$ is drawn uniformly and independently from $\mathcal{Z}$. Classic deduplication maintains a dictionary $\mathcal{D}^{c-1}$ of the distinct chunks seen in the first $c-1$ positions. Encoding $z_c$ proceeds as follows:

If $z_c \notin \mathcal{D}^{c-1}$, emit a 1-bit 'new-chunk' flag, transmit the $n$-bit chunk, and add $z_c$ to the dictionary.
If $z_c \in \mathcal{D}^{c-1}$, emit a 1-bit 'repeat' flag and a pointer of length $\lceil \log |\mathcal{D}^{c-1}| \rceil$ bits.
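The two encoding rules can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from (Vestergaard et al., 2019): the function name `dedup_encode`, the token format, and the choice to represent chunks as bit strings are all assumptions made here, and the pointer is taken to be the index of the chunk in order of first appearance.

```python
def dedup_encode(chunks, n):
    """Classic deduplication sketch: return (total bits, list of tokens).

    `chunks` is an iterable of hashable n-bit chunks (e.g. bit strings).
    The token format is hypothetical and only meant for illustration.
    """
    dictionary = {}   # chunk -> index, in order of first appearance
    tokens = []
    total_bits = 0
    for z in chunks:
        if z in dictionary:
            # Repeat: 1 flag bit + ceil(log2 |D|) pointer bits.
            # (d - 1).bit_length() equals ceil(log2 d) for every d >= 1.
            pointer_bits = (len(dictionary) - 1).bit_length()
            tokens.append(("repeat", dictionary[z]))
            total_bits += 1 + pointer_bits
        else:
            # New chunk: 1 flag bit + the n-bit chunk itself.
            tokens.append(("new", z))
            total_bits += 1 + n
            dictionary[z] = len(dictionary)
    return total_bits, tokens
```

Note that with a single dictionary entry the pointer costs zero bits, since $\lceil \log 1 \rceil = 0$; the flag alone identifies the chunk.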

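In the notation of generalized deduplication, where each chunk is split into a base from $\mathcal{X}$ and a deviation from $\mathcal{Y}$, classic deduplication corresponds to $\mathcal{X} = \mathcal{Z}$ and the deviation set $\mathcal{Y} = \{0^n\}$: each chunk constitutes its own "base," and the mapping $\varphi$ is the identity.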

2. Statement of the Deduplication Lemma

Let $R_D(C)$ denote the expected coded length, in bits, after processing $C$ chunks, and let $\Delta R_D(C) = R_D(C) - R_D(C-1)$ be the expected cost of the $C$-th chunk. Logarithms are base 2. Under the above model:

Lower Bound:

$R_D(C) \geq \sum_{c=1}^C \left[ 1 + n \cdot \Pr\left(z_c \notin \mathcal{D}^{c-1}\right) + \Pr\left(z_c \in \mathcal{D}^{c-1}\right) \log \left(|\mathcal{D}^{c-1}|\right) \right]$

Explicitly,

$R_D(C) \geq C + \sum_{c=1}^C \left[ n (1 - |\mathcal{Z}|^{-1})^{c-1} + \left(1 - (1 - |\mathcal{Z}|^{-1})^{c-1}\right)\log\left(|\mathcal{Z}|\left[1 - (1 - |\mathcal{Z}|^{-1})^{c-1}\right]\right) \right]$

Upper Bound:

$R_D(C) \leq 2C + \sum_{c=1}^C \left[ n (1 - |\mathcal{Z}|^{-1})^{c-1} + |\mathcal{Z}|^{-1}\min\{ (c-1)\log(c-1), |\mathcal{Z}|\log|\mathcal{Z}| \} \right]$

This lemma quantifies the cost per chunk (including flag bits, new chunk payloads, and dictionary pointer bits) as a function of chunk index, alphabet size, and dictionary growth.
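To get a feel for the two bounds, they can be evaluated numerically. The sketch below is an illustration under stated assumptions: the name `dedup_bounds` is hypothetical, logarithms are base 2, $0 \log 0$ is read as $0$, and the upper bound is evaluated with 2 bits of overhead per chunk (flag plus pointer rounding), which is the reading consistent with an $H(\mathcal{Z}) + 2$ asymptote.

```python
from math import log2

def dedup_bounds(C, n, Z):
    """Evaluate the explicit lower and upper bounds on R_D(C), with Z = |Z|."""
    lower = 0.0
    upper = 0.0
    for c in range(1, C + 1):
        p_new = (1 - 1 / Z) ** (c - 1)   # Pr(z_c not in D^{c-1})
        p_rep = 1 - p_new                # Pr(z_c in D^{c-1})

        # Lower-bound term: flag + payload + p_rep * log2(Z * p_rep).
        pointer_lo = p_rep * log2(Z * p_rep) if p_rep > 0 else 0.0
        lower += 1 + n * p_new + pointer_lo

        # Upper-bound term: 2 overhead bits + payload + capped pointer cost.
        k = c - 1
        growth = k * log2(k) if k > 0 else 0.0   # 0 log 0 taken as 0
        upper += 2 + n * p_new + min(growth, Z * log2(Z)) / Z
    return lower, upper
```

For example, `dedup_bounds(1, 7, 16)` returns `(8.0, 9.0)`: the first chunk is always new and costs $n + 1 = 8$ bits, which sits at the lower bound.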

3. Probabilistic and Information-Theoretic Analysis

The expected one-step coding contribution is

$\mathbb{E}[\text{length}] = 1 + \mathbb{E}\big[ \mathbb{I}\{z_c \notin \mathcal{D}^{c-1}\} \cdot n + \mathbb{I}\{z_c \in \mathcal{D}^{c-1}\} \cdot \ell(\mathcal{D}^{c-1}) \big]$

where $\ell(\mathcal{D}^{c-1}) = \lceil \log |\mathcal{D}^{c-1}| \rceil$. The probabilities are given by

$\Pr\left(z_c \notin \mathcal{D}^{c-1}\right) = (1 - 1/|\mathcal{Z}|)^{c-1}, \quad \Pr\left(z_c \in \mathcal{D}^{c-1}\right) = 1 - (1 - 1/|\mathcal{Z}|)^{c-1}$

These combine to yield tight bounds via sandwiching $\ell(\mathcal{D}^{c-1})$ appropriately and summing over all $c$.
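One way to see where the two explicit bounds come from is sketched below. This is a reconstruction from the definitions above rather than a quotation of the proof in (Vestergaard et al., 2019); the shorthand $d = |\mathcal{D}^{c-1}|$ and $p_c = 1 - (1 - |\mathcal{Z}|^{-1})^{c-1}$ is introduced here only for brevity. Conditioned on $d$, the chunk $z_c$ is a repeat with probability $d/|\mathcal{Z}|$, so the expected pointer cost is

$\mathbb{E}\big[\mathbb{I}\{z_c \in \mathcal{D}^{c-1}\}\,\ell(\mathcal{D}^{c-1})\big] = \mathbb{E}\left[\frac{d}{|\mathcal{Z}|}\lceil \log d \rceil\right], \qquad \log d \leq \lceil \log d \rceil \leq \log d + 1$

For the lower bound, $x \log x$ is convex and $\mathbb{E}[d] = |\mathcal{Z}|\,p_c$, so Jensen's inequality gives

$\mathbb{E}\left[\frac{d}{|\mathcal{Z}|}\log d\right] \geq p_c \log\left(|\mathcal{Z}|\,p_c\right)$

For the upper bound, $d \leq \min\{c-1, |\mathcal{Z}|\}$ always holds, so

$\mathbb{E}\left[\frac{d}{|\mathcal{Z}|}(\log d + 1)\right] \leq |\mathcal{Z}|^{-1}\min\{(c-1)\log(c-1), |\mathcal{Z}|\log|\mathcal{Z}|\} + p_c$

with $0 \log 0$ read as $0$. The rounding term $p_c \leq 1$ is what contributes at most one extra overhead bit per chunk on top of the flag.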

4. Asymptotic Properties and Convergence

As $C \rightarrow \infty$, the expected cost per chunk satisfies

$H(\mathcal{Z}) + 1 \leq \lim_{C \to \infty} \Delta R_D(C) \leq H(\mathcal{Z}) + 2$

where $H(\mathcal{Z}) = \log|\mathcal{Z}|$ for the uniform source. Thus, classic deduplication is asymptotically within 2 bits per chunk of the source entropy, and at least 1 bit above it because of the flag. The dominant convergence rate is geometric,

$\mu_D = \lim_{c\to\infty}\frac{\Pr[z_{c+1}\notin\mathcal{D}^c]}{\Pr[z_c\notin\mathcal{D}^{c-1}]} = 1 - 1/|\mathcal{Z}|$

reflecting the exponential decay of unseen-chunk probabilities in the i.i.d. regime.
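The geometric rate translates directly into how many chunks must be processed before new chunks become rare. The sketch below is illustrative only; the helper names `unseen_probability` and `first_index_below` are hypothetical, and the threshold `eps` is an arbitrary choice rather than a quantity from the source.

```python
from math import ceil, log

def unseen_probability(c, Z):
    """Pr(z_c not in D^{c-1}) for a uniform i.i.d. source with |Z| = Z."""
    return (1 - 1 / Z) ** (c - 1)

def first_index_below(eps, Z):
    """Smallest chunk index c whose unseen probability is at most eps."""
    mu = 1 - 1 / Z                      # geometric rate mu_D
    # Solve mu ** (c - 1) <= eps for the smallest integer c.
    return ceil(log(eps) / log(mu)) + 1
```

With $|\mathcal{Z}| = 16$ and `eps = 0.01`, `first_index_below` returns 73: after 72 chunks, fewer than 1% of incoming chunks are expected to be new. Because $\mu_D = 1 - 1/|\mathcal{Z}|$ approaches 1 as $|\mathcal{Z}|$ grows, larger chunk sets converge more slowly.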

5. Numerical Illustration and Empirical Validation

A concrete example uses the (7,4) Hamming code:

$n=7$.
$\mathcal{X}' =$ Hamming codewords.
Select $|\mathcal{X}|=2$ codewords, and set $\mathcal{Y} =$ binary vectors of weight $\leq 1$, so $|\mathcal{Y}|=8$ and $|\mathcal{Z}|=16$.
Simulations over $C$ chunks confirm that the normalized quantities $R_D(C)/n$ and $\Delta R_D(C)/n$ fall quickly, with the per-chunk cost $\Delta R_D(C)$ dropping from the naïve $n+1=8$ bits toward the asymptotic $H(\mathcal{Z})+1=5$ bits.
The derived bounds closely track observed behavior throughout.
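A small Monte Carlo experiment reproduces this setting. The sketch below is not the simulation code behind the reported results; the function names, trial count, and seed are assumptions, and the two base words `0000000` and `1111111` are picked here for illustration (both are codewords of the (7,4) Hamming code, and their radius-1 neighborhoods are disjoint, giving 16 distinct chunks).

```python
import random
from statistics import mean

def build_chunk_set(n=7, bases=(0b0000000, 0b1111111)):
    """Z = X xor Y, with Y the n-bit vectors of weight <= 1 (as integers)."""
    deviations = [0] + [1 << i for i in range(n)]
    return sorted({b ^ y for b in bases for y in deviations})

def per_chunk_costs(chunks, n):
    """Coded length in bits of each chunk under classic deduplication."""
    seen = set()
    costs = []
    for z in chunks:
        if z in seen:
            # Flag bit + ceil(log2 |D|) pointer bits.
            costs.append(1 + (len(seen) - 1).bit_length())
        else:
            # Flag bit + raw n-bit chunk.
            costs.append(1 + n)
            seen.add(z)
    return costs

def simulate(C=200, trials=2000, n=7, seed=0):
    """Monte Carlo estimate of Delta R_D(c) for c = 1..C."""
    rng = random.Random(seed)
    Z = build_chunk_set(n)
    runs = [per_chunk_costs(rng.choices(Z, k=C), n) for _ in range(trials)]
    return [mean(column) for column in zip(*runs)]
```

The first entry of `simulate()` is always 8, since the first chunk is necessarily new, and later entries settle near 5 bits once the dictionary has filled. A running sum of the returned list gives an estimate of $R_D(C)$ that can be compared against the bounds of Section 2.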

6. Significance and Extensions

The Deduplication Lemma rigorously substantiates the compression efficiency and dictionary learning dynamics of classic deduplication. Setting $\mathcal{Y} = \{0^n\}$ yields standard deduplication, and the lemma's bounds then hold for every finite number of chunks $C$, not only asymptotically. Extensions in (Vestergaard et al., 2019) generalize the analysis to cases where deviations are permitted (i.e., $\mathcal{Z} = \mathcal{X} \oplus \mathcal{Y}$, $|\mathcal{Y}|>1$), allowing deduplication at the granularity of "bases" $\mathcal{X}$ with deviation coding. This yields comparable bounds with significantly accelerated convergence due to the smaller effective base dictionary, a key property for practical deduplication performance.

This analysis assumes truly i.i.d. chunk sources and uniformity over $\mathcal{Z}$; departures from these assumptions (correlated sources, variable-length chunks, or non-uniform distributions) are not directly addressed. The extension to generalized deduplication with nontrivial $\mathcal{Y}$ enables efficient handling of similar (not strictly identical) data, which is crucial for high-compression scenarios in enterprise and cloud storage settings (Vestergaard et al., 2019). The theoretical framework bridges information theory and practical coding, providing order-optimality guarantees for well-specified source models.


References

Vestergaard et al. (2019). Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties.
