
Deduplication Lemma in Source Coding

Updated 24 March 2026
  • The Deduplication Lemma rigorously bounds the expected code length of classic dictionary-based deduplication, a lossless compression technique.
  • It analyzes the encoding process where unseen chunks incur an n-bit cost and repeated chunks use logarithmic pointer bits, establishing both finite-sample and asymptotic performance guarantees.
  • The lemma highlights the convergence behavior of deduplication schemes in an i.i.d. uniform setting, showing compression within 1–2 bits of source entropy per chunk.

Deduplication Lemma

Deduplication, in the context of source coding, is a lossless compression technique for the efficient removal of long-range data duplicates. The Deduplication Lemma provides precise probabilistic and information-theoretic bounds on the expected code length incurred by classic dictionary-based deduplication. It is central to analyzing the compression efficiency and convergence behavior of deduplication schemes in the regime where data is modeled as a sequence of i.i.d. uniform chunks drawn from a finite subset of binary strings. This lemma, formalized as Theorem 2 in (Vestergaard et al., 2019), captures both finite-sample and asymptotic efficiency of deduplication protocols by relating the expected output length to the underlying source entropy and dictionary growth dynamics.

1. Source Model and Deduplication Scheme

Deduplication operates on an alphabet of binary vectors of length $n$, denoted $\mathbb{Z}_2^n$. The chunk source $\mathcal{Z} \subset \mathbb{Z}_2^n$, with cardinality $|\mathcal{Z}|$, is modeled as a uniform i.i.d. source: a sequence of $C$ chunks $z_1, \dots, z_C$ is drawn uniformly and independently from $\mathcal{Z}$. Classic deduplication maintains a dictionary $\mathcal{D}^{c-1}$ of the distinct chunks seen up to the $(c-1)$-th position. Encoding $z_c$ proceeds as follows (all logarithms are base 2):

  • If $z_c \notin \mathcal{D}^{c-1}$: emit a 1-bit 'new-chunk' flag, transmit the $n$-bit chunk, then add $z_c$ to the dictionary.
  • If $z_c \in \mathcal{D}^{c-1}$: emit a 1-bit 'repeat' flag and a pointer of length $\lceil \log |\mathcal{D}^{c-1}| \rceil$ bits.
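The two encoding rules can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation of (Vestergaard et al., 2019): the flag values (`"0"` for new, `"1"` for repeat), the representation of chunks as bit strings, and the function names `dedup_encode` and `dedup_decode` are all assumptions made for the example.

```python
def ceil_log2(d: int) -> int:
    """Return ceil(log2(d)) for an integer d >= 1 (0 when d == 1)."""
    return (d - 1).bit_length() if d > 1 else 0


def dedup_encode(chunks, n):
    """Encode a list of n-bit chunks (as '0'/'1' strings) into one bit string."""
    dictionary = {}  # chunk -> index in order of first appearance
    out = []
    for z in chunks:
        assert len(z) == n
        if z in dictionary:
            # Repeat: flag bit plus a pointer of ceil(log2 |D|) bits.
            width = ceil_log2(len(dictionary))
            pointer = format(dictionary[z], f"0{width}b") if width else ""
            out.append("1" + pointer)
        else:
            # New chunk: flag bit plus the raw n-bit chunk, then learn it.
            out.append("0" + z)
            dictionary[z] = len(dictionary)
    return "".join(out)


def dedup_decode(bits, n):
    """Invert dedup_encode by rebuilding the same dictionary on the fly."""
    dictionary, chunks, i = [], [], 0
    while i < len(bits):
        flag, i = bits[i], i + 1
        if flag == "0":
            z, i = bits[i:i + n], i + n
            dictionary.append(z)
        else:
            width = ceil_log2(len(dictionary))
            idx = int(bits[i:i + width], 2) if width else 0
            i += width
            z = dictionary[idx]
        chunks.append(z)
    return chunks


if __name__ == "__main__":
    data = ["101", "101", "010", "101"]
    code = dedup_encode(data, n=3)
    print(code, len(code))
    print(dedup_decode(code, n=3) == data)
```

Because the decoder grows its dictionary in the same order as the encoder, the pointer width never needs to be transmitted, which is what makes the scheme lossless with only one flag bit of overhead per chunk.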

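In the generalized deduplication framework of (Vestergaard et al., 2019), chunks are mapped by $\varphi$ to bases in a set $\mathcal{X}$ and differ from them by deviations in a set $\mathcal{Y}$. Classic deduplication is the special case $\mathcal{X} = \mathcal{Z}$ and $\mathcal{Y} = \{0^n\}$: each chunk constitutes its own "base," and the mapping $\varphi$ is the identity.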

2. Statement of the Deduplication Lemma

Let $R_D(C)$ denote the expected coded length after processing $C$ chunks, and let $\Delta R_D(C) = R_D(C) - R_D(C-1)$. Under the above model:

Lower Bound:

$$R_D(C) \geq \sum_{c=1}^C \left[ 1 + n \cdot \Pr\left(z_c \notin \mathcal{D}^{c-1}\right) + \Pr\left(z_c \in \mathcal{D}^{c-1}\right) \log \left(|\mathcal{D}^{c-1}|\right) \right]$$

Explicitly,

$$R_D(C) \geq C + \sum_{c=1}^C \left[ n (1 - |\mathcal{Z}|^{-1})^{c-1} + \left(1 - (1 - |\mathcal{Z}|^{-1})^{c-1}\right)\log\left(|\mathcal{Z}|\left[1 - (1 - |\mathcal{Z}|^{-1})^{c-1}\right]\right) \right]$$

Upper Bound:

$$R_D(C) \leq 2C + \sum_{c=1}^C \left[ n (1 - |\mathcal{Z}|^{-1})^{c-1} + |\mathcal{Z}|^{-1}\min\{ (c-1)\log(c-1), |\mathcal{Z}|\log|\mathcal{Z}| \} \right] - C$$

This lemma quantifies the cost per chunk (including flag bits, new chunk payloads, and dictionary pointer bits) as a function of chunk index, alphabet size, and dictionary growth.
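To get a feel for the bounds, the two explicit expressions can be evaluated numerically. The sketch below transcribes them term by term as stated above, including the trailing $-C$ in the upper bound. The function name `dedup_bounds` is hypothetical, base-2 logarithms are assumed, and $(c-1)\log(c-1)$ is taken as 0 when $c - 1 \leq 1$.

```python
import math


def dedup_bounds(C, n, Z):
    """Evaluate the explicit lower and upper bounds on R_D(C) as stated.

    C: number of chunks, n: chunk length in bits, Z: source cardinality |Z|.
    """
    q = 1.0 - 1.0 / Z          # probability that a fixed chunk is missed once
    lower = float(C)           # the leading "C" of the lower bound
    upper = 2.0 * C - C        # the leading "2C" and trailing "- C"
    for c in range(1, C + 1):
        p_new = q ** (c - 1)   # Pr(z_c not in dictionary)
        p_seen = 1.0 - p_new   # Pr(z_c in dictionary)
        lower += n * p_new
        if p_seen > 0.0:
            lower += p_seen * math.log2(Z * p_seen)
        k = c - 1
        k_log_k = k * math.log2(k) if k > 1 else 0.0
        upper += n * p_new + min(k_log_k, Z * math.log2(Z)) / Z
    return lower, upper


if __name__ == "__main__":
    for C in (1, 10, 100, 1000):
        lo, hi = dedup_bounds(C, n=7, Z=16)
        print(f"C={C:5d}  lower={lo:10.3f}  upper={hi:10.3f}")
```

For $C = 1$ both expressions reduce to $n + 1$, the cost of sending the first chunk verbatim with its flag bit.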

3. Probabilistic and Information-Theoretic Analysis

The expected one-step coding contribution of chunk $z_c$ is

$$\mathbb{E}[\text{length}] = 1 + \mathbb{E}\big[ \mathbb{I}\{z_c \notin \mathcal{D}^{c-1}\} \cdot n + \mathbb{I}\{z_c \in \mathcal{D}^{c-1}\} \cdot \ell(\mathcal{D}^{c-1}) \big],$$

where $\ell(\mathcal{D}^{c-1}) = \lceil \log |\mathcal{D}^{c-1}| \rceil$ is the pointer length. The probabilities are given by

$$\Pr\left(z_c \notin \mathcal{D}^{c-1}\right) = (1 - 1/|\mathcal{Z}|)^{c-1}, \quad \Pr\left(z_c \in \mathcal{D}^{c-1}\right) = 1 - (1 - 1/|\mathcal{Z}|)^{c-1}.$$

These combine to yield the bounds by sandwiching $\ell(\mathcal{D}^{c-1})$ between $\log |\mathcal{D}^{c-1}|$ and $\log |\mathcal{D}^{c-1}| + 1$ and summing over all $c$.
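For small $|\mathcal{Z}|$ the one-step expectation can also be computed exactly rather than bounded, by tracking the distribution of the dictionary size. The sketch below is one way to do this under the uniform i.i.d. model: the dictionary grows from $d$ to $d+1$ entries with probability $1 - d/|\mathcal{Z}|$. The function name `exact_step_costs` is hypothetical and base-2 logarithms are assumed.

```python
def ceil_log2(d: int) -> int:
    """Return ceil(log2(d)) for d >= 1, and 0 for d <= 1."""
    return (d - 1).bit_length() if d > 1 else 0


def exact_step_costs(C, n, Z):
    """Exact expected coded length of chunk c, for c = 1..C.

    dist[d] holds Pr(|D^{c-1}| = d), the dictionary-size distribution
    just before chunk c is encoded.
    """
    dist = [0.0] * (Z + 1)
    dist[0] = 1.0  # the dictionary starts empty
    costs = []
    for _ in range(C):
        # 1 flag bit, plus n bits if new (prob 1 - d/Z) or a pointer if seen.
        cost = 1.0
        for d, p in enumerate(dist):
            if p:
                cost += p * ((1 - d / Z) * n + (d / Z) * ceil_log2(d))
        costs.append(cost)
        # Advance the dictionary-size distribution by one chunk.
        nxt = [0.0] * (Z + 1)
        for d, p in enumerate(dist):
            if p:
                nxt[d] += p * d / Z
                if d < Z:
                    nxt[d + 1] += p * (1 - d / Z)
        dist = nxt
    return costs


if __name__ == "__main__":
    costs = exact_step_costs(C=200, n=7, Z=16)
    print([round(x, 4) for x in costs[:5]], round(costs[-1], 4))
```

The exact values can then be compared chunk by chunk against the per-chunk terms of the lower bound, which follow from the ceiling sandwich together with Jensen's inequality applied to the convex function $x \log x$.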

4. Asymptotic Properties and Convergence

As $C \rightarrow \infty$, the expected cost per chunk satisfies

$$H(\mathcal{Z}) + 1 \leq \lim_{C \to \infty} \Delta R_D(C) \leq H(\mathcal{Z}) + 2.$$

Thus, classic deduplication is asymptotically within 2 bits per chunk of the source entropy, with a minimum gap of 1 bit. The dominant convergence rate is geometric,

$$\mu_D = \lim_{c\to\infty}\frac{\Pr[z_{c+1}\notin\mathcal{D}^c]}{\Pr[z_c\notin\mathcal{D}^{c-1}]} = 1 - 1/|\mathcal{Z}|,$$

reflecting the exponential decay of unseen-chunk probabilities in the i.i.d. regime.
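The asymptotic sandwich can be traced in one short step. The derivation below is a sketch under the model's own assumptions: the source is uniform, so $H(\mathcal{Z}) = \log|\mathcal{Z}|$, and the dictionary eventually contains every element of $\mathcal{Z}$, so every chunk becomes a repeat.

```latex
\begin{align*}
\lim_{C\to\infty} \Delta R_D(C)
  &= \underbrace{1}_{\text{flag}}
   + \underbrace{\lceil \log|\mathcal{Z}| \rceil}_{\text{pointer into a full dictionary}}, \\
\log|\mathcal{Z}| \;\le\; \lceil \log|\mathcal{Z}| \rceil &\;<\; \log|\mathcal{Z}| + 1, \\
\Longrightarrow\quad
H(\mathcal{Z}) + 1 \;\le\; \lim_{C\to\infty} \Delta R_D(C) &\;<\; H(\mathcal{Z}) + 2 .
\end{align*}
```

The lower end is attained exactly when $|\mathcal{Z}|$ is a power of two, since the ceiling is then inactive.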

5. Numerical Illustration and Empirical Validation

A concrete example uses the (7,4) Hamming code:

  • $n = 7$.
  • $\mathcal{X}'$ is the set of Hamming codewords.
  • Select $|\mathcal{X}| = 2$ codewords, and let $\mathcal{Y}$ be the set of binary vectors of weight $\leq 1$ (so $|\mathcal{Y}| = 8$), giving $|\mathcal{Z}| = 2 \cdot 8 = 16$.
  • Simulations over $C$ chunks confirm that the average cost $R_D(C)/C$ and the incremental cost $\Delta R_D(C)$ drop from the naïve $n + 1 = 8$ bits/chunk toward the asymptotic $H(\mathcal{Z}) + 1 = 5$ bits/chunk.
  • The derived bounds closely track observed behavior throughout.
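A small Monte Carlo run reproduces the qualitative behavior of this example. The sketch below is not the simulation from (Vestergaard et al., 2019). It assumes the two selected codewords are the all-zeros and all-ones words (both belong to the (7,4) Hamming code, but the source does not say which two were used), and the function name `simulate_hamming_dedup` and the seed are illustrative choices.

```python
import random


def ceil_log2(d: int) -> int:
    """Return ceil(log2(d)) for d >= 1, and 0 for d <= 1."""
    return (d - 1).bit_length() if d > 1 else 0


def simulate_hamming_dedup(C, seed=0):
    """Run classic deduplication on C uniform chunks from the 16-element source.

    Returns the source set and the per-chunk coded lengths in bits.
    """
    n = 7
    codewords = (0b0000000, 0b1111111)             # assumed choice of bases
    deviations = [0] + [1 << i for i in range(n)]  # weight <= 1 vectors
    source = sorted({x ^ y for x in codewords for y in deviations})

    rng = random.Random(seed)
    dictionary = set()
    costs = []
    for _ in range(C):
        z = rng.choice(source)
        if z in dictionary:
            # Repeat: 1 flag bit + pointer into the current dictionary.
            costs.append(1 + ceil_log2(len(dictionary)))
        else:
            # New chunk: 1 flag bit + n raw bits, then store it.
            costs.append(1 + n)
            dictionary.add(z)
    return source, costs


if __name__ == "__main__":
    source, costs = simulate_hamming_dedup(C=2000)
    print("source size:", len(source))
    for C in (1, 10, 100, 2000):
        print(f"C={C:5d}  average bits/chunk = {sum(costs[:C]) / C:.3f}")
```

The first chunk always costs $n + 1 = 8$ bits, and once the dictionary holds all 16 chunks every further chunk costs $1 + \lceil \log 16 \rceil = 5$ bits, so the running average approaches 5 from above.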

6. Significance and Extensions

The Deduplication Lemma rigorously substantiates the compression efficiency and dictionary learning dynamics of classic deduplication. Setting $\mathcal{Y} = \{0^n\}$ yields standard deduplication; the lemma's bounds apply tightly to its finite-chunk regime. Extensions in (Vestergaard et al., 2019) generalize the analysis to cases where deviations are permitted (i.e., $\mathcal{Z} = \mathcal{X} \oplus \mathcal{Y}$, $|\mathcal{Y}| > 1$), allowing deduplication at the granularity of "bases" $\mathcal{X}$ with deviation coding. This yields comparable bounds with significantly accelerated convergence due to the smaller effective base dictionary, a key property for practical deduplication performance.

This analysis assumes truly i.i.d. chunk sources and uniformity over $\mathcal{Z}$; deviations from these prerequisites (correlated sources, variable-length chunks, or non-uniform distributions) are not directly addressed. The extension to generalized deduplication with nontrivial $\mathcal{Y}$ enables efficient handling of similar (not strictly identical) data, which is crucial for high-compression scenarios in enterprise and cloud storage settings (Vestergaard et al., 2019). The theoretical framework bridges information theory and practical coding, providing order-optimality guarantees for well-specified source models.

