Shannon's Source Coding Theorem
- Shannon's Source Coding Theorem is a foundational principle that quantifies the minimal bits per symbol for lossless compression based on the source’s entropy.
- It leverages the Asymptotic Equipartition Property and typical sets to construct efficient prefix codes whose average lengths approach the theoretical entropy limit.
- The theorem underpins modern compression methods like Huffman and arithmetic coding, offering practical insights for improving digital storage and transmission.
Shannon's Source Coding Theorem, also known as the noiseless coding theorem, establishes the fundamental limits of data compression for discrete memoryless sources. It states that for any given stationary ergodic information source with entropy rate $H$, the average number of bits per symbol required for lossless representation can be made arbitrarily close to $H$ using sufficiently long block codes, but cannot be reduced below this bound. The theorem provides the theoretical foundation for modern data compression methods and delineates the asymptotic optimality of entropy as the minimal achievable coding rate (Stone, 2018, Suhov et al., 2016).
1. Foundational Definitions and Source Model
The discrete memoryless source (DMS) is characterized by a finite alphabet $\mathcal{X}$ and an associated probability mass function $p(x)$. Each source symbol is selected independently according to $p$, so a block $x^n = (x_1, \ldots, x_n)$ has joint probability $p(x^n) = \prod_{i=1}^{n} p(x_i)$. The single-letter entropy is given by
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),$$
which quantifies the average uncertainty per source symbol in bits (Stone, 2018). For stationary ergodic sources, the entropy rate is generalized as
$$H = \lim_{n \to \infty} -\frac{1}{n} \sum_{x^n \in \mathcal{X}^n} p_n(x^n) \log_2 p_n(x^n),$$
where the sum runs over all $|\mathcal{X}|^n$ $n$-tuples, $|\mathcal{X}|$ is the alphabet size, and $p_n(x^n)$ is the $n$-tuple probability (Suhov et al., 2016).
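As a concrete illustration of these definitions, the sketch below computes the single-letter entropy and the $n$-block entropy of a DMS directly from its probability mass function; the distribution used is an arbitrary illustrative choice, not one taken from the cited sources. For independent symbols the block entropy satisfies $H(X^n) = nH(X)$, so the per-symbol value is unchanged.

```python
import itertools
import math

def entropy(pmf):
    """Single-letter entropy H(X) in bits of a probability mass function."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def block_entropy(pmf, n):
    """Entropy H(X^n) of an n-block emitted by a discrete memoryless source."""
    h = 0.0
    for block in itertools.product(pmf, repeat=n):
        p = math.prod(pmf[x] for x in block)   # independence: p(x^n) = prod_i p(x_i)
        h -= p * math.log2(p)
    return h

# Illustrative (hypothetical) source distribution.
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(entropy(pmf))               # 1.75 bits/symbol
print(block_entropy(pmf, 3) / 3)  # also 1.75: H(X^n)/n = H(X) for a DMS
```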
2. Formal Statement of the Theorem
Shannon’s Source Coding Theorem: For a stationary, memoryless source with entropy $H$ and any $\epsilon > 0$, there exists, for large enough blocklength $n$, a prefix code such that the average code length per letter satisfies
$$\frac{\mathbb{E}[L_n]}{n} \leq H + \epsilon,$$
where $\mathbb{E}[L_n]$ is the expected length of the codeword per block. Conversely, no lossless code (even non-prefix) can achieve an average code length below $H$ per symbol. Thus, as $n \to \infty$, the achievable coding rate per symbol converges to $H$ (Stone, 2018, Suhov et al., 2016).
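The achievability direction can be checked numerically without constructing an optimal code: assigning each block the Shannon code length $\lceil -\log_2 p(x^n) \rceil$ satisfies the Kraft inequality and yields $\mathbb{E}[L_n]/n \leq H + 1/n$. The sketch below applies this to a hypothetical binary source chosen only for illustration.

```python
import itertools
import math

def shannon_code_rate(pmf, n):
    """Per-symbol expected length of a Shannon code on n-blocks, using codeword
    lengths ceil(-log2 p(x^n)), which satisfy the Kraft inequality."""
    expected_len = 0.0
    for block in itertools.product(pmf, repeat=n):
        p = math.prod(pmf[x] for x in block)
        expected_len += p * math.ceil(-math.log2(p))
    return expected_len / n

pmf = {0: 0.9, 1: 0.1}                             # hypothetical binary DMS
H = -sum(p * math.log2(p) for p in pmf.values())   # ~0.469 bits/symbol
for n in (1, 2, 4, 8, 12):
    print(n, round(shannon_code_rate(pmf, n), 4))  # at most H + 1/n, approaching H
print("entropy:", round(H, 4))
```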
3. Methods of Proof and Typical Sets
The traditional proof leverages the Asymptotic Equipartition Property (AEP), which states that for large $n$, the set of typical sequences $A_\epsilon^{(n)}$ concentrates almost all probability mass. The cardinality of this set is bounded as $|A_\epsilon^{(n)}| \leq 2^{n(H+\epsilon)}$.
To construct efficient codes, typical sequences are mapped to codewords of length approximately $n(H+\epsilon)$, ensuring all but an arbitrarily small probability fraction are represented compactly; non-typical sequences are assigned fixed, possibly longer, codewords, making their average length contribution negligible as $n$ increases (Stone, 2018). The counting argument via the Kraft inequality ensures the converse: reducing the average code length below $H$ per symbol would violate the prefix condition.
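The typical-set bookkeeping can be made explicit for a binary source: grouping length-$n$ sequences by their number of ones, the sketch below counts the weakly typical sequences, verifies the bound $|A_\epsilon^{(n)}| \leq 2^{n(H+\epsilon)}$, and shows the typical probability mass approaching one. The Bernoulli parameter is a hypothetical choice for illustration.

```python
import math

def typical_set_stats(p1, n, eps):
    """For a Bernoulli(p1) source, a length-n sequence with k ones is weakly
    typical iff |-(1/n) log2 p(x^n) - H| <= eps; count them and their mass."""
    H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
    count, mass = 0, 0.0
    for k in range(n + 1):
        logp = k * math.log2(p1) + (n - k) * math.log2(1 - p1)
        if abs(-logp / n - H) <= eps:
            count += math.comb(n, k)
            mass += math.comb(n, k) * (p1 ** k) * ((1 - p1) ** (n - k))
    return count, mass, H

p1, eps = 0.2, 0.1                       # hypothetical source, H ~= 0.722 bits
for n in (25, 100, 400):
    count, mass, H = typical_set_stats(p1, n, eps)
    print(n, round(mass, 3), count <= 2 ** (n * (H + eps)))  # mass -> 1, bound holds
```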
Alternative proofs, notably in channel coding, employ the Markov inequality and the law of large numbers to avoid explicit construction of typical sets or dependencies on the AEP, providing didactic simplifications and extending naturally to other sources (Lomnitz et al., 2012).
4. Generalizations: Entropy Rate, Markov Sources, and Large Deviations
For stationary ergodic processes, including Markov sources, the entropy rate governs compressibility; for an ergodic Markov chain it takes the closed form
$$H = -\sum_{i,j} \pi_i P_{ij} \log_2 P_{ij},$$
where $\pi_i$ and $P_{ij}$ denote stationary and transition probabilities, respectively (Suhov et al., 2016). The Shannon–McMillan–Breiman theorem establishes that for almost every trajectory, $-\frac{1}{n} \log_2 p(X_1, \ldots, X_n) \to H$. The asymptotic typical set contains approximately $2^{nH}$ elements, and the average length of any lossless code cannot fall below this bound per block.
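A minimal numerical check of the Markov-source formula, using a hypothetical two-state chain whose transition matrix is chosen only for illustration:

```python
import numpy as np

def markov_entropy_rate(P):
    """Entropy rate H = -sum_{i,j} pi_i P_ij log2 P_ij of an ergodic Markov chain."""
    # Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    safe_log = np.log2(np.where(P > 0, P, 1.0))   # treat 0 * log 0 as 0
    return float(-np.sum(pi[:, None] * P * safe_log))

# Sticky two-state chain: dependence lowers the rate below 1 bit/symbol.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(markov_entropy_rate(P))   # ~0.55 bits/symbol, versus log2(2) = 1 for a fair i.i.d. bit
```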
Further, large deviations theory enables refined analysis by incorporating arbitrary storage distributions, utility-weighted selection criteria, and general alphabets. The number of codewords required can be further reduced by restricting to subsets defined by both typicality and additional utility constraints, with the set size governed by minimizing a large-deviation rate function $I$ over a constraint set $\Gamma$ that encodes the relevant requirements (Suhov et al., 2016).
5. Coding Strategies and Practical Implications
The theorem’s idealized compression bounds are approached by block coding. Single-letter strategies, such as Huffman coding, fulfill $H \leq \bar{L} < H + 1$ and suffice for small alphabets. As block length $n$ increases, the gap to $H$ shrinks at the cost of exponential codebook size. In practical data compression, moderate $n$ yields per-symbol rates within negligible fractions of a bit of entropy, often realized by arithmetic coding or universal schemes such as Lempel–Ziv algorithms (Stone, 2018).
Example: Summed Dice Source
For a source $X \in \{2, 3, \ldots, 12\}$ representing the sum of two fair dice, $H(X) \approx 3.27$ bits/symbol. Fixed-length coding gives $4$ bits/symbol, Huffman coding achieves an average of about $3.31$ bits/symbol, and block coding over pairs with arithmetic or joint Huffman coding drives the average as close to $3.27$ as desired (Stone, 2018).
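These figures can be reproduced with a standard heap-based Huffman construction; the sketch below is an independent reimplementation (not code from Stone, 2018) that computes the single-letter average length and the per-symbol rate of a joint Huffman code over pairs.

```python
import heapq
import itertools
import math

def huffman_lengths(pmf):
    """Return {symbol: codeword length} for an optimal binary prefix (Huffman) code."""
    heap = [(p, i, (sym,)) for i, (sym, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in pmf}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, i2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:          # every leaf under the merged node gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, i2, syms1 + syms2))
    return lengths

# Sum of two fair dice: values 2..12 with probabilities (1, 2, ..., 6, ..., 2, 1)/36.
dice = {s: (6 - abs(s - 7)) / 36 for s in range(2, 13)}
H = -sum(p * math.log2(p) for p in dice.values())          # ~3.27 bits/symbol

len1 = huffman_lengths(dice)
L1 = sum(dice[s] * len1[s] for s in dice)                  # ~3.31 bits/symbol

# Joint Huffman code over pairs: per-symbol rate is at most L1 and at least H.
pairs = {(a, b): dice[a] * dice[b] for a, b in itertools.product(dice, repeat=2)}
len2 = huffman_lengths(pairs)
L2 = sum(pairs[x] * len2[x] for x in pairs) / 2

print(round(H, 3), round(L1, 3), round(L2, 3))
```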
6. Extensions and Utility-Constrained Compression
Shannon’s original result is extended to encompass non-uniform storage costs, auxiliary utility measures, and sources with general alphabets. By employing large deviation principles, one selects sets of sequences based on both their information content and auxiliary functions (additive or multiplicative), enabling trade-offs between storage rate, error probability, and utility. In Markov and general settings, this approach yields precise rate-function-based bounds on storage requirements and integrates with convex optimization frameworks (Suhov et al., 2016).
7. Theoretical Significance and Limitations
Shannon’s Source Coding Theorem constitutes a cornerstone of information theory, precisely quantifying the minimal data rate for lossless compression as the entropy rate of the source. The result is robust under asymptotic blocklength and stationary ergodic source assumptions. No code, prefix or not, can compress below this fundamental lower bound. This universality underpins all subsequent advances in lossless compression and informs the design of efficient coding algorithms across discrete and continuous, memoryless or Markov, and more general source models (Stone, 2018, Suhov et al., 2016).