Constant Composition Distribution Matching

Updated 26 March 2026

CCDM is a fixed-length, invertible mapping that converts uniform bits into output symbol sequences with a constant empirical distribution, ensuring accurate probabilistic amplitude shaping.
It employs enumerative and arithmetic coding techniques to generate permutations of symbols with prescribed frequencies, optimizing spectral efficiency in communications.
Finite-length implementations highlight tradeoffs in rate loss and divergence, prompting the development of parallel, lookup-based, and multi-composition variants to enhance practical performance.

Constant Composition Distribution Matching (CCDM) is a fixed-length, invertible mapping that transforms a sequence of uniformly distributed bits into a sequence of output symbols drawn from a finite alphabet, with the property that every output block has exactly the same empirical distribution: its composition vector is fixed. CCDM is the canonical distribution matcher (DM) for probabilistic amplitude shaping (PAS) in communication systems, where one must generate amplitude sequences matching a prescribed, typically nonuniform, probability mass function (PMF) to approach the capacity of channels whose optimal input distribution is nonuniform. All blocks output by the mapper are therefore permutations of each other, differing only in symbol order, with the number of occurrences of each symbol determined by the specified composition.

1. Definition and Mathematical Foundations

Given a finite alphabet $\mathcal{A} = \{a_1, \dots, a_m\}$ and target block length $n$, a composition (type) vector $C = (n_1, \dots, n_m)$ is specified such that $n_i \approx n P_A(a_i)$ and $\sum_{i=1}^m n_i = n$, where $P_A$ is the quantized target PMF. The constant-composition type class is

$T(n; C) = \left\{ x^n \in \mathcal{A}^n \ :\ \#\{ j : x_j = a_i \} = n_i,\ \forall\ i \right\}.$

The codebook size is the multinomial

$|T(n;C)| = \frac{n!}{\prod_{i=1}^m n_i!}.$

An injective, and hence invertible, mapping $f_\text{CCDM}\colon \{0,1\}^k \to T(n;C)$ is constructed, where $k = \lfloor\log_2 |T(n;C)|\rfloor$. Only $2^k$ of the $|T(n;C)|$ sequences are used, and $R = k/n$ is the matching rate in bits per symbol (Schulte et al., 2015, Fehenberger et al., 2018, Gültekin et al., 2019).
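As a small numerical illustration of these definitions, the following sketch computes the type-class size, the number of input bits $k$, and the rate $R$ for a given composition. The function names and the example composition $(4, 3, 2, 1)$ are chosen here for illustration only.

```python
from math import factorial


def type_class_size(composition):
    """Multinomial coefficient n! / (n_1! ... n_m!) as an exact integer."""
    size = factorial(sum(composition))
    for n_i in composition:
        size //= factorial(n_i)  # each partial quotient is an integer
    return size


def ccdm_parameters(composition):
    """Return (|T(n;C)|, k, R) with k = floor(log2 |T(n;C)|) and R = k/n."""
    size = type_class_size(composition)
    k = size.bit_length() - 1  # exact floor(log2(size)) for size >= 1
    return size, k, k / sum(composition)


# Example: n = 10 symbols over a 4-ary alphabet.
size, k, rate = ccdm_parameters((4, 3, 2, 1))
print(size, k, rate)  # 12600 13 1.3
```

Here 12600 sequences are available, but only $2^{13} = 8192$ of them are addressed by the 13 input bits.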

2. Algorithms and Implementations

CCDM is implemented using either enumerative (rank/unrank) coding or arithmetic coding (AC).

Enumerative (Rank/Unrank) Coding: The input bits are interpreted as an integer index $j$, which is then mapped to the $j$-th sequence in lexicographic order of $T(n;C)$. Each output symbol is determined by stepping through the candidate symbols in alphabet order, comparing the index against the number of sequences that continue with each candidate, and then decrementing the chosen symbol's remaining count and updating the index (Fehenberger et al., 2018, Gültekin et al., 2019).
Arithmetic Coding Based CCDM: Input bits form a binary fraction in $[0,1)$ . The output sequence is generated via successive interval refinements, with symbol probabilities determined by the “without replacement” counts $q_i^{(j)} = c_i^{(j)}/(n-j+1)$ , where $c_i^{(j)}$ are the remaining symbol counts at step $j$ (Schulte et al., 2015, Fehenberger et al., 2020). Demapping is the exact inverse process.
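The enumerative approach can be sketched in a few lines. The code below is a minimal, unoptimized illustration using exact big-integer multinomials; it is not the implementation of any of the cited works. The function names (`unrank`, `rank`, `ccdm_map`, `ccdm_demap`) and the integer symbol labels $0, \dots, m-1$ are assumptions made for this example.

```python
from math import factorial


def multinomial(counts):
    """Number of distinct orderings of a multiset with the given counts."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def unrank(index, composition):
    """Return the index-th sequence (lexicographic order) of the type class."""
    counts = list(composition)
    seq = []
    for _ in range(sum(composition)):
        for sym, c in enumerate(counts):
            if c == 0:
                continue
            counts[sym] -= 1
            block = multinomial(counts)  # sequences continuing with `sym`
            if index < block:
                seq.append(sym)
                break
            index -= block      # skip every sequence continuing with `sym`
            counts[sym] += 1    # restore the count and try the next symbol
        else:
            raise ValueError("index exceeds the type-class size")
    return seq


def rank(seq, composition):
    """Inverse of unrank: lexicographic index of seq within the type class."""
    counts = list(composition)
    index = 0
    for x in seq:
        for sym in range(x):  # count sequences that branch to a smaller symbol
            if counts[sym] > 0:
                counts[sym] -= 1
                index += multinomial(counts)
                counts[sym] += 1
        counts[x] -= 1
    return index


def ccdm_map(bits, composition):
    """Interpret a non-empty bit list as an integer and unrank it."""
    return unrank(int("".join(str(b) for b in bits), 2), composition)


def ccdm_demap(seq, composition, k):
    """Recover the k input bits (k >= 1) from an output sequence."""
    return [int(b) for b in format(rank(seq, composition), f"0{k}b")]


# Composition (2, 1, 1): |T| = 12, so k = 3 input bits.
print(ccdm_map([0, 0, 0], (2, 1, 1)))  # [0, 0, 1, 2]
print(ccdm_demap([0, 0, 1, 2], (2, 1, 1), 3))  # [0, 0, 0]
```

Each multinomial is recomputed from scratch here for clarity; a practical implementation would update it incrementally or replace it with arithmetic coding.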

Several variants for optimizing complexity and hardware friendliness have been introduced:

Multiset Ranking (MR-CCDM): Reduces sequential operations relative to AC by removing symbol-by-symbol dependence (Fehenberger et al., 2020).
Subset Ranking (SR-CCDM): Specializes to binary alphabets, using combinatorial rank/unrank via subset indices, enabling further parallelism and lookup-table (LUT) based implementations (Fehenberger et al., 2019).
Log-CCDM: Replaces multiplies/divides of standard AC with purely additive LUT-based operations, reducing hardware requirements and needed arithmetic precision from $O(n)$ to $O(\log n)$ bits (Gültekin et al., 2022).
Finite-Precision Arithmetic (FPA): Practical AC CCDM implementations must round all intervals. The resulting rate loss is shown to decay exponentially with the number of precision bits and can be tightly bounded (Pikus et al., 2019).
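The idea of ranking a binary constant-composition sequence through the positions of one of its symbols can be illustrated with the textbook combinatorial number system. The sketch below is a generic construction, not the specific SR-CCDM architecture of the cited work, and it orders sequences colexicographically rather than lexicographically. The function names are hypothetical.

```python
from math import comb


def subset_rank(positions):
    """Rank of a set of 0-based positions in the combinatorial number system.

    Each term depends on a single position only, so the terms could be
    read from a lookup table and summed in parallel.
    """
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))


def subset_unrank(index, weight):
    """Inverse of subset_rank: the sorted positions of the `weight` ones."""
    positions = []
    for i in range(weight, 0, -1):
        p = i - 1
        while comb(p + 1, i) <= index:  # largest p with comb(p, i) <= index
            p += 1
        index -= comb(p, i)
        positions.append(p)
    return sorted(positions)


def to_bits(positions, n):
    """Binary length-n sequence with ones at the given positions."""
    chosen = set(positions)
    return [1 if j in chosen else 0 for j in range(n)]


# n = 4, two ones: comb(4, 2) = 6 sequences, indexed 0..5.
print(to_bits(subset_unrank(5, 2), 4))  # [0, 0, 1, 1]
print(subset_rank([2, 3]))              # 5
```

The number of terms in the rank sum equals the number of ones, so ranking by the less frequent symbol keeps the number of serial steps small.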

3. Rate, Rate Loss, and Divergence

The achievable rate and rate loss are determined as follows:

Operational Rate: $R_\mathrm{CCDM}(n) = \frac{k}{n} = \frac{1}{n} \lfloor \log_2 |T(n;C)| \rfloor$.
Target Entropy: $H(P_A) = -\sum_{i} P_A(a_i) \log_2 P_A(a_i)$ .
Rate Loss: $\Delta R(n) = H(P_A) - R_\mathrm{CCDM}(n)$ , characterizing the penalty due to finite blocklength and integer quantization.

Stirling’s approximation yields

$R_\mathrm{CCDM}(n) = H(P_A) - \frac{m-1}{2n}\log_2 n + O\left(\frac{1}{n}\right),$

so for large $n$, CCDM is asymptotically optimal ($R \to H(P_A)$, $\Delta R \to 0$). However, at short blocklengths the rate loss becomes significant, since $\Delta R(n)$ decays only as $O(\log_2 n / n)$. The same finite-length penalty appears in the normalized informational divergence $D(P_{A^n}\|P_A^n)/n = H(\hat{P}) - R + D(\hat{P}\|P_A)$, where $\hat{P}(a_i) = n_i/n$ is the empirical distribution fixed by the composition; this divergence vanishes as $n \to \infty$ (Schulte et al., 2015, Schulte et al., 2017, Gültekin et al., 2019).
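The Stirling expansion can be checked numerically. The following sketch compares the exact rate loss with the leading term $\frac{m-1}{2n}\log_2 n$ for an example PMF $(0.4, 0.3, 0.2, 0.1)$, using block lengths that are multiples of 10 so that no quantization is needed. The PMF, the block lengths, and the function names are assumptions for illustration.

```python
from math import factorial, log2


def rate(composition):
    """CCDM rate k/n with k = floor(log2 |T(n;C)|), computed exactly."""
    n = sum(composition)
    size = factorial(n)
    for n_i in composition:
        size //= factorial(n_i)
    return (size.bit_length() - 1) / n


def entropy(pmf):
    """Entropy of a PMF in bits."""
    return -sum(p * log2(p) for p in pmf if p > 0)


def compare(base=(4, 3, 2, 1), scales=(1, 10, 100)):
    """Rows of (n, exact rate loss, leading term (m-1)/(2n) * log2(n))."""
    m, total = len(base), sum(base)
    pmf = [c / total for c in base]
    rows = []
    for s in scales:
        composition = [s * c for c in base]  # exact type, no rounding needed
        n = s * total
        loss = entropy(pmf) - rate(composition)
        leading = (m - 1) / (2 * n) * log2(n)
        rows.append((n, loss, leading))
    return rows


for n, loss, leading in compare():
    print(f"n={n:5d}  exact rate loss={loss:.4f}  leading term={leading:.4f}")
```

The two columns differ by an $O(1/n)$ term, which includes the penalty from rounding $\log_2 |T(n;C)|$ down to an integer number of bits.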

4. Role in PAS and Fiber-Optic Systems

CCDM is the established distribution matcher in Probabilistic Amplitude Shaping (PAS) frameworks for coded modulation. Its function is to realize nonuniform amplitude distributions by enforcing constant empirical symbol frequencies; the shaped amplitudes are then combined with uniform sign bits and FEC coding in PAS. Although PAS requires an invertible, block-to-block mapping for forward error correction, the strict constant composition of CCDM at finite blocklengths introduces operational tradeoffs.

Recent work has highlighted that in nonlinear fiber-optic systems, the CCDM block length significantly affects the post-transmission effective SNR. The effective SNR decreases as the block length $N$ grows, from a gain of about 0.7 dB at $N=10$ down to the baseline at $N=10^4$ (Fehenberger et al., 2020). Two mechanisms are involved:

2D QAM Symbol Correlations: Small blocks exhibit bias in high-energy symbol pairs due to limited pairings.
Temporal "Shuffling": The inherent limit on long runs of identical symbols in short blocks disrupts nonlinear phase noise buildup.

Guidelines for optical system design now include striking a balance between shaping rate loss (favors large $N$ ) and nonlinear-noise reduction (favors small $N$ ), with the optimum $N$ varying with link and modulation format. Adding an interleaver can erase block-length induced SNR dependence (Fehenberger et al., 2020, Wu et al., 2021).

5. Extensions: List-based and Multi-composition DM

CCDM has inspired a series of generalized distribution matchers that relax the constant-composition constraint or exploit per-block diversity:

Multi-composition DM (MCDM): Instead of strictly one composition, MCDM allows a union of several, increasing the codebook size, raising the mapping rate, and reducing the output–target divergence, especially for short $n$ (Pikus et al., 2019).
Multiset-Partition DM (MPDM): Generalizes CCDM by permitting controlled variability in compositions, obtaining greater codebook sizes and improved rate/distance tradeoffs while retaining practical combinatorial/AC-based mappings (Fehenberger et al., 2018, Gültekin et al., 2019).
List-encoding CCDM (L-CCDM): Incorporates the energy-dispersion index (EDI) as a nonlinear-noise-aware figure of merit, computing a small list of candidate sequences per block and selecting those with lowest EDI, yielding up to 0.35 dB SNR gain in fiber transmission at modest additional complexity and negligible rate penalty for moderate blocklengths (Wu et al., 2021).

Comparisons of CCDM with these variants, and with alternative shaping schemes such as enumerative sphere shaping (ESS) and shell mapping (SM), consistently find that although CCDM is asymptotically optimal, its finite-length rate loss and divergence motivate relaxed or tailorable alternatives: MPDM and ESS achieve lower rate loss than CCDM at practical block lengths (Fehenberger et al., 2018, Gültekin et al., 2019, Gültekin et al., 2019).

6. Complexity, Parallelization, and Practical Implementation

Complexity and throughput are crucial for integration into real-time systems:

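AC-CCDM: Inherently sequential, with the number of serial steps set by the number of input bits $k$ in mapping and by the block length $n$ in demapping; the total cost is $O(n|\mathcal{A}|)$ arithmetic operations.
MR-CCDM/SR-CCDM: Reduce the number of serial operations to only $O(n)$ steps with partial parallelization (MR), or down to $\min(k, n-k)$ steps in SR for binary alphabets, where $k$ here denotes the number of occurrences of one of the two symbols rather than the number of input bits.
Parallel-Amplitude DM (PA-DM): Decomposes a nonbinary CCDM into several parallel binary SR-CCDMs, yielding $(m-1)$-fold parallelization with negligible rate penalty (Fehenberger et al., 2019).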
Lookup Table and Log-CCDM: Hardware-friendly designs based on LUTs or shift/add-only arithmetic, with storage requirements typically below 10 kB for $n$ up to about 1000 (Gültekin et al., 2022, Fehenberger et al., 2020).
Finite Precision: Loss from rounding in fixed-point/integer AC decays exponentially with mantissa bits; $b\gtrsim \log_2 n$ suffices to make extra rate loss negligible (Pikus et al., 2019).
Integration: CCDM and its variants can be modularly embedded in composite shaping or joint demapping/decoding chains (e.g., HCSS, PAS frontends) to exploit either rate, complexity, or nonlinear-noise tradeoffs (Fehenberger et al., 2020, Wu et al., 2021).

7. Performance Metrics and Practical Tradeoffs

Quantitative performance is governed by the required blocklength $n$ to reach a target rate loss or divergence:

To achieve rate loss $\lesssim 0.025$ bits/symbol or to stay within 0.2 dB of AWGN channel capacity, CCDM typically requires $n\gtrsim 250$ for QAM/PAS applications (Fehenberger et al., 2018, Gültekin et al., 2019).
At smaller $n$, CCDM's normalized divergence and rate loss grow, scaling roughly as $(\log_2 n)/n$, and become the limiting factor for shaping gain.
Run-length metrics, kurtosis, and energy-dispersion indices now play a role in advanced optical systems, linking symbolic structure in output blocks to nonlinear transmission impairments (Fehenberger et al., 2020, Wu et al., 2021).
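The block length needed for a given rate-loss target can be found by direct search. The sketch below assumes a largest-remainder rounding rule for turning $n P_A(a_i)$ into integer counts and uses an example PMF $(0.4, 0.3, 0.2, 0.1)$; both are illustrative assumptions, so the resulting block length need not match the figures quoted above for the cited QAM/PAS settings.

```python
from math import factorial, floor, log2


def quantize(pmf, n):
    """Largest-remainder rounding of n*pmf to counts summing to n.

    Assumed rule for this sketch; pmf is expected to sum to 1.
    """
    raw = [n * p for p in pmf]
    counts = [floor(r) for r in raw]
    by_remainder = sorted(range(len(pmf)),
                          key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[: n - sum(counts)]:
        counts[i] += 1  # hand out the missing symbols, largest remainder first
    return counts


def rate_loss(pmf, n):
    """H(P_A) - k/n for the quantized composition, with k = floor(log2 |T|)."""
    size = factorial(n)
    for c in quantize(pmf, n):
        size //= factorial(c)
    k = size.bit_length() - 1
    return -sum(p * log2(p) for p in pmf if p > 0) - k / n


def min_block_length(pmf, target, n_max=5000):
    """Smallest n (up to n_max) whose rate loss is at most target, else None."""
    for n in range(len(pmf), n_max + 1):
        if rate_loss(pmf, n) <= target:
            return n
    return None


pmf = [0.4, 0.3, 0.2, 0.1]
n0 = min_block_length(pmf, 0.025)
print(n0, rate_loss(pmf, n0))
```

Because the quantized composition changes with $n$, the rate loss is not strictly monotone in $n$; the search returns the first block length that meets the target.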

In summary, CCDM is foundational for distribution matching in modern coded modulation schemes, enabling high spectral efficiency and flexible rate adaptation via PAS. While asymptotically optimal, its finite-length performance is shaped by combinatorial constraints and operational tradeoffs, motivating a spectrum of modern generalizations that balance rate, entropy, SNR, and computational efficiency (Schulte et al., 2015, Fehenberger et al., 2018, Fehenberger et al., 2020, Gültekin et al., 2022, Wu et al., 2021).