Asymptotic Probability Decoding (APD)

Updated 4 March 2026

APD is a technique that extrapolates token probability trajectories to estimate infinite-model behavior, improving prediction accuracy in language models and error-correcting codes.
It employs parametric curve fitting and energy networks to adjust next-token predictions, thereby refining methods like Contrastive Decoding without extra inference cost.
APD bridges theoretical insights with practical applications across language modeling, cryptography, and iterative decoding, though its complexity limits performance in worst-case scenarios.

Asymptotic Probability Decoding (APD) refers to a family of techniques that leverage limiting or extrapolated probabilistic behaviors of decoders or LLMs in order to achieve improved performance for tasks such as error correction, probabilistic inference, or language modeling. The common thread across distinct applications is the explicit modeling or utilization of asymptotic probability distributions—either extrapolated to infinite model size or derived as blocklength $n\to\infty$ . This article surveys the principal APD paradigms in language modeling, code-based cryptography, and decoding theory.

1. APD in Language Modeling: Motivation and Formalism

Contrastive Decoding (CD) improves LLM (LM) generations by combining the next-token logits of an “expert” LM (ELM) and a smaller “amateur” LM (ALM):

$L^{CD}_c(w)=L^{ELM}_c(w)-\tfrac{1}{T}L^{ALM}_c(w),$

$P^{CD}_c(w)\propto \exp(L^{CD}_c(w)).$

CD typically increases factuality and @@@@1@@@@ but exhibits “obvious blindness,” where high-probability answers under the ALM may be unduly discounted, suppressing the most factual outputs. Asymptotic Probability Decoding (APD) addresses this by modeling the evolution of each token’s predicted probability as a function of LM size. By fitting the probability trajectory $p_i=p(w|c, \theta_{s_i})$ across a range of model sizes $s_1 < \cdots < s_N$ , APD extrapolates to an “infinite” LM: $p_\infty(w|c)$ . This asymptotic distribution is then used—via a fine-tuned ALM′—within the CD architecture, achieving the effect of an extremely large LM without additional inference-time cost (Chang et al., 2024).

2. Theoretical Foundations of Asymptotic Probability Extrapolation

CD is theoretically equivalent to a linear extrapolation in logit-space to a hypothetical LM of size $s_H$ :

$L^{CD}_c(w) = (1-\frac{1}{T})L^{HLM}_c(w),\quad L^{HLM}(w)=L^{ELM}(w) + \frac{(\log s_H-\log s_E)}{(\log s_E-\log s_A)}(L^{ELM}(w)-L^{ALM}(w)).$

APD generalizes this, fitting a parametric model for the probability decay or rise as a function of log-model-size (potentially flipped for monotonicity):

$\widehat p(s) = \widehat P'_\infty + a\,\exp\bigl(-\max\{0,b(s-d)\}\bigr).$

An MLP “energy network” is trained to output $(a,b,d)$ given the observed probability vector, and a regularized ALM′ is fine-tuned so that CD employing ALM′ matches the extrapolated $p_\infty(w|c)$ . Once trained, the inference-time decoder remains

$L^{APD}_c(w) = L^{ELM}_c(w) - \tfrac{1}{T}L^{ALM'}_c(w), \quad P^{APD}_c(w) \propto \exp(L^{APD}_c(w)).$

This reduces to CD at runtime, with no additional evaluation or curve fitting required (Chang et al., 2024).

3. APD Techniques in Error-Correcting Codes

The “statistical decoding” or “probability decoding” method for linear codes is an early form of APD. For a binary $[n,k]$ linear code $C$ and error pattern $e$ (weight $t$ ), one precomputes a large set $\mathcal{L}_i$ of parity-check equations in $C^\perp$ of weight $w$ with $h_i=1$ . For each coordinate $i$ , one computes

$V_i=\sum_{h\in\mathcal{L}_i}\operatorname{sgn}(\varepsilon_1-\varepsilon_0)\langle y,h\rangle,$

and declares $\hat{e}_i=1$ if $V_i$ exceeds a bias-based threshold.

The “bias” $\varepsilon_1-\varepsilon_0$ between the cases $e_i=1$ and $e_i=0$ decays exponentially in $n$ , so the required number of parity-checks is $P_w\asymp 2^{2\alpha(\omega,\tau)n}$ where $\alpha(\omega,\tau)$ is an exponent computable via Krawtchouk polynomial techniques. On the Gilbert–Varshamov bound, the best possible APD exponent is always worse than Prange ISD and its successors:

$2\alpha(\omega,\tau_{GV}) > \alpha_{\text{Prange}}(R)$

for all rates $R$ . Even using all known improvements, APD cannot attain the asymptotic complexity of ISD for worst-case codeword decoding (Debris-Alazard et al., 2017).

4. Density Evolution and Asymptotic Probability in Iterative Decoding

In LDPC and GLDPC code analysis on the BEC and BI-AWGN channels, APD appears as finite-length corrections to the bit-error probability, as well as in density-evolution threshold analysis. The bit error probability for blocklength $n$ after $t$ BP iterations satisfies

$P_b(n,\epsilon,t) = P_b(\infty,\epsilon,t) + \alpha(\epsilon,t)/n + o(1/n),$

with $P_b(\infty,\epsilon,t)$ given by classical density-evolution:

$Q(t) = \epsilon \lambda(P(t-1)), \quad P(t) = 1 - \rho(1-Q(t)), \quad P_b(\infty,\epsilon,t) = \epsilon L(P(t)).$

The $O(1/n)$ correction constant $\alpha(\epsilon,t)$ is explicitly computable as a sum of cycle-free and single-cycle contributions, with efficient algorithms available for $(2,r)$ -regular LDPC ensembles. These corrections are highly accurate in the finite-length, moderate- $n$ regime (0801.0931).

In the GLDPC+APP context, asymptotic thresholds for convergence-error probability are governed by the iterative behavior of LLR densities, and the adoption of Gaussian Mixture Approximation (GMA) for the density-evolution equations allows for near-exact predictions at the computational cost of classical Gaussian Approximation. Exploiting message-invariant subcodes further reduces computational complexity. This analysis extends APD concepts to ensembles where capacity-approaching performance is achieved by tracking asymptotic probability flows across decoding iterations (Chang et al., 2024).

5. APD in Interleaved Reed–Solomon Codes

The “Power–IRS” decoding algorithm for $m$ -interleaved Reed–Solomon codes exemplifies a structural APD approach:

One constructs a non-linear, “powered” key-equation system capturing cross-row error locations.
High-multiplicity (power) constraints push the decoding radius toward the Johnson bound.
Linearization yields a simultaneous Hermite Padé approximation problem whose solvability up to a threshold error weight is established by dimension counting.

Asymptotically, this yields decoding up to relative error weights of $1-R^{m/(m+1)}$ in polynomial time with high probability on random errors:

$\tau_{max} = 1 - R^{m/(m+1)},$

outperforming previous radii for $m > 2$ and all code rates $R>1/3$ (Puchinger et al., 2017).

6. Empirical Performance and Practical Guidelines

Language Modeling

In open-ended factual text generation, such as FactualityPrompts, APD outperforms CD and model-doubling:

System	NEₑᵣ (↓)	Dist-2 (↑)
CD (Pythia 6.9B)	42.21%	48.57
APD	41.13%	47.44

On QA and MRC benchmarks, APD achieves perplexity below the scores of much larger LMs:

Task	PPL 6.9B	CD	APD	PPL 12B
LAMBADA	2.264	2.237	2.132†	2.188
CommonsenseQA	8.380	6.176	5.882†	8.140

APD’s cost at generation is identical to CD, with additional requirements only at training: fitting the curve model and fine-tuning the ALM′. Key practicalities include the need for several LMs trained on similar data and tuning the regularization parameter $\lambda_3$ per LM family (Chang et al., 2024).

Error-Correction Decoding

In statistical decoding, while theoretically appealing as a “probabilistic inference” approach, APD is fundamentally limited by its exponential complexity, which always exceeds that of Prange ISD and more advanced ISD variants on the critical Gilbert–Varshamov bound. Numerical exponents confirm this inferiority across all rates (Debris-Alazard et al., 2017).

In iterative coding, APD-style expansions allow high-precision prediction of the bit error rate even for small $n$ , and GMA improvements close the gap to Monte–Carlo density-evolution estimates with negligible extra cost (0801.0931, Chang et al., 2024).

7. Limitations and Open Directions

Known APD limitations include:

In language modeling, APD does not address exotic temperature regimes ( $T<1$ ), and its behavior on RLHF/fine-tuned LMs is untested. APD requires a collection of LMs with a spectrum of sizes trained on matching data, which may be impractical for some families (Chang et al., 2024).
For linear code decoding, APD cannot outperform ISD on the worst-case instance regime. Its positive results are confined to “random error” or high-probability regimes (Debris-Alazard et al., 2017).
In interleaved RS codes, results guarantee high-probability decoding for random errors; worst-case adversarial error correction remains at classical bounds (Puchinger et al., 2017).
Extensions to broader code families or to domains with strong symmetry-breaking remain the subject of ongoing research.

APD methods represent an interface between asymptotic probabilistic analysis and practical decoding, with demonstrated advantages in language modeling and soft-decision decoding theory, yet with clear theoretical restrictions in code-based cryptanalysis and certain coding-theoretic regimes.