Residual Context Diffusion (RCD)
- The paper demonstrates that recycling non-committed token distributions as residual embeddings significantly improves denoising performance.
- RCD uses entropy-weighted interpolation to merge mask embeddings with computed residuals, enabling refined iterative predictions.
- Empirical results show up to 10 percentage point accuracy gains and reduced denoising steps in benchmark evaluations with minimal overhead.
Residual Context Diffusion (RCD) is a module for block-wise discrete diffusion LLMs (dLLMs) that recycles computation on remasked tokens by converting their discarded predictive distributions into contextual residual embeddings, which are injected back into the model as refined signals for subsequent denoising steps. RCD addresses the inefficiency of standard block-wise diffusion LLMs, in which only the most confident tokens are decoded at each iteration while the rest are remasked, discarding valuable information. By extracting and structuring the information contained in these intermediate predictions, RCD achieves substantial improvements in accuracy on challenging language modeling benchmarks with minimal computational overhead (Hu et al., 30 Jan 2026).
1. Motivation and Conceptual Foundation
Block-wise dLLMs, such as SDAR and LLaDA, operate by iteratively denoising masked blocks of tokens, predicting all positions in parallel and committing the top-$k$ most confident tokens at each step. The standard “remasking” mechanism resets low-confidence positions to the mask token [MASK], discarding their predictive distributions. Empirical analysis reveals that even non-committed positions often contain the correct answer in their early-step top-k predictions, indicating that these distributions encode useful contextual information. The core insight of RCD is to reinterpret the representations of these non-committed tokens as “residual context”: continuous signals that can be fed forward, thereby reducing wasted computation and enhancing the quality of future denoising (Hu et al., 30 Jan 2026).
2. Mathematical Formulation
Let $x = (x_1, \dots, x_L)$ denote the reference token block. At denoising iteration $t$, for each masked position $i$, the model outputs a probability distribution $p_t^{(i)}$ over vocabulary size $|V|$ using embedding matrix $E \in \mathbb{R}^{|V| \times D}$. Let $\mathcal{C}_t$ denote the indices of the top-$k$ most confident predictions, as determined by $\max_v p_t^{(i)}(v)$. The remasked set is $\mathcal{R}_t = \mathcal{M}_t \setminus \mathcal{C}_t$, where $\mathcal{M}_t$ is the set of currently masked positions.
For each $i \in \mathcal{R}_t$, RCD computes a $D$-dimensional residual vector:

$$r_t^{(i)} = \sum_{v=1}^{|V|} p_t^{(i)}(v)\, E_v$$

The strength of each residual is modulated using normalized Shannon entropy:

$$w_t^{(i)} = 1 - \frac{H\big(p_t^{(i)}\big)}{\log |V|}, \qquad H(p) = -\sum_{v} p(v)\log p(v)$$

At the following iteration, masked input embeddings are interpolated:

$$\tilde{e}_{t+1}^{(i)} = \big(1 - w_t^{(i)}\big)\, e_{\mathrm{[MASK]}} + w_t^{(i)}\, r_t^{(i)}$$

This mechanism allows the model to attend over a block in which unresolved tokens are enriched with learned contextual priors derived from past predictions.
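The interpolation above can be sketched for a single remasked position. This is a minimal numpy illustration of the mechanism as described, not the paper's implementation; the function name and the `1e-12` numerical-stability term are assumptions.

```python
import numpy as np

def residual_context(probs, embed, mask_embed):
    """Entropy-weighted residual context embedding (illustrative sketch).

    probs:      (V,) predictive distribution at a remasked position
    embed:      (V, D) token embedding matrix E
    mask_embed: (D,) embedding of the [MASK] token
    """
    V = probs.shape[0]
    # Residual: probability-weighted sum over the embedding table.
    r = probs @ embed
    # Normalized Shannon entropy in [0, 1]; low entropy -> high confidence.
    H = -np.sum(probs * np.log(probs + 1e-12)) / np.log(V)
    w = 1.0 - H
    # Interpolate between the mask embedding and the residual.
    return (1.0 - w) * mask_embed + w * r
```

A peaked (confident) distribution drives the input embedding toward the predicted token's embedding, while a near-uniform distribution leaves it close to the plain mask embedding, so uninformative predictions inject almost nothing.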
3. Two-Stage Decoupled Training Architecture
End-to-end differentiable training through the entire multi-step diffusion loop presents memory bottlenecks due to backpropagation through time. RCD circumvents this via a two-stage “hint” training scheme:
- Stage 1: A lightweight reference model is fine-tuned with the standard masked diffusion objective to generate high-quality proxy residuals and entropy scores.
- Stage 2: The target model (initialized from the same base checkpoint) is trained by sampling masked inputs, running the frozen reference model to compute proxy residuals, constructing the RCD input embeddings via the entropy-weighted interpolation, and optimizing cross-entropy over masked positions without propagating gradients through the residual computation. This decoupled approach avoids the memory overhead of backpropagation through time while preserving the utility of the residual signal.
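One Stage-2 step can be sketched as a data-flow diagram in code. The two models are replaced by toy linear stubs (`ref_model`, `target_logits`) purely to show the pipeline: the reference pass is treated as a constant (no gradients flow through it), and the loss is taken only over masked positions. All names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D, L = 16, 8, 6                       # toy vocab, embed dim, block length
E = rng.standard_normal((V, D)) * 0.1    # shared embedding table (stand-in)
mask_vec = np.zeros(D)                   # [MASK] embedding (stand-in)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ref_model(emb):        # stand-in for the frozen reference model
    return softmax(emb @ E.T)            # (L, V) proxy distributions

def target_logits(emb):    # stand-in for the trainable target model
    return emb @ E.T                     # (L, V)

# Sample a masked block (here: all positions masked, for simplicity).
tokens = rng.integers(0, V, size=L)
masked = np.ones(L, dtype=bool)
emb_in = np.tile(mask_vec, (L, 1))

# Reference pass -- treated as a constant; no gradients flow through it.
p_ref = ref_model(emb_in)                              # (L, V)
r = p_ref @ E                                          # residuals, (L, D)
H = -np.sum(p_ref * np.log(p_ref + 1e-12), axis=-1) / np.log(V)
w = 1.0 - H                                            # (L,) entropy weights
rcd_in = (1 - w[:, None]) * mask_vec + w[:, None] * r  # RCD input embeddings

# Target pass: cross-entropy over masked positions only.
logp = np.log(softmax(target_logits(rcd_in)) + 1e-12)
loss = -logp[np.arange(L), tokens][masked].mean()
```

In an actual training setup the reference pass would run under a no-grad context, so only the target model's parameters receive gradients from `loss`.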
4. Algorithmic Pipeline and Runtime Integration
The RCD inference process wraps the standard block-wise diffusion loop, operating as follows:
- Initialize with a fully masked block; compute initial residuals and entropy scores from an initial forward pass of the model.
- For each of $T$ denoising iterations:
  - a. Construct input embeddings for each position using the entropy-weighted interpolation between the mask embedding and the computed residual.
  - b. Run the model to obtain new logits, distributions, and confidences.
  - c. Commit the top-$k$ tokens and remask the rest.
  - d. Update residuals for newly remasked positions using a softmax with temperature $\tau$.
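The loop above can be condensed into a short numpy sketch. This is an illustrative reconstruction under stated assumptions, not the released implementation: `model` is any callable returning per-position logits, ties are broken by index order, and uncommitted positions are marked with `-1`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rcd_decode(model, E, mask_vec, L, k, T, tau=1.0):
    """RCD-style block-wise denoising loop (illustrative sketch).

    model(emb) -> (L, V) logits; E: (V, D) embedding table; mask_vec: (D,).
    Commits the top-k most confident masked positions per step; remasked
    positions carry entropy-weighted residual context into the next step.
    """
    V, D = E.shape
    out = np.full(L, -1)                     # -1 = not yet committed
    emb = np.tile(mask_vec, (L, 1))
    for _ in range(T):
        masked = out == -1
        if not masked.any():
            break
        logits = model(emb)                  # step (b)
        p = softmax(logits)
        conf = p.max(axis=-1)
        conf[~masked] = -np.inf              # rank only masked positions
        commit = np.argsort(conf)[-k:]       # step (c): top-k confident
        commit = commit[np.isfinite(conf[commit])]
        out[commit] = p[commit].argmax(axis=-1)
        emb[commit] = E[out[commit]]
        # Step (d): tempered distribution -> residual for remasked slots.
        rem = out == -1
        q = softmax(logits[rem] / tau)
        r = q @ E
        H = -np.sum(q * np.log(q + 1e-12), axis=-1) / np.log(V)
        w = (1.0 - H)[:, None]
        emb[rem] = (1 - w) * mask_vec + w * r  # step (a), next iteration
    return out
```

With $T \ge \lceil L/k \rceil$ the block is fully committed; the residual update costs one matrix product over the embedding table per step.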
The computational overhead consists of a weighted sum over the codebook to generate the residuals and a linear interpolation to inject them, resulting in less than 5% relative overhead per iteration.
5. Empirical Evaluation and Benchmarking
RCD was evaluated on two primary dLLM paradigms:
- SDAR (semi-autoregressive, block-wise): 4B/8B parameters, block sizes 32/64, sequence length 16,384, across GSM8K, MATH500, AIME24/25.
- LLaDA (bidirectional diffusion): 8B parameters, length up to 1,024, tasks GSM8K and MinervaMath.
Key results include:
| Model / Metric | Baseline (%) | RCD (%) | Gain (Δ) |
|---|---|---|---|
| SDAR-8B-b64 on GSM8K | 82.87 | 88.70 | +5.8 |
| SDAR-8B-b64 on MATH500 | 64.2 | 73.6 | +9.4 |
| SDAR-8B-b64 on AIME25 | 9.79 | 19.79 | +10.0 |
| LLaDA-8B on GSM8K | 75.74 | 78.09 | +2.35 |
| LLaDA-8B on MinervaMath | 31.10 | 37.00 | +5.9 |
Additionally, RCD enables up to 4–5× fewer denoising steps at matched accuracy. Throughput-matched evaluation (using Fast-dLLM and D2F) shows only a 3–7% reduction in tokens per second, set against consistent 2–9 point accuracy gains. Extending training epochs for baselines yields only marginal improvement; RCD surpasses such extensions, indicating that information loss from remasking is a more critical bottleneck than training data volume (Hu et al., 30 Jan 2026).
6. Comparative Analysis and Ablations
RCD consistently outperforms vanilla dLLMs by 5–10 percentage points in accuracy and dominates Pareto frontiers (accuracy vs. tokens-per-step). When compared to “Loopholing” (reuse of hidden states), RCD attains coherent output and superior scores even under tight resource constraints (e.g., SDAR-4B-b64: 85.9% with RCD on GSM8K, versus incoherent output with Loopholing). Ablation studies reveal that alternative weighting methods (fixed linear, confidence-based, inverse) underperform relative to entropy-based weighting, which optimally balances accuracy and latency. Extending baseline training epochs does not match the efficacy of RCD, substantiating the importance of recycling residual contextual information over relying solely on increased data exposure.
7. Practical Considerations and Future Directions
RCD introduces minimal additional computational demand per inference iteration while frequently reducing total wall-clock time via higher effective tokens-per-step. This method can be efficiently retrofitted to an existing dLLM with only approximately one billion additional training tokens, enabling practical adoption. Empirical results across long and short chain-of-thought reasoning benchmarks underscore its robustness and scalability. A plausible implication is that further exploration of contextual residual learning could unlock additional efficiency gains in other classes of masked or parallel generative models.