LLaDA2.1: Diffusion LLM with Editable Decoding

Updated 10 February 2026
  • LLaDA2.1 is a discrete diffusion large language model that integrates an editable draft-and-edit decoding paradigm with a block-diffusion Transformer backbone.
  • It employs configurable dual-threshold decoding with Speedy Mode for rapid output and Quality Mode for precision, balancing speed and generation quality.
  • The model introduces an ELBO-based reinforcement learning alignment framework to enhance task performance and error correction in complex generation scenarios.

LLaDA2.1 is a discrete diffusion LLM (dLLM) architecture that augments block-diffusion Transformers with an editable “draft-and-edit” decoding paradigm and introduces the first large-scale reinforcement learning (RL) alignment framework designed specifically for diffusion LLMs. Building on the 100B-parameter block-diffusion backbone of LLaDA2.0, LLaDA2.1 addresses the longstanding tension between decoding speed and generation quality by integrating configurable threshold-based decoding with two operational modes, Speedy Mode (S-Mode) and Quality Mode (Q-Mode), and by deploying an ELBO-based policy optimization approach for alignment. LLaDA2.1 is released in two variants, LLaDA2.1-Flash (100B parameters) and LLaDA2.1-Mini (16B parameters), demonstrating high task performance and substantially accelerated decoding, especially on code generation benchmarks (Bie et al., 9 Feb 2026).

1. Block-Diffusion Backbone and Model Architecture

LLaDA2.1 retains the core masked-diffusion Transformer structure from LLaDA2.0. Input sequences are partitioned into $B$ non-overlapping blocks, and at each discrete diffusion timestep $t$, the model performs full self-attention within each block while attending only to preceding blocks via a block-causal mask $\mathcal{M}$. This blockwise factorization enables inter-block computational parallelism while maintaining causal dependency along the sequence prefix.
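
The block-causal attention pattern described above can be sketched as a boolean mask. This is an illustrative construction only (the block size and exact masking details of LLaDA2.1 are not specified here):

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask where position i may attend to position j iff
    j's block index <= i's block index: full attention within a block,
    causal attention across blocks. (Illustrative sketch.)
    """
    blocks = np.arange(seq_len) // block_size        # block index per position
    return blocks[None, :] <= blocks[:, None]        # shape (seq_len, seq_len)

mask = block_causal_mask(seq_len=6, block_size=2)
# Positions 0-1 attend within block 0; positions 2-3 attend to blocks 0-1;
# positions 4-5 attend to all three blocks.
```

In an attention layer this mask would be applied by setting scores at `False` entries to a large negative value before the softmax.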

A central architectural innovation is the editable “draft-and-edit” paradigm. In contrast to standard absorbing-state diffusion, which only allows transitions $[\text{MASK}] \rightarrow \text{token}$, LLaDA2.1 introduces a Token-to-Token (T2T) edit operation, allowing retroactive correction of low-confidence placements. This is formalized as follows: let $v_t^i = \arg\max_v p_\theta(v \mid x_t)$ denote the model's top candidate at position $i$ at time $t$. Two index sets, $\Gamma_t$ and $\Delta_t$, are defined at each step by separate confidence thresholds $\tau_{\text{M2T}}$ (mask-to-token) and $\tau_{\text{T2T}}$ (token-to-token):

$$\Gamma_t = \{\, i \mid x_t^i = [\text{MASK}] \ \text{and}\ p_\theta(v_t^i \mid x_t) > \tau_{\text{M2T}} \,\}$$

$$\Delta_t = \{\, i \mid x_t^i \neq v_t^i \ \text{and}\ p_\theta(v_t^i \mid x_t) > \tau_{\text{T2T}} \,\}$$

All positions in $\Gamma_t \cup \Delta_t$ are updated in parallel.
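
One parallel draft-and-edit update can be sketched as follows. The `MASK_ID` sentinel is a hypothetical choice, and restricting $\Delta_t$ to already-decoded positions is an illustrative reading of the definitions above:

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel id for the [MASK] token

def edit_step(x, probs, tau_m2t, tau_t2t):
    """One parallel draft-and-edit update (illustrative sketch).

    x:     (L,) current token ids, MASK_ID where still masked
    probs: (L, V) model distribution p_theta(v | x_t) at each position
    """
    top_tok = probs.argmax(axis=-1)                   # v_t^i
    top_p = probs.max(axis=-1)                        # p_theta(v_t^i | x_t)
    gamma = (x == MASK_ID) & (top_p > tau_m2t)        # mask-to-token set Gamma_t
    delta = (x != MASK_ID) & (x != top_tok) & (top_p > tau_t2t)  # T2T edit set Delta_t
    x = x.copy()
    x[gamma | delta] = top_tok[gamma | delta]         # update all selected positions in parallel
    return x
```

Iterating `edit_step` until no masked positions remain (and no edits fire) yields the full decoding loop; the thresholds are the same levers exposed as S-Mode and Q-Mode at inference.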

Dual-stream training is employed: the “drafting stream” teaches the model to predict $[\text{MASK}] \rightarrow \text{token}$, and the “editing stream” teaches $\text{noisy token} \rightarrow \text{correct token}$. The total training loss over both Continual Pre-Training (CPT) and Supervised Fine-Tuning (SFT) is a weighted combination of cross-entropy objectives:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t \sim \text{schedule}} \left[ w_M\, \ell_{\text{M2T}}(\theta) + w_T\, \ell_{\text{T2T}}(\theta) \right]$$

Here, $w_M$ and $w_T$ are set so that both mask-filling and error-correction scenarios occur frequently during training. The model also employs a “Multi-turn Forward” augmentation that chains editing steps during training to further strengthen its editing capacity.
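
The weighted combination itself is simple; a minimal sketch (the weight values are assumptions, not the paper's):

```python
import numpy as np

def diffusion_loss(ce_m2t, ce_t2t, w_m=1.0, w_t=0.5):
    """Weighted dual-stream objective (sketch; w_m, w_t are assumed values).

    ce_m2t: per-example cross-entropy on masked positions (drafting stream)
    ce_t2t: per-example cross-entropy on corrupted positions (editing stream)
    """
    return np.mean(w_m * np.asarray(ce_m2t) + w_t * np.asarray(ce_t2t))
```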

2. Configurable Dual-Threshold Decoding: Speedy and Quality Modes

At inference, the thresholds $(\tau_{\text{M2T}}, \tau_{\text{T2T}})$ become user-configurable levers central to LLaDA2.1’s flexibility. The two main operational regimes are:

  • Speedy Mode (S-Mode): Employs a low mask threshold $\tau_{\text{M2T}}^S$ to aggressively “draft” by filling many positions per step, and a moderate $\tau_{\text{T2T}}^S$ to restrict edits to high-confidence swaps. This regime yields maximal tokens-per-forward (TPF) and throughput, with only slight reductions in benchmarked output quality.
  • Quality Mode (Q-Mode): Both thresholds are raised ($\tau_{\text{M2T}}^Q > \tau_{\text{M2T}}^S$, $\tau_{\text{T2T}}^Q > \tau_{\text{T2T}}^S$) so that only high-confidence actions are taken, closely approximating or matching LLaDA2.0’s quality but requiring more steps.

Empirical results show that moving from Q-Mode to S-Mode approximately doubles the TPF (from ≈3.1 to ≈5.9 on the 100B model) while causing only a negligible (∼0.1–0.2 absolute) average score drop across benchmarks.
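
In practice the mode switch amounts to a pair of threshold presets. The numeric values below are hypothetical placeholders (the source does not state the actual thresholds); only the ordering $\tau^Q > \tau^S$ is taken from the text:

```python
# Hypothetical threshold presets illustrating the S-Mode / Q-Mode ordering.
DECODE_MODES = {
    "speedy":  {"tau_m2t": 0.30, "tau_t2t": 0.90},  # aggressive drafting, conservative edits
    "quality": {"tau_m2t": 0.90, "tau_t2t": 0.95},  # only high-confidence actions
}

def thresholds(mode: str):
    """Return (tau_m2t, tau_t2t) for the requested decoding mode."""
    cfg = DECODE_MODES[mode]
    return cfg["tau_m2t"], cfg["tau_t2t"]
```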

3. Reinforcement Learning Alignment via ELBO-Based Block-Level Policy Optimization

LLaDA2.1 implements the first large-scale RL alignment pipeline designed for discrete diffusion LLMs, targeting advanced instruction-following and reasoning alignment. The RL stage utilizes ELBO-based Block-level Policy Optimization (EBPO). Due to the intractability of exact sequence likelihood computation in diffusion models, LLaDA2.1 uses the ELBO as a surrogate.

The likelihood ratio between updated and old policies is approximated as:

$$\log \rho(y \mid x) \approx \sum_{n=1}^{N} w_n \sum_{b=1}^{B} \left[ \log p_\theta(y^b \mid z_n, x; \mathcal{M}) - \log p_{\theta_{\text{old}}}(y^b \mid z_n, x; \mathcal{M}) \right]$$

where $z_n$ combines a partial sample at step $t_n$ with the terminal state $y_0$, and $\mathcal{M}$ is the block-causal mask. The RL objective is:

$$\mathcal{J}_{\text{EBPO}}(\theta) = \mathbb{E}_{x,\, y \sim \pi_{\theta_{\text{old}}}} \left[ \min \left( \rho(y \mid x)\, \hat{A},\ \mathrm{clip}\big(\rho(y \mid x),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, \hat{A} \right) \right]$$

with $\hat{A}$ as the estimated advantage. The final training objective combines the diffusion pretraining and policy optimization terms:

$$\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}_{\text{diff}}(\theta) - \lambda_{\text{RL}}\, \mathcal{J}_{\text{EBPO}}(\theta)$$

where $\lambda_{\text{RL}}$ controls the balance between fidelity and alignment. Efficient blockwise vectorization allows this RL framework to scale to 100B parameters and long contexts.
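
The clipped surrogate over the ELBO-based ratio can be sketched for a single sampled response. This is a PPO-style illustration under the definitions above, not the paper's implementation:

```python
import numpy as np

def ebpo_objective(logp_new, logp_old, weights, advantage, eps=0.2):
    """Clipped surrogate on an ELBO-based likelihood ratio (illustrative sketch).

    logp_new, logp_old: (N, B) blockwise log-probs log p(y^b | z_n, x; M)
                        under the new and old policies
    weights:            (N,) ELBO weights w_n over sampled timesteps
    advantage:          scalar estimated advantage A_hat
    """
    # log rho = sum_n w_n * sum_b [log p_new - log p_old]
    log_rho = np.sum(weights[:, None] * (logp_new - logp_old))
    rho = np.exp(log_rho)
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic bound on the policy improvement
    return min(rho * advantage, clipped * advantage)
```

In training this quantity would be averaged over a batch of prompts and rollouts and maximized, i.e. subtracted from $\mathcal{L}_{\text{diff}}$ with weight $\lambda_{\text{RL}}$ as in the total objective.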

4. Empirical Performance and Benchmark Results

LLaDA2.1 demonstrates high task performance and significantly accelerated decoding, especially in code generation. Across 33 benchmarks spanning Knowledge, Reasoning, Coding, Math, and Agentic Alignment, the two main variants—LLaDA2.1-Flash (100B) and LLaDA2.1-Mini (16B)—achieve the following:

| Benchmark | Flash w/o Quant (TPS / ΔScore) | Flash w/ Quant (TPS / ΔScore) | Mini w/o Quant (TPS / ΔScore) | Mini w/ Quant (TPS / ΔScore) |
|---|---|---|---|---|
| HumanEval+ | 746.66 / –3.04 | 891.74 / –0.61 | 1496.67 / –0.61 | 1586.93 / –0.61 |
| MBPP+ | 639.47 / –1.85 | 761.38 / +1.85 | 1286.96 / +1.85 | 1303.96 / +1.85 |
| CRUXEval-O | 550.09 / –0.24 | 645.72 / –1.00 | 980.82 / –1.00 | 1063.94 / –1.00 |
| BigCodeBench-F | 691.14 / +1.06 | 801.48 / –0.09 | 1220.40 / –0.09 | 1307.45 / –0.09 |
| LiveCodeBench | 571.60 / –1.76 | 663.39 / +1.98 | 1015.82 / +1.98 | 1102.92 / +1.98 |

On the 100B Flash model in S-Mode with quantization, LLaDA2.1 achieves up to 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench—over 2× faster than LLaDA2.0-Flash—while losing less than 2 points of pass@1. The 16B Mini model achieves even higher throughput with minimal score degradation. Under Q-Mode, the Flash variant typically recovers or exceeds LLaDA2.0 performance at slightly elevated TPF, indicating the efficacy of editable decoding.

5. Broader Implications and Future Directions

LLaDA2.1 establishes a continuum between decoding speed and output quality for discrete diffusion LLMs via its joint M2T+T2T decoding scheme. Adjustable thresholds enable practitioners to optimize for throughput or rigor as dictated by downstream requirements. The introduction of EBPO-based RL alignment enhances alignment with complex human instructions and code correctness, reducing the gap between utility and interpretability for diffusion-based language generation.

A plausible implication is that this paradigm, coupling aggressive parallel generation with targeted retroactive self-correction, foreshadows future self-improving and human-aligned generative systems. The empirical finding that T2T editability can boost both speed and quality suggests further research into mixed-stream decoding workflows. Moreover, clearly surpassing autoregressive inference speeds without sacrificing state-of-the-art quality highlights discrete diffusion as a viable alternative to traditional Transformer decoding (Bie et al., 9 Feb 2026).
