Two-Stage Diffusion-to-AR Alignment

Updated 19 December 2025
  • The paper introduces a two-stage Diffusion-to-AR (D2A) alignment that trains a discrete diffusion model to mimic AR continuation, enabling efficient speculative decoding.
  • Stage I distills AR-style continuation behavior; Stage II applies targeted refinement at draft-acceptance boundaries, significantly improving block acceptance rates.
  • Empirical results show up to 5.54× speedup and longer accepted token blocks, while exact AR verification guarantees lossless decoding.

Two-stage Diffusion-to-Autoregressive (Diffusion-to-AR, or D2A) alignment is a training paradigm that aligns a discrete diffusion LLM (dLLM) with a target autoregressive (AR) model for efficient speculative decoding. The procedure is central to the DEER framework, which drafts long blockwise continuations with the dLLM in a single step and verifies each proposal with an exact AR filter, ensuring lossless decoding and significant acceleration over conventional AR decoding and AR-drafter-based speculative methods (Cheng et al., 17 Dec 2025).

1. Objectives and Speculative Decoding Context

Diffusion-to-AR alignment addresses a fundamental efficiency constraint in LLM systems: the sequential latency of AR decoding. Traditional speculative decoding improves throughput with a draft-then-verify mechanism, in which a drafter proposes token blocks that an AR model then verifies for correctness. When both drafting and verification use AR models, however, two intrinsic problems limit speedups: (1) step-wise uncertainty accumulation that reduces acceptance of draft blocks, and (2) the inherently sequential decoding of the AR drafter itself. By leveraging dLLMs, which sample blocks in parallel, and aligning them closely to the AR target, D2A alignment resolves both issues, enabling single-step blockwise drafting with high compatibility for AR verification (Cheng et al., 17 Dec 2025).
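To make the protocol concrete, the sketch below reconstructs the overall draft-then-verify loop in Python. It is illustrative only, not the authors' implementation: `draft_block` and `verify` are assumed callables standing in for the dLLM drafter and the AR verifier.

```python
from typing import Callable, List, Tuple

def speculative_decode(
    prefix: List[int],
    draft_block: Callable[[List[int], int], List[int]],   # dLLM: (context, k) -> k-token draft
    verify: Callable[[List[int], List[int]], Tuple[List[int], int]],  # AR: (context, draft) -> (accepted, ar_token)
    block_len: int = 32,
    max_len: int = 256,
) -> List[int]:
    """Draft-then-verify loop: the dLLM proposes a whole block in one
    denoising pass; the AR target keeps the longest acceptable prefix
    and always contributes one token of its own (a correction on
    rejection, or its next token when the whole block is accepted),
    so the output matches pure AR decoding."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        draft = draft_block(tokens, block_len)   # one-step blockwise proposal
        accepted, ar_token = verify(tokens, draft)  # exact AR filter
        tokens.extend(accepted)
        tokens.append(ar_token)
    return tokens
```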

2. Two-Stage Training Pipeline

The D2A alignment pipeline consists of two distinct yet complementary stages: Stage I teaches a dLLM (pretrained in discrete space) to match the AR continuation style, and Stage II concentrates its modeling capacity on the tokens most critical to AR verification.

2.1 Stage I: AR-Style Continuation Distillation

Goal: Remove the inherent global denoising bias of vanilla diffusion LLMs, enforcing an AR-conditioned continuation regime. Given a prefix $x_{1:\ell}$ and a [SEP] token, the dLLM is trained to predict the suffix exactly as the AR model $P_{\mathrm{AR}}$ would.

Procedure:

  • For each example $A = (a_1, \ldots, a_L)$, select a random truncation position $\ell \sim \mathrm{Uniform}[1, L-1]$.
  • Generate $x^0$ by preserving $a_{1:\ell}$, masking $a_{\ell+1:L}$, and appending [SEP].
  • Sample a diffusion noise step $t \sim \mathrm{Uniform}\{1, \ldots, T\}$ and a noisy input $x^t \sim q(x^t \mid x^0)$.
  • Update the dLLM parameters to recover the masked suffix given $x^t$.

Loss Function: $\mathcal{L}_{\mathrm{stage1}} = -\,\mathbb{E}_{t,x^0,x^t} \sum_{i=\ell+1}^{L} \mathbf{1}[x^0_i = M]\, \log p_e(x^0_i \mid x^t)$

Only the diffusion head is finetuned; the target AR model remains frozen.
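A minimal sketch of one Stage I training step follows, assuming hypothetical interfaces for the dLLM, its forward noising kernel, and the special-token ids (none of these names come from the paper; the loss targets the ground-truth suffix tokens at the masked positions):

```python
import torch
import torch.nn.functional as F

def stage1_loss(dllm, forward_noise, answer, mask_id, sep_id, num_steps):
    """Hypothetical Stage I step: mask a random suffix, insert [SEP],
    noise with the base dLLM's forward kernel, and take cross-entropy
    only over the originally masked suffix positions."""
    L = answer.numel()                                # answer: LongTensor [L]
    ell = int(torch.randint(1, L, (1,)))              # truncation ell in [1, L-1]
    prefix, suffix = answer[:ell], answer[ell:]
    sep = torch.tensor([sep_id], dtype=answer.dtype)
    x0 = torch.cat([prefix, sep, torch.full_like(suffix, mask_id)])
    t = int(torch.randint(1, num_steps + 1, (1,)))    # t ~ Uniform{1, ..., T}
    xt = forward_noise(x0, t)                         # x^t ~ q(x^t | x^0)
    logits = dllm(xt.unsqueeze(0), t).squeeze(0)      # [L + 1, vocab]
    return F.cross_entropy(logits[ell + 1:], suffix)  # masked-suffix positions only
```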

2.2 Stage II: Prefix-Conditioned Scribe Refinement

Goal: Enhance local fidelity at draft-acceptance boundaries, where AR verification sensitivity peaks.

Procedure:

  • For each answer, select $R \sim \mathrm{Uniform}(1, R_{\mathrm{max}})$ (e.g., $R_{\mathrm{max}} = 96$).
  • The context is $a_{1:(L-R)}$; mask only the final $R$ tokens (plus [SEP]).
  • Exponentially weighted loss:

$W_i = Q^{R-i}, \quad i = 1, \ldots, R$

  • Train on the suffix positions with weights $W_i$ that emphasize positions nearest the prefix boundary.

Loss Function: $\mathcal{L}_{\mathrm{stage2}} = -\,\mathbb{E}_{t,x^0,x^t} \sum_{i=L-R+1}^{L} W_{i-(L-R)}\, \mathbf{1}[x^0_i = M]\, \log p_e(x^0_i \mid x^t)$
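In the same sketch style, a Stage II step differs from Stage I only in masking a fixed-size tail and applying the exponential weights $W_i = Q^{R-i}$ (the weighted-mean normalization below is a sketch choice, not from the paper):

```python
import torch
import torch.nn.functional as F

def stage2_loss(dllm, forward_noise, answer, mask_id, sep_id,
                num_steps, r_max=96, q_weight=1.01):
    """Hypothetical Stage II step: mask only the final R tokens and
    weight each position's loss by W_i = Q**(R - i), so the token
    right after the prefix boundary receives the largest weight."""
    L = answer.numel()
    R = int(torch.randint(1, min(r_max, L - 1) + 1, (1,)))  # R ~ Uniform(1, R_max)
    prefix, suffix = answer[:L - R], answer[L - R:]
    sep = torch.tensor([sep_id], dtype=answer.dtype)
    x0 = torch.cat([prefix, sep, torch.full_like(suffix, mask_id)])
    t = int(torch.randint(1, num_steps + 1, (1,)))
    xt = forward_noise(x0, t)                               # x^t ~ q(x^t | x^0)
    logits = dllm(xt.unsqueeze(0), t).squeeze(0)
    ce = F.cross_entropy(logits[L - R + 1:], suffix, reduction="none")  # [R]
    i = torch.arange(1, R + 1, dtype=ce.dtype, device=ce.device)
    w = q_weight ** (R - i)                                 # W_i = Q^(R-i), decaying in i
    return (w * ce).sum() / w.sum()                         # normalization: sketch choice
```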

In both stages, one-step denoising from the fully masked state at inference yields a complete block of $k$ tokens in parallel.
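At inference time this reduces to a single forward pass over prefix + [SEP] + $k$ MASK tokens. A hedged sketch (greedy readout shown for simplicity; the actual sampling rule follows the base dLLM's configuration):

```python
import torch

@torch.no_grad()
def draft_block(dllm, prefix_ids, mask_id, sep_id, k=32, num_steps=1000):
    """Hypothetical one-step blockwise draft: build x^T = prefix + [SEP]
    + k MASK tokens and denoise it in a single forward pass, reading
    off all k positions in parallel."""
    sep = torch.tensor([sep_id], dtype=prefix_ids.dtype)
    masks = torch.full((k,), mask_id, dtype=prefix_ids.dtype)
    xT = torch.cat([prefix_ids, sep, masks])                # fully masked block
    logits = dllm(xT.unsqueeze(0), num_steps).squeeze(0)    # one denoising pass at t = T
    return logits[-k:].argmax(dim=-1)                       # k-token proposal
```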

3. Architectural and Algorithmic Details

The pipeline starts from a standard pretrained discrete dLLM (e.g., Open-dLLM’s 0.5B checkpoint). Only the diffusion head is updated in both stages. Training samples can be drawn from the target AR model ($P_{\mathrm{AR}}$) or any instruction-tuning corpus.

  • The forward kernel and noise schedule are kept consistent across training and inference, mirroring the base dLLM’s original configuration (e.g., uniform $\beta$ schedule, multinomial noise).
  • One-step blockwise decoding exploits the independence of positions under the dLLM draft (sampled as $x^T = \texttt{MASK}$ and denoised to $x^0$ in a single pass) rather than propagating errors autoregressively.
  • At inference, the proposal is verified by the AR model, which either accepts or rejects each token sequentially, preserving the AR distribution exactly and guaranteeing lossless speculative decoding (Cheng et al., 17 Dec 2025).
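The token-level accept/reject test is the standard lossless speculative-sampling rule; the sketch below assumes per-position AR and draft distributions over the proposed block have already been computed (a single parallel AR forward pass over prefix + draft suffices for the former):

```python
import torch

@torch.no_grad()
def ar_verify(ar_probs, draft_probs, draft):
    """Standard lossless accept/reject test, applied to a dLLM draft.
    ar_probs, draft_probs: [k, vocab] per-position distributions over
    the proposed block; draft: [k] proposed token ids."""
    for i in range(draft.numel()):
        p, q = ar_probs[i], draft_probs[i]
        # Keep token i with probability min(1, P_AR(y_i) / q(y_i)).
        if torch.rand(()) < torch.clamp(p[draft[i]] / q[draft[i]], max=1.0):
            continue
        # First rejection: resample from the residual max(P_AR - q, 0),
        # which makes the overall output distribution exactly P_AR.
        residual = torch.clamp(p - q, min=0.0)
        return draft[:i], int(torch.multinomial(residual / residual.sum(), 1))
    return draft, None  # whole block accepted; next token comes from P_AR as usual
```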

4. Empirical Outcomes and Quantitative Impact

The two-stage D2A alignment in DEER yields substantial empirical improvements over prior speculative decoding frameworks such as EAGLE-3. Empirical results, with Qwen3-30B-A3B as the baseline AR model, are summarized below:

Framework    Max Block Length    HumanEval Speedup    Avg. Acceptance Length
EAGLE-3      10 tokens           2.41×                3.21
DEER         32 tokens           5.54×                6.58

Stage II provides further acceptance-rate improvements, especially for positions nearest the prefix boundary. The following table benchmarks average acceptance lengths before and after Stage II refinement:

Benchmark        Without Stage II    With Stage II
MBPP             4.74                4.87
CodeAlpacaPy     3.47                4.04
HumanEval        5.38                6.58
LiveCodeBench    3.87                5.03

These results demonstrate that D2A alignment not only enables substantially longer accepted blocks but also yields significant speedups in practical LLM inference (Cheng et al., 17 Dec 2025).

5. Theoretical Analysis

The necessity of two-stage alignment arises from the underlying mismatch between global denoising (diffusion) and local AR continuation conditioning. Stage I corrects the dLLM's bias by forcing the prefix plus [SEP] to be treated as "past," aligning the dLLM draft distribution $q_e(y_i \mid x_{1:j})$ with the AR conditionals $P_{\mathrm{AR}}(y_i \mid x_{1:j}, y_{1:i-1})$.

Stage II focuses model capacity on the most verification-sensitive positions via exponentially decaying loss weights, which is crucial for accurate speculative acceptance. In AR drafters, errors in initial tokens propagate and reduce blockwise acceptance through uncertainty accumulation. In contrast, the one-step diffusion draft keeps each position's KL divergence from the AR conditional bounded even as block length grows, preventing progressive acceptance collapse. This mechanism enables DEER to maintain high acceptance rates for blocks of up to 32 tokens while remaining lossless (Cheng et al., 17 Dec 2025).
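This can be made concrete with the standard speculative-sampling identity (a general property of the min(1, P/q) acceptance test, not a derivation from this paper): the per-position acceptance probability equals one minus the total variation distance between draft and target conditionals, so a bounded per-position divergence (e.g., via Pinsker's inequality from the bounded KL above) directly implies a non-collapsing acceptance rate as the block grows:

```latex
% Per-position acceptance probability of the min(1, P/q) test,
% with c_i denoting the verified context at draft position i:
\Pr[\mathrm{accept}_i]
  = \sum_{y} \min\!\big( P_{\mathrm{AR}}(y \mid c_i),\; q_e(y \mid c_i) \big)
  = 1 - \mathrm{TV}\!\big( P_{\mathrm{AR}}(\cdot \mid c_i),\, q_e(\cdot \mid c_i) \big)
```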

6. Limitations and Prospective Advances

Stage II’s weighting parameter $Q$ (e.g., $Q = 1.01$) requires careful selection for stable training; aggressive weighting may destabilize convergence (as documented in Figure 1 of (Cheng et al., 17 Dec 2025)). Additionally, the lack of efficient key–value (KV) cache support in current inference frameworks for discrete diffusion models limits realized batch throughput.

Future directions include integrating KV-cache optimized diffusion inference (e.g., Fast-dLLM, dInfer), developing adaptive noise schedules responsive to varying prefix lengths, extending D2A alignment to multi-step or hybrid diffusion-AR generation regimes, and experimenting with alternative weighting schemas or masking curricula (e.g., linear or sinusoidal weights) in Stage II (Cheng et al., 17 Dec 2025).

A plausible implication is the broader applicability of D2A alignment for efficient lossless speculative decoding in emerging LLM architectures, conditional on alignment mechanisms maintaining exact AR output distributions under increasingly parallel drafting paradigms.
