Two-Stage Diffusion-to-AR Alignment
- The paper introduces a two-stage D2A alignment that trains a discrete diffusion model to mimic AR continuation, enabling efficient speculative decoding.
- It employs Stage I for AR-style continuation distillation and Stage II for targeted refinement of draft boundaries, significantly improving block acceptance rates.
- Empirical results show up to a 5.54× speedup and longer accepted token blocks, while exact AR verification keeps decoding lossless.
Two-stage Diffusion-to-Autoregressive (Diffusion-to-AR, or D2A) alignment is a training paradigm designed to align a discrete diffusion LLM (dLLM) with a target autoregressive (AR) model for the purpose of efficient speculative decoding. This procedure is central to the DEER framework, which drafts long, blockwise continuations in a single step via a dLLM while verifying each proposal with an exact AR filter, ensuring lossless decoding and significant acceleration compared to conventional AR decoding or AR-drafter-based speculative decoding methods (Cheng et al., 17 Dec 2025).
1. Objectives and Speculative Decoding Context
Diffusion-to-AR alignment addresses a fundamental efficiency constraint in LLM systems: the sequential latency of AR decoding. Traditional speculative decoding improves throughput by employing a draft-then-verify mechanism, where a drafter proposes token blocks that are subsequently verified for correctness by an AR model. However, when both drafting and verifying use AR models, two intrinsic problems limit speedups: (1) step-wise uncertainty accumulation leading to reduced acceptance of draft blocks, and (2) inherently sequential decoding of the AR drafter itself. By leveraging dLLMs, which sample blocks in parallel, and aligning them closely to the AR target, D2A alignment resolves both issues, enabling single-step, blockwise drafting with high compatibility for AR verification (Cheng et al., 17 Dec 2025).
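A minimal sketch of this draft-then-verify loop is given below; `draft_block` (a one-step dLLM drafter) and `verify_prefix` (an exact AR verifier) are hypothetical helper interfaces used for illustration, not the DEER API.

```python
# Illustrative draft-then-verify loop (not the DEER implementation).
# `draft_block` stands in for a one-step dLLM drafter and `verify_prefix`
# for an exact AR verifier; both interfaces are hypothetical.

def speculative_decode(prompt_ids, draft_block, verify_prefix,
                       block_len=32, max_new_tokens=256):
    """Repeatedly draft a token block and keep the longest prefix the AR
    verifier accepts; the verifier returns one extra token (a correction on
    rejection, or a bonus token when the whole block is accepted), so every
    iteration appends at least one token."""
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        proposal = draft_block(out, block_len)               # one parallel dLLM pass
        accepted, extra = verify_prefix(out, proposal)       # exact AR filter
        out.extend(accepted)
        if extra is not None:                                # correction or bonus token
            out.append(extra)
    return out
```

Each iteration costs one parallel drafter pass plus one verifier pass over the proposed block, which is where the speedup over strictly sequential AR decoding comes from.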
2. Two-Stage Training Pipeline
The D2A alignment pipeline consists of two distinct yet complementary stages that enable a dLLM (pretrained in discrete space) to match the AR continuation style (Stage I) and to concentrate modeling capacity on the tokens most critical to AR verification (Stage II).
2.1 Stage I: AR-Style Continuation Distillation
Goal: Remove the inherent global denoising bias of vanilla diffusion LLMs, enforcing an AR-conditioned continuation regime. Given a prefix and a [SEP] token, the dLLM is trained to predict the suffix exactly as the AR model $P_{\mathrm{AR}}$ would.
Procedure:
- For each training example $x$, select a random truncation position $k$.
- Generate the training input by preserving the prefix $x_{\le k}$, masking the suffix $x_{>k}$, and appending [SEP].
- Sample a diffusion noise step $t$ uniformly at random and construct the corresponding noisy input $x_t$.
- Update the dLLM parameters $\theta$ to recover the masked suffix given $x_t$.
Loss Function: a masked-denoising cross-entropy over the suffix positions, of the form
$$\mathcal{L}_{\mathrm{I}} \;=\; \mathbb{E}_{x,\,k,\,t}\Big[-\sum_{i>k} \mathbf{1}\big[x_{t,i}=\texttt{[MASK]}\big]\,\log p_\theta\big(x_i \mid x_t\big)\Big],$$
where $x_t$ preserves the clean prefix $x_{\le k}$ and [SEP].
Only the diffusion head is finetuned; the target AR model remains frozen.
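A PyTorch-style sketch of one Stage I update on a single sequence follows; `dllm`, `MASK_ID`, and `SEP_ID` are placeholders, the absorbing-state masking kernel is an assumption, and the loss is the generic masked-denoising cross-entropy rather than the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F

MASK_ID, SEP_ID = 0, 1   # placeholder special-token ids (assumed, not from the paper)

def stage1_step(dllm, optimizer, x):
    """One Stage I update on a single token sequence `x` (1D LongTensor):
    keep a random prefix, insert [SEP] at the boundary, mask the suffix
    according to a sampled noise level t, and train the diffusion head to
    recover the masked suffix tokens."""
    L = x.numel()
    k = torch.randint(1, L, (1,)).item()          # random truncation position
    prefix, suffix = x[:k], x[k:]

    t = torch.rand(()).item()                     # noise step, sampled uniformly
    masked = torch.rand(L - k) < t                # mask ~t-fraction of the suffix
    if not masked.any():
        masked[0] = True                          # keep at least one training target
    noisy_suffix = torch.where(masked, torch.full_like(suffix, MASK_ID), suffix)

    x_t = torch.cat([prefix, torch.tensor([SEP_ID]), noisy_suffix]).unsqueeze(0)
    logits = dllm(x_t)                            # shape (1, k + 1 + (L - k), vocab)

    # Cross-entropy only on the masked suffix positions; the frozen target AR
    # model never appears in this update.
    suffix_logits = logits[0, k + 1:]             # positions after [SEP]
    loss = F.cross_entropy(suffix_logits[masked], suffix[masked])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```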
2.2 Stage II: Prefix-Conditioned Scribe Refinement
Goal: Enhance local fidelity at draft-acceptance boundaries, where AR verification sensitivity peaks.
Procedure:
- For each answer, select a suffix length $m$ uniformly at random (up to the maximum draft block length).
- The context is everything preceding those final $m$ tokens; mask only the final $m$ tokens (plus [SEP]).
- Apply an exponentially weighted loss over the masked suffix positions, with weights emphasizing proximity to the prefix boundary, as formalized below.
Loss Function: the Stage I masked-denoising cross-entropy, reweighted per position with exponentially decaying weights,
$$\mathcal{L}_{\mathrm{II}} \;=\; \mathbb{E}\Big[-\sum_{j=1}^{m} w_j\,\log p_\theta\big(x^{(j)} \mid x_t\big)\Big], \qquad w_j \propto \gamma^{\,j-1},\ \ \gamma \in (0,1),$$
where $x^{(j)}$ denotes the $j$-th of the final $m$ masked tokens, so positions nearest the boundary dominate the objective.
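A minimal sketch of this reweighted objective, assuming weights $w_j \propto \gamma^{\,j-1}$ as above with a hypothetical decay rate `gamma` (the paper's parameterization and value are not reproduced here):

```python
import torch
import torch.nn.functional as F

def stage2_weighted_loss(suffix_logits, suffix_targets, gamma=0.9):
    """Exponentially weighted cross-entropy over the final m masked tokens.
    Weights decay with distance from the prefix/[SEP] boundary; `gamma` is a
    hypothetical decay rate, not a value reported in the paper.

    suffix_logits:  (m, vocab) predictions for the masked suffix positions.
    suffix_targets: (m,) ground-truth token ids for those positions.
    """
    m = suffix_targets.numel()
    weights = gamma ** torch.arange(m, dtype=suffix_logits.dtype,
                                    device=suffix_logits.device)
    per_token = F.cross_entropy(suffix_logits, suffix_targets, reduction="none")
    return (weights * per_token).sum() / weights.sum()
```

This would replace the uniform cross-entropy of the Stage I sketch, applied only to the final $m$ masked tokens of each answer.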
In both stages, one-step denoising from full MASK at inference yields a complete block of tokens in parallel.
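The sketch below shows what such a one-step, all-[MASK] draft could look like; the function name, placeholder ids, and greedy decoding of every position are assumptions of this illustration rather than DEER's implementation, and it is one way to realize the `draft_block` helper used in the loop sketch of Section 1.

```python
import torch

def draft_block_one_step(dllm, prefix_ids, block_len, mask_id=0, sep_id=1):
    """One-step blockwise draft: append [SEP] and `block_len` [MASK] tokens
    to the prefix, run a single dLLM forward pass, and decode every masked
    position in parallel. `mask_id`/`sep_id` are placeholder ids; greedy
    argmax decoding is an assumption (sampling could be used instead)."""
    prefix = torch.as_tensor(prefix_ids, dtype=torch.long)
    masks = torch.full((block_len,), mask_id, dtype=torch.long)
    x_t = torch.cat([prefix, torch.tensor([sep_id]), masks]).unsqueeze(0)
    with torch.no_grad():
        logits = dllm(x_t)                       # (1, len(prefix) + 1 + block_len, vocab)
    block_logits = logits[0, -block_len:]        # only the masked draft positions
    return block_logits.argmax(dim=-1).tolist()  # parallel, single-pass decode
```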
3. Architectural and Algorithmic Details
The pipeline starts from a standard pretrained discrete dLLM (e.g., Open-dLLM’s 0.5B checkpoint). Only the diffusion head is updated in both stages. Training samples can be drawn from the target AR model ($P_{\mathrm{AR}}$) or any instruction-tuning corpus.
- Forward kernel and noise schedule are kept consistent across training and inference, mirroring the base dLLM’s original configuration (e.g., uniform schedule, multinomial noise).
- One-step blockwise decoding leverages the independence of positions under the dLLM draft: the block is initialized as all [MASK] tokens and denoised to a complete token block in a single pass, rather than propagating errors auto-regressively.
- At inference, the proposal is verified by the AR model, which either accepts or rejects each token sequentially, preserving the AR distribution exactly and guaranteeing lossless speculative decoding (Cheng et al., 17 Dec 2025).
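For concreteness, the sketch below implements the standard speculative-sampling acceptance rule, which is one way to realize an exact AR filter; whether DEER uses this stochastic rule or strict greedy matching is not specified here, and `p_ar` and `q_draft` are assumed per-position probability tables. Its return signature matches the hypothetical `verify_prefix` helper in the loop sketch of Section 1.

```python
import torch

def verify_block(p_ar, q_draft, proposal):
    """Standard speculative-sampling verification (illustrative, not the
    specific rule used by DEER).

    p_ar:     (block_len + 1, vocab) AR probabilities at each drafted
              position, plus one extra row for the bonus token.
    q_draft:  (block_len, vocab) dLLM draft probabilities at each position.
    proposal: (block_len,) drafted token ids, each sampled from q_draft.

    Returns the accepted prefix and one extra token chosen by the verifier,
    which together preserve the AR output distribution exactly."""
    accepted = []
    for i, tok in enumerate(proposal.tolist()):
        p, q = p_ar[i, tok], q_draft[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):      # accept w.p. min(1, p/q)
            accepted.append(tok)
        else:
            # Rejection: resample from the residual distribution max(0, p - q).
            residual = torch.clamp(p_ar[i] - q_draft[i], min=0.0)
            correction = torch.multinomial(residual / residual.sum(), 1).item()
            return accepted, correction
    # Every drafted token accepted: take a bonus token from the AR model.
    bonus = torch.multinomial(p_ar[len(accepted)], 1).item()
    return accepted, bonus
```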
4. Empirical Outcomes and Quantitative Impact
The two-stage D2A alignment in DEER yields substantial improvements over prior speculative decoding frameworks such as EAGLE-3. Empirical results, obtained with Qwen3-30B-A3B as the baseline AR model, are summarized below:
| Framework | Max Block Length | HumanEval Speedup | Avg. Acceptance Length |
|---|---|---|---|
| EAGLE-3 | 10 tokens | 2.41× | 3.21 |
| DEER | 32 tokens | 5.54× | 6.58 |
Stage II provides further acceptance-rate improvements, especially for positions nearest the prefix boundary. The following table benchmarks average acceptance lengths before and after Stage II refinement:
| Benchmark | Without Stage II | With Stage II |
|---|---|---|
| MBPP | 4.74 | 4.87 |
| CodeAlpacaPy | 3.47 | 4.04 |
| HumanEval | 5.38 | 6.58 |
| LiveCodeBench | 3.87 | 5.03 |
This demonstrates that D2A alignment not only enables substantially longer block acceptance but also leads to significant speedups in practical LLM inference (Cheng et al., 17 Dec 2025).
5. Theoretical Analysis
The necessity of two-stage alignment arises from the underlying differences between global denoising (diffusion) and local AR continuation conditioning. Stage I corrects the dLLM’s global-denoising bias by forcing it to treat the prefix plus [SEP] as fixed “past” context, aligning the dLLM draft distribution to the AR conditionals $P_{\mathrm{AR}}(x_i \mid x_{<i})$.
Stage II focuses model capacity on the most verification-sensitive positions via exponentially decaying loss weights, which is crucial for accurate speculative acceptance. In AR drafters, errors in initial tokens propagate and reduce blockwise acceptance due to uncertainty accumulation. In contrast, the one-step diffusion draft keeps each position’s KL divergence from the corresponding AR conditional bounded even as the block length grows, preventing progressive acceptance collapse. This mechanism enables DEER to maintain high acceptance rates for blocks of up to 32 tokens while remaining lossless (Cheng et al., 17 Dec 2025).
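To make the per-position claim concrete, two standard facts about speculative sampling can be combined (textbook results, not equations taken from the paper): with verifier conditional $p_i$ and draft marginal $q_i$ at position $i$, the probability of accepting the drafted token at that position satisfies
$$\Pr[\text{accept at } i] \;=\; \sum_{x}\min\big(p_i(x),\,q_i(x)\big) \;=\; 1-\mathrm{TV}(p_i,q_i) \;\ge\; 1-\sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(p_i\,\|\,q_i\big)}$$
by Pinsker’s inequality. If each position’s KL divergence stays below a fixed constant as the block grows, every position retains the same acceptance floor, whereas an AR drafter’s later conditionals drift as its own earlier draft errors enter the context.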
6. Limitations and Prospective Advances
Stage II’s exponential weighting parameter requires careful selection for stable training; overly aggressive weighting may destabilize convergence (as documented in Figure 1 of Cheng et al., 17 Dec 2025). Additionally, the lack of efficient key–value (KV) cache support in current inference frameworks for discrete diffusion models limits realized batch throughput.
Future directions include integrating KV-cache optimized diffusion inference (e.g., Fast-dLLM, dInfer), developing adaptive noise schedules responsive to varying prefix lengths, extending D2A alignment to multi-step or hybrid diffusion-AR generation regimes, and experimenting with alternative weighting schemas or masking curricula (e.g., linear or sinusoidal weights) in Stage II (Cheng et al., 17 Dec 2025).
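As a small illustration of the alternative weighting schemas mentioned above, the sketch below contrasts exponential, linear, and sinusoidal weight shapes over a draft block; the linear and sinusoidal forms are hypothetical candidates for such a schedule, not schemes evaluated in the paper.

```python
import math
import torch

def suffix_weights(m, schedule="exponential", gamma=0.9):
    """Per-position loss weights over an m-token masked suffix, largest at
    the prefix boundary (position 0). The exponential form mirrors Stage II;
    the linear and sinusoidal forms are hypothetical alternatives."""
    i = torch.arange(m, dtype=torch.float32)
    if schedule == "exponential":
        return gamma ** i
    if schedule == "linear":
        return 1.0 - i / m
    if schedule == "sinusoidal":
        return torch.cos(i / m * (math.pi / 2))
    raise ValueError(f"unknown schedule: {schedule}")
```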
A plausible implication is the broader applicability of D2A alignment for efficient lossless speculative decoding in emerging LLM architectures, conditional on alignment mechanisms maintaining exact AR output distributions under increasingly parallel drafting paradigms.