Two-Stage Diffusion-to-AR Alignment
- The paper introduces a two-stage D2A alignment that trains a discrete diffusion model to mimic AR continuation, enabling efficient speculative decoding.
- It employs Stage I for AR-style continuation distillation and Stage II for targeted refinement of draft boundaries, significantly improving block acceptance rates.
- Empirical results show up to a 5.54× speedup and longer accepted token blocks, while exact AR verification keeps decoding lossless.
Two-stage Diffusion-to-Autoregressive (Diffusion-to-AR, or D2A) alignment is a training paradigm designed to align a discrete diffusion LLM (dLLM) with a target autoregressive (AR) model for the purpose of efficient speculative decoding. This procedure is central to the DEER framework, which drafts long, blockwise continuations in a single step via a dLLM while verifying each proposal with an exact AR filter, ensuring lossless decoding and significant acceleration compared to conventional AR decoding or AR-drafter-based speculative decoding methods (Cheng et al., 17 Dec 2025).
1. Objectives and Speculative Decoding Context
Diffusion-to-AR alignment addresses a fundamental efficiency constraint in LLM systems: the sequential latency of AR decoding. Traditional speculative decoding improves throughput by employing a draft-then-verify mechanism, where a drafter proposes token blocks that are subsequently verified for correctness by an AR model. However, when both drafting and verifying use AR models, two intrinsic problems limit speedups: (1) step-wise uncertainty accumulation leading to reduced acceptance of draft blocks, and (2) inherently sequential decoding of the AR drafter itself. By leveraging dLLMs, which sample blocks in parallel, and aligning them closely to the AR target, D2A alignment resolves both issues, enabling single-step, blockwise drafting with high compatibility for AR verification (Cheng et al., 17 Dec 2025).
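A minimal sketch of this draft-then-verify loop is given below; `draft_block` (a one-step dLLM drafter) and `verify_prefix` (an exact AR verifier) are hypothetical helper interfaces used for illustration, not the DEER API.

```python
# Illustrative draft-then-verify loop (not the DEER implementation).
# `draft_block` stands in for a one-step dLLM drafter and `verify_prefix`
# for an exact AR verifier; both interfaces are hypothetical.

def speculative_decode(prompt_ids, draft_block, verify_prefix,
                       block_len=32, max_new_tokens=256):
    """Repeatedly draft a token block and keep the longest prefix the AR
    verifier accepts; the verifier returns one extra token (a correction on
    rejection, or a bonus token when the whole block is accepted), so every
    iteration appends at least one token."""
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        proposal = draft_block(out, block_len)               # one parallel dLLM pass
        accepted, extra = verify_prefix(out, proposal)       # exact AR filter
        out.extend(accepted)
        if extra is not None:                                # correction or bonus token
            out.append(extra)
    return out
```

Each iteration costs one parallel drafter pass plus one verifier pass over the proposed block, which is where the speedup over strictly sequential AR decoding comes from.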
2. Two-Stage Training Pipeline
The D2A alignment pipeline consists of two distinct yet complementary stages that enable a dLLM (pretrained in discrete space) to match the AR continuation style (Stage I) and to concentrate modeling capacity on the tokens most critical to AR verification (Stage II).
2.1 Stage I: AR-Style Continuation Distillation
Goal: Remove the inherent global denoising bias of vanilla diffusion LLMs, enforcing an AR-conditioned continuation regime. Given a prefix and a [SEP] token, the dLLM is trained to predict the suffix exactly as the AR model $P_{\mathrm{AR}}$ would.
Procedure:
- For each training example $x$, select a random truncation position $k$.
- Generate the training input by preserving the prefix $x_{\le k}$, masking the suffix $x_{>k}$, and appending [SEP].
- Sample a diffusion noise step $t$ uniformly at random and construct the corresponding noisy input $x_t$.
- Update the dLLM parameters $\theta$ to recover the masked suffix given $x_t$.
Loss Function: a masked-denoising cross-entropy over the suffix positions, of the form
$$\mathcal{L}_{\mathrm{I}} \;=\; \mathbb{E}_{x,\,k,\,t}\Big[-\sum_{i>k} \mathbf{1}\big[x_{t,i}=\texttt{[MASK]}\big]\,\log p_\theta\big(x_i \mid x_t\big)\Big],$$
where $x_t$ preserves the clean prefix $x_{\le k}$ and [SEP].
Only the diffusion head is finetuned; the target AR model remains frozen.
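A PyTorch-style sketch of one Stage I update on a single sequence follows; `dllm`, `MASK_ID`, and `SEP_ID` are placeholders, the absorbing-state masking kernel is an assumption, and the loss is the generic masked-denoising cross-entropy rather than the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F

MASK_ID, SEP_ID = 0, 1   # placeholder special-token ids (assumed, not from the paper)

def stage1_step(dllm, optimizer, x):
    """One Stage I update on a single token sequence `x` (1D LongTensor):
    keep a random prefix, insert [SEP] at the boundary, mask the suffix
    according to a sampled noise level t, and train the diffusion head to
    recover the masked suffix tokens."""
    L = x.numel()
    k = torch.randint(1, L, (1,)).item()          # random truncation position
    prefix, suffix = x[:k], x[k:]

    t = torch.rand(()).item()                     # noise step, sampled uniformly
    masked = torch.rand(L - k) < t                # mask ~t-fraction of the suffix
    if not masked.any():
        masked[0] = True                          # keep at least one training target
    noisy_suffix = torch.where(masked, torch.full_like(suffix, MASK_ID), suffix)

    x_t = torch.cat([prefix, torch.tensor([SEP_ID]), noisy_suffix]).unsqueeze(0)
    logits = dllm(x_t)                            # shape (1, k + 1 + (L - k), vocab)

    # Cross-entropy only on the masked suffix positions; the frozen target AR
    # model never appears in this update.
    suffix_logits = logits[0, k + 1:]             # positions after [SEP]
    loss = F.cross_entropy(suffix_logits[masked], suffix[masked])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```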
2.2 Stage II: Prefix-Conditioned Scribe Refinement
Goal: Enhance local fidelity at draft-acceptance boundaries, where AR verification sensitivity peaks.
Procedure:
- For each answer, select a suffix length $m$ uniformly at random (up to the maximum draft block length).
- The context is everything preceding those final $m$ tokens; mask only the final $m$ tokens (plus [SEP]).
- Apply an exponentially weighted loss over the masked suffix positions, with weights emphasizing proximity to the prefix boundary, as formalized below.
Loss Function: the Stage I masked-denoising cross-entropy, reweighted per position with exponentially decaying weights,
$$\mathcal{L}_{\mathrm{II}} \;=\; \mathbb{E}\Big[-\sum_{j=1}^{m} w_j\,\log p_\theta\big(x^{(j)} \mid x_t\big)\Big], \qquad w_j \propto \gamma^{\,j-1},\ \ \gamma \in (0,1),$$
where $x^{(j)}$ denotes the $j$-th of the final $m$ masked tokens, so positions nearest the boundary dominate the objective.
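A minimal sketch of this reweighted objective, assuming weights $w_j \propto \gamma^{\,j-1}$ as above with a hypothetical decay rate `gamma` (the paper's parameterization and value are not reproduced here):

```python
import torch
import torch.nn.functional as F

def stage2_weighted_loss(suffix_logits, suffix_targets, gamma=0.9):
    """Exponentially weighted cross-entropy over the final m masked tokens.
    Weights decay with distance from the prefix/[SEP] boundary; `gamma` is a
    hypothetical decay rate, not a value reported in the paper.

    suffix_logits:  (m, vocab) predictions for the masked suffix positions.
    suffix_targets: (m,) ground-truth token ids for those positions.
    """
    m = suffix_targets.numel()
    weights = gamma ** torch.arange(m, dtype=suffix_logits.dtype,
                                    device=suffix_logits.device)
    per_token = F.cross_entropy(suffix_logits, suffix_targets, reduction="none")
    return (weights * per_token).sum() / weights.sum()
```

This would replace the uniform cross-entropy of the Stage I sketch, applied only to the final $m$ masked tokens of each answer.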
In both stages, one-step denoising from full MASK at inference yields a complete block of tokens in parallel.
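The sketch below shows what such a one-step, all-[MASK] draft could look like; the function name, placeholder ids, and greedy decoding of every position are assumptions of this illustration rather than DEER's implementation, and it is one way to realize the `draft_block` helper used in the loop sketch of Section 1.

```python
import torch

def draft_block_one_step(dllm, prefix_ids, block_len, mask_id=0, sep_id=1):
    """One-step blockwise draft: append [SEP] and `block_len` [MASK] tokens
    to the prefix, run a single dLLM forward pass, and decode every masked
    position in parallel. `mask_id`/`sep_id` are placeholder ids; greedy
    argmax decoding is an assumption (sampling could be used instead)."""
    prefix = torch.as_tensor(prefix_ids, dtype=torch.long)
    masks = torch.full((block_len,), mask_id, dtype=torch.long)
    x_t = torch.cat([prefix, torch.tensor([sep_id]), masks]).unsqueeze(0)
    with torch.no_grad():
        logits = dllm(x_t)                       # (1, len(prefix) + 1 + block_len, vocab)
    block_logits = logits[0, -block_len:]        # only the masked draft positions
    return block_logits.argmax(dim=-1).tolist()  # parallel, single-pass decode
```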
3. Architectural and Algorithmic Details
The pipeline starts from a standard pretrained discrete dLLM (e.g., Open-dLLM’s 0.5B checkpoint). Only the diffusion head is updated in both stages. Training samples can be drawn from the target AR model ($P_{\mathrm{AR}}$) or any instruction-tuning corpus.
- Forward kernel and noise schedule are kept consistent across training and inference, mirroring the base dLLM’s original configuration (e.g., uniform schedule, multinomial noise).
- One-step blockwise decoding leverages the independence of positions under the dLLM draft: the block is initialized as all [MASK] tokens and denoised to a complete token block in a single pass, rather than propagating errors auto-regressively.
- At inference, the proposal is verified by the AR model, which either accepts or rejects each token sequentially, preserving the AR distribution exactly and guaranteeing lossless speculative decoding (Cheng et al., 17 Dec 2025).
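For concreteness, the sketch below implements the standard speculative-sampling acceptance rule, which is one way to realize an exact AR filter; whether DEER uses this stochastic rule or strict greedy matching is not specified here, and `p_ar` and `q_draft` are assumed per-position probability tables. Its return signature matches the hypothetical `verify_prefix` helper in the loop sketch of Section 1.

```python
import torch

def verify_block(p_ar, q_draft, proposal):
    """Standard speculative-sampling verification (illustrative, not the
    specific rule used by DEER).

    p_ar:     (block_len + 1, vocab) AR probabilities at each drafted
              position, plus one extra row for the bonus token.
    q_draft:  (block_len, vocab) dLLM draft probabilities at each position.
    proposal: (block_len,) drafted token ids, each sampled from q_draft.

    Returns the accepted prefix and one extra token chosen by the verifier,
    which together preserve the AR output distribution exactly."""
    accepted = []
    for i, tok in enumerate(proposal.tolist()):
        p, q = p_ar[i, tok], q_draft[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):      # accept w.p. min(1, p/q)
            accepted.append(tok)
        else:
            # Rejection: resample from the residual distribution max(0, p - q).
            residual = torch.clamp(p_ar[i] - q_draft[i], min=0.0)
            correction = torch.multinomial(residual / residual.sum(), 1).item()
            return accepted, correction
    # Every drafted token accepted: take a bonus token from the AR model.
    bonus = torch.multinomial(p_ar[len(accepted)], 1).item()
    return accepted, bonus
```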
4. Empirical Outcomes and Quantitative Impact
The two-stage D2A alignment in DEER yields substantial improvements over prior speculative decoding frameworks such as EAGLE-3. Empirical results, obtained with Qwen3-30B-A3B as the baseline AR model, are summarized below:
| Framework | Max Block Length | HumanEval Speedup | Avg. Acceptance Length |
|---|---|---|---|
| EAGLE-3 | 10 tokens | 2.41× | 3.21 |
| DEER | 32 tokens | 5.54× | 6.58 |
Stage II provides further acceptance-rate improvements, especially for positions nearest the prefix boundary. The following table benchmarks average acceptance lengths before and after Stage II refinement:
| Benchmark | Without Stage II | With Stage II |
|---|---|---|
| MBPP | 4.74 | 4.87 |
| CodeAlpacaPy | 3.47 | 4.04 |
| HumanEval | 5.38 | 6.58 |
| LiveCodeBench | 3.87 | 5.03 |
This demonstrates that D2A alignment not only enables substantially longer block acceptance but also leads to significant speedups in practical LLM inference (Cheng et al., 17 Dec 2025).
5. Theoretical Analysis
The necessity of two-stage alignment arises from the underlying differences between global denoising (diffusion) and local AR continuation conditioning. Stage I corrects the dLLM’s global-denoising bias by forcing it to treat the prefix plus [SEP] as fixed “past” context, aligning the dLLM draft distribution to the AR conditionals $P_{\mathrm{AR}}(x_i \mid x_{<i})$.
Stage II focuses model capacity on the most verification-sensitive positions via exponentially decaying loss weights, which is crucial for accurate speculative acceptance. In AR drafters, errors in initial tokens propagate and reduce blockwise acceptance due to uncertainty accumulation. In contrast, the one-step diffusion draft keeps each position’s KL divergence from the corresponding AR conditional bounded even as the block length grows, preventing progressive acceptance collapse. This mechanism enables DEER to maintain high acceptance rates for blocks of up to 32 tokens while remaining lossless (Cheng et al., 17 Dec 2025).
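To make the per-position claim concrete, two standard facts about speculative sampling can be combined (textbook results, not equations taken from the paper): with verifier conditional $p_i$ and draft marginal $q_i$ at position $i$, the probability of accepting the drafted token at that position satisfies
$$\Pr[\text{accept at } i] \;=\; \sum_{x}\min\big(p_i(x),\,q_i(x)\big) \;=\; 1-\mathrm{TV}(p_i,q_i) \;\ge\; 1-\sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(p_i\,\|\,q_i\big)}$$
by Pinsker’s inequality. If each position’s KL divergence stays below a fixed constant as the block grows, every position retains the same acceptance floor, whereas an AR drafter’s later conditionals drift as its own earlier draft errors enter the context.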
6. Limitations and Prospective Advances
Stage II’s exponential weighting parameter requires careful selection for stable training; overly aggressive weighting may destabilize convergence (as documented in Figure 1 of Cheng et al., 17 Dec 2025). Additionally, the lack of efficient key–value (KV) cache support in current inference frameworks for discrete diffusion models limits realized batch throughput.
Future directions include integrating KV-cache optimized diffusion inference (e.g., Fast-dLLM, dInfer), developing adaptive noise schedules responsive to varying prefix lengths, extending D2A alignment to multi-step or hybrid diffusion-AR generation regimes, and experimenting with alternative weighting schemas or masking curricula (e.g., linear or sinusoidal weights) in Stage II (Cheng et al., 17 Dec 2025).
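As a small illustration of the alternative weighting schemas mentioned above, the sketch below contrasts exponential, linear, and sinusoidal weight shapes over a draft block; the linear and sinusoidal forms are hypothetical candidates for such a schedule, not schemes evaluated in the paper.

```python
import math
import torch

def suffix_weights(m, schedule="exponential", gamma=0.9):
    """Per-position loss weights over an m-token masked suffix, largest at
    the prefix boundary (position 0). The exponential form mirrors Stage II;
    the linear and sinusoidal forms are hypothetical alternatives."""
    i = torch.arange(m, dtype=torch.float32)
    if schedule == "exponential":
        return gamma ** i
    if schedule == "linear":
        return 1.0 - i / m
    if schedule == "sinusoidal":
        return torch.cos(i / m * (math.pi / 2))
    raise ValueError(f"unknown schedule: {schedule}")
```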
A plausible implication is the broader applicability of D2A alignment for efficient lossless speculative decoding in emerging LLM architectures, conditional on alignment mechanisms maintaining exact AR output distributions under increasingly parallel drafting paradigms.