DEER: Draft with Diffusion, Verify with Autoregressive Models (2512.15176v1)
Abstract: Efficiency, a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a. drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) the decoding of AR drafters is itself inherently sequential. Together, these factors limit the achievable speedup. In this paper, we show that a diffusion LLM (dLLM) drafter can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafter with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show that DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, models, and a demo will be available at https://czc726.github.io/DEER/
Explain it Like I'm 14
What this paper is about
This paper introduces a way to make big LLMs answer faster without changing their answers. The method is called DEER. It uses a special “helper” model to quickly guess a chunk of the next words, and a main model to double-check them. The trick is that the helper is a diffusion LLM (think: it builds a whole chunk at once), not the usual type that writes word-by-word. This helps avoid slowdowns and reduces mistakes that pile up over time.
The main questions the paper asks
The researchers wanted to find out:
- Can a diffusion-based helper draft longer, more accurate chunks of text than traditional helpers that write word-by-word?
- Can this make the main model’s responses much faster, while keeping the same quality?
- How should we train this diffusion helper so its guesses match what the main model would write?
- Does the approach work on practical tasks like coding and math problems?
How the method works (in everyday language)
First, a few quick definitions to keep things clear:
- A “token” is a piece of text a model handles (often a word or part of a word).
- “Autoregressive” (AR) models write left-to-right, one token at a time.
- “Speculative decoding” is like having a fast friend draft a few words ahead, while a careful teacher checks and either accepts or fixes them.
- A “diffusion model” starts from a rough guess and “cleans it up” in a few steps; here, it can generate many tokens in parallel, like filling in a whole puzzle section at once.
The common approach today uses an AR helper that writes one token after another. This has two big problems:
- Mistakes stack up: If the helper gets an early token a little wrong, all later tokens depend on that mistake, so the draft drifts away from what the main model wants.
- It’s still slow: The helper writes tokens one-by-one, which limits speed.
DEER does something different:
- The helper is a diffusion LLM (dLLM) that proposes a whole block of tokens in one shot. Because it doesn’t depend on its own previous guesses, tiny errors don’t snowball.
- The main AR model then checks each token in that block and either accepts it or replaces it with its own token. This keeps the final output exactly as the main model would produce, just faster.
Training the diffusion helper so it “thinks” like the main model:
- Stage 1 (AR-style distillation): The helper is taught to continue from a given prefix (the text so far) in a way that matches the main model. The training data adds a special SEP marker so the helper knows where the main model stopped and where to continue.
- Stage 2 (refinement near the boundary): The helper is further trained to be extra-accurate right after the prefix—exactly where the main model will start checking. The loss (a measure of how wrong the helper is) puts more weight on these early continuation tokens so they’re very reliable.
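To make these two stages concrete, here is a minimal, hypothetical sketch in PyTorch. The special-token ids, the helper names (build_stage1_example, stage2_loss), and the default alpha are illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the two training stages described above. SEP_ID and
# MASK_ID are placeholder special-token ids, not the real vocabulary entries.
SEP_ID, MASK_ID = 151645, 151646

def build_stage1_example(prompt_ids, answer_ids, cut):
    """Stage 1 data: truncate the target model's answer at `cut`, append SEP,
    and mask the remaining suffix so the drafter learns prefix-conditioned
    continuation (the paper ablates masked span sizes like R ~ Uniform(1, 96))."""
    prefix = prompt_ids + answer_ids[:cut] + [SEP_ID]
    suffix = answer_ids[cut:]
    input_ids = prefix + [MASK_ID] * len(suffix)    # masked continuation slots
    labels = [-100] * len(prefix) + suffix          # loss only on the suffix
    return torch.tensor(input_ids), torch.tensor(labels)

def stage2_loss(logits, labels, prefix_len, alpha=1.01):
    """Stage 2: the same cross-entropy, but with exponentially decaying weights
    so tokens right after the prefix (the verification boundary) dominate.
    `alpha` is the stability-sensitive weighting factor discussed in the paper."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )                                               # shape [batch, seq]
    positions = torch.arange(labels.size(1), device=labels.device).float()
    offset = (positions - prefix_len).clamp(min=0)
    weights = alpha ** (-offset)                    # weight 1 at the boundary, decaying afterward
    mask = (labels != -100).float()
    return (per_token * weights * mask).sum() / mask.sum()
```

Stage 1 would train the drafter with the unweighted version of this cross-entropy; Stage 2 keeps the same data format but swaps in the boundary-weighted objective.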
At run time:
- The helper proposes a block of k tokens all at once.
- The main model goes through them token-by-token, accepting or replacing them. Because the helper’s later tokens don’t depend on earlier draft tokens, small mismatches don’t grow out of control.
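Here is a minimal sketch of that loop under greedy decoding (temperature 0); drafter.propose_block and target.greedy_predictions are assumed helper interfaces for illustration, not DEER's actual API:

```python
def speculative_step(target, drafter, context_ids, k=32):
    """One draft-verify cycle: the dLLM drafter proposes k tokens in a single
    denoising pass, the AR target scores the whole block in one forward pass,
    and we keep the longest prefix the target agrees with."""
    draft = drafter.propose_block(context_ids, block_size=k)   # k tokens at once
    # Assumed helper: target.greedy_predictions(seq)[j] is the target's argmax
    # token after seeing seq[: j + 1].
    preds = target.greedy_predictions(context_ids + draft)

    accepted = []
    for i, tok in enumerate(draft):
        target_choice = preds[len(context_ids) + i - 1]
        if target_choice == tok:
            accepted.append(tok)            # agreement: accept the draft token
        else:
            accepted.append(target_choice)  # first mismatch: take the target's token and stop
            break
    else:
        accepted.append(preds[-1])          # whole block accepted: add the target's bonus token
    return context_ids + accepted
```

Because every drafted position is checked against the target model's own prediction, the final text is exactly what plain greedy AR decoding would have produced; the speedup comes from how many tokens a single target forward pass can verify at once.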
What they found and why it matters
In simple terms: longer accepted drafts mean fewer verification rounds and faster answers (a rough back-of-envelope estimate follows this list). The paper reports:
- Much longer accepted chunks:
- DEER often gets up to 32 tokens in a single accepted block (vs. about 7–10 for leading AR-based methods like EAGLE-3).
- On HumanEval (a coding benchmark) with a large model (Qwen3-30B-A3B), DEER’s average accepted length and speed were both clearly higher.
- Big speedups without hurting quality:
- On HumanEval with Qwen3-30B-A3B, DEER achieved about 5.54× speedup compared to standard decoding, while EAGLE-3 got about 2.41×.
- Across several model sizes and datasets, DEER consistently produced faster tokens-per-second and longer accepted blocks.
- It scales well in batches:
- When running multiple tasks at once (batching), DEER’s parallel block proposals use the GPU more effectively, boosting throughput.
- Works beyond coding:
- Even with a less-trained helper for math tasks, DEER still beat EAGLE-3 in speed and acceptance length on GSM8K, Math500, and Minerva Math.
- A neat bonus capability:
- The trained diffusion helper shows “reliable block regeneration,” meaning it can repeatedly fill in partially hidden endings cleanly—like smartly extending code blocks from a short prefix.
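As a rough, hedged back-of-envelope (not a formula from the paper): if a speculative cycle accepts τ draft tokens on average, the target model then contributes one more token, one drafting pass costs a fraction c of a target forward pass, and verification takes one target pass per cycle, then the speedup over plain AR decoding is roughly (τ + 1) / (1 + c). For an AR drafter, c grows with the block size because drafting is itself sequential; for DEER's single-step dLLM drafter, c stays a small constant, so longer accepted blocks translate almost directly into proportionally fewer target passes.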
What this could change in the real world
- Faster assistants and agents: LLMs that need to think through long problems, write code, or handle multi-step tasks can respond much faster without changing their answers.
- Lower costs and energy: Doing more in parallel reduces total compute time. That can save money and make large models more practical to use.
- Better user experience: Shorter waiting times, smoother multi-step reasoning, and quicker code generation.
- A new path for accelerating LLMs: Using diffusion as the drafter solves the core weakness of AR helpers—errors piling up—and may become a standard tool for efficient LLM systems.
In short, DEER shows that “draft with diffusion, verify with autoregression” can make big models both fast and dependable, especially on long or complex tasks.
Knowledge Gaps
Unresolved Gaps, Limitations, and Open Questions
Below is a consolidated list of concrete, actionable gaps the paper leaves open for future research:
- Theory and correctness
- Provide a formal proof that DEER’s blockwise drafting with token-wise verification preserves exact marginal equivalence to the AR baseline under all sampling settings, and clearly specify the conditions (e.g., independence assumptions, acceptance probability formulas in Eqs. 5–6) under which the guarantee holds.
- Derive analytical bounds connecting KL divergence between the drafter and verifier distributions to expected acceptance length and speedup, and validate them empirically.
- Clarify and correct any ambiguities in the acceptance probability equations (Eqs. 5–6), including edge cases (e.g., negative or zero denominators) and their implementation details.
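For reference on the acceptance-probability gap above: the standard speculative-sampling rule, which Eqs. 5–6 presumably instantiate, accepts a drafted token ŝ_i with probability min(1, p_AR(ŝ_i | x_{1:j}, ŝ_{1:i-1}) / q(ŝ_i | x_{1:j}, ŝ_{1:i-1})) and, on rejection, resamples from the residual distribution norm(max(0, p_AR − q)). The denominator is nonzero whenever the drafter assigns positive probability to the token it actually proposed, which is the natural starting point for pinning down the zero-denominator edge case mentioned above.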
- Sampling and decoding controls
- Systematically evaluate DEER across diverse sampling regimes (temperature sweeps, top-p/top-k, logit bias, repetition penalty) and quantify their impact on acceptance length, speedup, and distributional fidelity.
- Investigate whether DEER remains lossless at higher temperatures and under non-greedy sampling, with statistical tests confirming output distribution equality to AR baselines.
- KV-cache and systems integration
- Design and benchmark an efficient KV-cache (or equivalent caching) mechanism for dLLMs, and quantify end-to-end speedups with concurrent dLLM+AR deployment under modern serving frameworks (e.g., PagedAttention, SGLang).
- Provide detailed latency breakdowns (drafting vs. verification), GPU occupancy, memory footprint, and energy consumption across hardware (A100, H100, consumer GPUs) and batch sizes.
- Long-context and domain generalization
- Evaluate DEER on very long contexts (e.g., 32k–128k tokens), measuring acceptance length scaling, throughput, and stability; analyze any degradation mechanisms unique to long prompts.
- Extend evaluation beyond code and math to natural language tasks (summarization, translation, long-form QA, dialogue) and multilingual settings; quantify cross-domain generalization.
- Drafter scaling laws and capacity planning
- Systematically study the effect of drafter size (e.g., 100M–1B+) on acceptance length and speedup across models and tasks; derive practical “minimal drafter” recommendations for different target model sizes.
- Analyze sensitivity to the choice of drafter backbone (e.g., Open-dLLM vs. converted AR checkpoints) and its training stage completion (partially converged vs. fully trained).
- Block size scheduling and optimization
- Develop and evaluate dynamic block size selection strategies that adapt k per step based on real-time acceptance statistics, verifier entropy, and throughput constraints; quantify trade-offs.
- Explore multi-block lookahead or hierarchical block proposals and their interaction with token-wise verification costs.
- Training stability and objectives
- Address the narrow stability window in Stage II (exponential weighting factor α), including alternative loss shaping (e.g., non-exponential weighting, curriculum schedules, contrastive alignment, verifier-informed distillation).
- Justify and ablate the choice of masked span size R ~ Uniform(1, 96) and the use of the SEP token; extend to longer spans and different masking curricula; report robustness across datasets.
- Reliability of emergent behaviors
- Rigorously quantify “reliable block regeneration” with appropriate metrics (e.g., coherence, local edit consistency) and ablation (with/without Stage II); assess its utility for iterative code or text refinement tasks.
- Investigate and model the reported “long-block resurgence effect” in acceptance-length distributions; test statistical significance and underlying causes (e.g., independence assumptions, verifier entropy changes).
- Output quality and safety
- Despite “lossless” claims, empirically verify output equivalence versus AR across tasks using statistical tests (e.g., distribution matching of log-likelihoods, exact-match rates under controlled sampling).
- Analyze robustness under adversarial or rare-token prefixes, domain shift, and noisy inputs; characterize failure modes (e.g., acceptance collapse) and propose detection and fallback strategies.
- EOS and boundary handling
- Validate correctness for variable-length sequences, early EOS within blocks, and boundary cases where verified tokens deviate; ensure no marginal bias at termination points.
- Structured and constrained decoding
- Examine compatibility with grammar-constrained decoding, programmatic verifiers, and tree/graph constraints; measure impacts on acceptance and speedup in structured generation.
- Multimodal applicability
- Extend DEER to multimodal LLMs (text+vision/audio) and continuous-token modalities; assess whether discrete dLLM drafting translates to these settings.
- Broader baselines and reproducibility
- Include additional baselines (e.g., Lookahead, DiffuSpec, Speculative Diffusion Decoding) under harmonized settings to isolate the benefits of discrete one-step diffusion vs. heuristic or multi-step approaches.
- Improve reproducibility: release code/configs, seeds, exact hardware/software stacks; clarify OOM findings for Medusa/Hydra (environment-specific) and provide tuned configs for fair comparisons.
- Conversion methods and architecture details
- Detail the methodology for converting AR checkpoints into discrete dLLMs (architectural changes, training recipes), and quantify resultant capability changes versus native dLLMs.
- Investigate single-step discrete diffusion viability across architectures and tasks, including when multi-step denoising is necessary and how it affects temperature control and alignment.
- Deployment and user-facing metrics
- Evaluate real-time/streaming latency, jitter, and user-perceived responsiveness under DEER; propose scheduling policies for balanced throughput and latency in interactive applications.
Glossary
- Acceptance length (τ): The average number of drafted tokens accepted per speculative cycle; "Average Acceptance Length (τ)"
- Acceptance probability: The probability the verifier accepts a drafted token; "we compute an acceptance probability"
- Acceptance region: The set of draft trajectories likely to be accepted by the verifier; "drifts outside the acceptance region"
- Autoregressive (AR) decoding: Token-by-token left-to-right generation where each token depends on previous ones; "inherent latency of autoregressive (AR) decoding"
- Batch inference scalability: The ability of a method to maintain or improve throughput as batch size increases; "Batch Inference Scalability (RQ3)"
- Blockwise generation: Producing multiple tokens in parallel as a single block rather than sequentially; "supports reliable blockwise generation"
- Cross-entropy loss: A standard training objective measuring divergence between predicted and true distributions; "cross-entropy loss"
- DEER: A speculative decoding framework that drafts with diffusion and verifies with autoregressive models; "We introduce DEER, the first speculative decoding framework"
- Diffusion LLM (dLLM): An LLM that generates text via a diffusion denoising process in discrete or continuous spaces; "diffusion LLM (dLLM)"
- Diffusion-to-Autoregressive (D2A) Alignment: A training pipeline to align a diffusion model’s outputs with an AR verifier’s distribution; "Diffusion-to-Autoregressive (D2A) Alignment pipeline"
- Denoising process: The generative procedure in diffusion models that reconstructs clean tokens from noisy inputs; "a single denoising process"
- Draft-verify scheme: A decoding approach where a lightweight drafter proposes tokens and a target model verifies them; "draft-verify scheme"
- Drafter: A lightweight model that proposes candidate continuations for speculative decoding; "AR draft models (a.k.a., drafters)"
- Distribution mismatch: Misalignment between the drafter’s proposal distribution and the verifier’s target distribution; "severe distribution mismatch"
- Kendall's τ correlation: A rank correlation metric used to assess alignment or ordering consistency; "Kendall's τ correlation"
- KL divergence: A measure of difference between two probability distributions; "distribution mismatch KL(p_AR(ŝ_i | x_{1:j+i-1}) || q_AR(ŝ_i | x_{1:j}, ŝ_{1:i-1}))"
- KV cache: Key–value caching used to speed up transformer inference by reusing past attention states; "with KV cache"
- Lossless acceleration: Speedup that preserves exact output distribution of the target model; "providing lossless acceleration"
- Masked conditioning: Conditioning generation on an observed prefix while masking the future suffix; "block-wise masked conditioning"
- Mask token: A special token used to indicate positions that should be predicted or reconstructed; "M: mask token"
- N-gram: A contiguous sequence of tokens used for heuristic drafting or matching; "n-gram matching"
- Prefix-conditioned continuation: Generating a suffix conditioned only on a given prefix; "prefix-conditioned continuation"
- Rejection sampling: A method where proposed tokens are accepted or rejected to match a target distribution; "strict rejection sampling to guarantee lossless speculative decoding"
- Reliable block regeneration: An emergent capability to repeatedly regenerate coherent masked suffix blocks; "reliable block regeneration"
- SEP token: A special separator token marking the boundary of a truncated answer/prefix; "append a SEP token"
- Speculative decoding: An acceleration method that proposes multiple tokens for verification by a stronger model; "Speculative decoding accelerates autoregressive (AR) inference"
- Suffix masking: Training or inference technique that masks a suffix for focused reconstruction; "weighted suffix masking"
- Temperature: A decoding parameter controlling randomness in sampling; "temperature=0"
- Tree-structured continuations: Drafting proposals organized as branching token trees; "tree-structured continuations"
- Uncertainty accumulation: Error propagation in sequential drafting that degrades alignment over positions; "left-to-right uncertainty accumulation"
- Verifier: The target AR model that validates or corrects drafted tokens; "the verifier to reject increasing portions of the draft"
- Verification boundary: The region near the prefix where accuracy most impacts acceptance; "AR verification boundary"
Practical Applications
Immediate Applications
Below are specific, deployable use cases that organizations can implement now, leveraging DEER’s diffusion-based drafting with AR verification and the two-stage alignment pipeline.
- Efficient LLM serving and lower latency chat/QA (software, cloud, customer support, finance, healthcare)
- What: Replace or augment existing speculative decoding (e.g., EAGLE-3, Medusa) with DEER to increase average accepted-token length and end-to-end throughput with exactness preserved.
- Tools/workflows: Integrate as an inference backend option in vLLM/SGLang/Triton servers; add a drafter sidecar (0.5B discrete dLLM) per target model; expose runtime knobs for block size, temperature, and acceptance monitoring.
- Assumptions/dependencies: Availability of discrete dLLM inference kernels; GPU memory to host a 0.5B drafter; training the drafter via Stage I/II on domain-aligned data; acceptance may vary with temperature and domain.
- Faster code assistants and IDE completions (software engineering)
- What: Use DEER to accelerate code completion and multi-line suggestions with longer accepted blocks in VSCode/JetBrains language servers.
- Tools/products: Code-specific dLLM drafter trained with Stage I/II on code corpora; plugin that surfaces latency and acceptance-length metrics; adaptive block size by file/language entropy.
- Assumptions/dependencies: High-quality code training data; target AR model with code specialization; guardrails for correctness still required (tests, linters).
- High-throughput batch generation jobs (data labeling, summarization, synthetic data, evaluation pipelines)
- What: Achieve higher tokens/s in offline workloads (summarization, data augmentation, test-time evaluation) by leveraging DEER’s superior batch scaling.
- Tools/workflows: Batch schedulers that co-optimize block size and batch size; acceptance-length telemetry; cost dashboards (tokens/s, $/1k tokens).
- Assumptions/dependencies: Diffusion drafter co-located with target AR model; minimal engineering to pipe acceptance outcomes into existing pipelines.
- Cost and carbon reduction in model serving (enterprise IT, ESG)
- What: Lower GPU hours and energy per request through higher acceptance lengths and fewer verifier calls; report gains as part of ESG metrics.
- Tools/workflows: Cost/energy attribution per request; AB tests comparing AR-only vs DEER; SLO-aware routing that prefers DEER when acceptance is high.
- Assumptions/dependencies: Accurate metering; similar quality targets; domain stability so acceptance doesn’t regress unexpectedly.
- Faster math tutoring and reasoning assistants (education)
- What: Use DEER to speed up step-by-step solutions in math tutoring apps while preserving exactness via AR verification.
- Tools/products: Small math-focused drafter (0.5B) aligned to the tutor’s AR backbone; classroom deployments to improve interactive latency.
- Assumptions/dependencies: Even partially converged dLLMs show gains, but quality is bounded by the AR verifier; curriculum-aligned data improves acceptance near prefixes.
- Safety-preserving acceleration for customer support and healthcare documentation (customer ops, healthcare IT)
- What: Maintain fidelity to the AR model while accelerating drafting; continue to run policy/safety checks on the verified sequence.
- Tools/workflows: Post-verification policy filters; acceptance-length thresholds to trigger stricter verification in sensitive contexts.
- Assumptions/dependencies: Compliance constraints on training data; safety filters remain mandatory; domain-specific drift can affect acceptance.
- MLOps observability and auto-tuning for speculative decoding (platform engineering)
- What: Productize DEER’s acceptance telemetry to auto-tune block size, temperature, and Stage II weighting (α) for stability (a minimal auto-tuning sketch follows this item).
- Tools/workflows: Dashboards for acceptance distribution, long-block “resurgence” alerts, and stability windows; per-endpoint tuning profiles.
- Assumptions/dependencies: Stage II shows a narrow stable α range (e.g., 1.01 vs 1.05 diverging); continuous monitoring is needed.
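As a concrete illustration of the auto-tuning item above, here is a minimal, hypothetical sketch; the class name, thresholds, and telemetry interface are assumptions, not part of DEER:

```python
from collections import deque

class BlockSizeTuner:
    """Adapts the drafter's block size k from recent acceptance lengths:
    grow k while most of each block is accepted, shrink it when the
    verifier starts rejecting early."""

    def __init__(self, k=16, k_min=4, k_max=64, window=128):
        self.k = k
        self.k_min, self.k_max = k_min, k_max
        self.history = deque(maxlen=window)   # recent per-cycle acceptance lengths

    def record(self, accepted_len):
        self.history.append(accepted_len)

    def next_block_size(self):
        if len(self.history) < self.history.maxlen // 2:
            return self.k                     # not enough telemetry yet
        mean_accept = sum(self.history) / len(self.history)
        if mean_accept > 0.8 * self.k:        # nearly whole blocks accepted: try larger blocks
            self.k = min(self.k * 2, self.k_max)
        elif mean_accept < 0.3 * self.k:      # early rejections dominate: back off
            self.k = max(self.k // 2, self.k_min)
        return self.k
```

In a serving loop, record() would be fed the accepted length of each speculative cycle and next_block_size() would set k for the next drafting call; the 0.8/0.3 thresholds and the doubling/halving policy are placeholders to be tuned per endpoint.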
Long-Term Applications
These applications need further research, system support, or scaling (e.g., robust dLLM inference stacks, domain-aligned training data, or hardware support).
- Production-grade dLLM inference stacks with KV-like caching and fused kernels (software/hardware co-design)
- What: Build mature runtimes for discrete diffusion LLMs, including caching, tensor parallelism, and fused ops to fully exploit DEER at scale.
- Potential products: “dLLM-serve” frameworks; Triton kernels for discrete denoising; unified AR+diffusion scheduler.
- Dependencies: Library and kernel maturity; hardware-friendly diffusion ops; interoperability with AR verifiers’ KV caches.
- Domain-specialized drafters for regulated sectors (healthcare, legal, finance)
- What: Train Stage I/II-aligned drafters on compliant, domain-specific corpora to improve acceptance and latency under strict quality constraints.
- Potential workflows: Specialty endpoints (e.g., clinical summarization, KYC/AML rationale synthesis) with DEER-backed decoding.
- Dependencies: Access to compliant data; legal/privacy review; rigorous evaluation for domain generalization and failure modes.
- Interactive block regeneration for editing and refactoring (productivity, software engineering, creative tools)
- What: Leverage “reliable block regeneration” to enable chunk-level rewriting in code editors and document processors (e.g., refactor this function; rephrase this paragraph).
- Potential products: Block-aware IDE refactor tools; document editors with locality-preserving rewrite modes.
- Dependencies: UI/UX for partial masking; consistency checks and tests; integration with SCM and review workflows.
- Multi-agent and multi-service shared drafting layers (agentic systems, orchestration)
- What: Use a shared dLLM drafter across many agents/services, with specialized AR verifiers per task; amortize drafting costs while preserving task fidelity.
- Potential workflows: Central drafting microservice; verifier-specific acceptance policies; cross-agent block caching.
- Dependencies: Cross-task generalization; routing and isolation; acceptance variance across tasks.
- Privacy-preserving split inference (edge/cloud)
- What: Perform on-device diffusion drafting of long blocks; send only minimal verification context to the cloud AR model to reduce data exposure.
- Potential products: Privacy-first mobile assistants; enterprise BYOK policies with on-prem drafters.
- Dependencies: Device compute/memory for a 0.5B drafter; careful boundary selection; secure transport and auditing.
- Adaptive safety and uncertainty governance using acceptance signals (policy, platform governance)
- What: Use acceptance lengths and per-token acceptance probabilities as live uncertainty signals to gate risky generations, throttle temperature, or escalate to human review (a minimal gating sketch follows this item).
- Potential workflows: Red-team triggers when long accepted blocks appear in sensitive tasks; dynamic policy tightening when acceptance drops.
- Dependencies: Calibrated acceptance metrics; policy tuning to minimize false positives/negatives.
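A minimal, hypothetical sketch of such an acceptance-signal gate is below; the thresholds and the escalation hook are illustrative assumptions:

```python
def governance_gate(acceptance_lengths, block_size, low_frac=0.25, streak=3,
                    escalate=print):
    """Treat persistently short acceptance as an uncertainty signal: if the
    verifier keeps rejecting most of each drafted block, flag the generation
    for stricter verification, lower temperature, or human review."""
    recent = acceptance_lengths[-streak:]
    if len(recent) == streak and all(a < low_frac * block_size for a in recent):
        escalate(f"acceptance below {low_frac:.0%} of block for {streak} cycles")
        return True   # caller may tighten policy or route to review
    return False
```

Here acceptance_lengths would be the per-cycle accepted lengths already collected for telemetry; calibrating low_frac and streak against false positives and negatives is exactly the dependency noted above.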
- Mixture-of-verifiers and ensemble safety (safety engineering)
- What: Verify diffusion-drafted blocks with multiple AR verifiers (accuracy-focused, safety-focused) to jointly optimize quality and risk.
- Potential products: Ensemble verification services; policy-compliant acceptance arbitration.
- Dependencies: Cost overhead of multiple verifiers; arbitration logic; latency management.
- Training-time integration of D2A alignment into base models (model pretraining)
- What: Pretrain or co-train backbones with diffusion-to-AR alignment objectives to natively support blockwise drafting, reducing finetuning costs later.
- Potential workflows: Joint training curricula for base LLMs; native support for DEER-like decoding.
- Dependencies: Compute budgets; objective balancing to avoid degrading AR capabilities.
- Multimodal diffusion drafting for speech/vision-LLMs (multimodal AI)
- What: Extend discrete diffusion drafting to multimodal tokens (e.g., audio units, vision tokens) with AR verification for captioning, VQA, or voice assistants.
- Potential products: Faster captioners and multimodal tutors; on-device audio drafting with cloud verification.
- Dependencies: Tokenization schemes; multimodal dLLM training data; verifier compatibility.
- Grid- and cost-aware schedulers for sustainable serving (energy, cloud economics)
- What: Incorporate DEER-aware schedulers that adjust block sizes and batch sizes based on energy price/carbon intensity and SLOs.
- Potential workflows: Carbon-optimal batch windows; cost-aware inference routing between regions.
- Dependencies: Real-time carbon signals; robust acceptance forecasts; compliance reporting.
Cross-cutting assumptions and dependencies to consider
- Discrete dLLM availability and maturity: Inference kernels, deployment frameworks, and KV-like caching for dLLMs are less mature than for AR LLMs; expect engineering effort.
- Data and alignment: Stage I/II alignment requires domain-appropriate teacher outputs; distribution mismatch will reduce acceptance if data are off-domain.
- Stability windows: Stage II’s exponential weighting (α) shows narrow stability ranges; automated tuning and monitoring are advisable.
- Resource footprint: A 0.5B drafter adds memory/compute overhead; ensure the speedup and cost reduction outweigh the extra model footprint.
- Quality invariance: DEER is “lossless” with AR verification, but acceptance rates depend on sampling temperature, task entropy, and prompt style; production A/B testing recommended.
- Governance and compliance: Domain deployments (e.g., healthcare, finance) require dataset governance, privacy safeguards, and safety verification despite higher efficiency.