dInfer: An Efficient Inference Framework for Diffusion Language Models (2510.08666v2)
Abstract: Diffusion-based LLMs (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. More and more open-source dLLMs are emerging, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components--model, diffusion iteration manager, decoding strategy, and KV-cache manager--and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to the AR model Qwen2.5-3B (with a comparable number of activated parameters and comparable performance), which is highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
Explain it Like I'm 14
What is this paper about?
This paper introduces dInfer, a fast and flexible system for running a new kind of LLM called a diffusion LLM (dLLM). Unlike normal LLMs that write one token (piece of a word) at a time, diffusion models “clean up” a whole sentence in several rounds, like sharpening a blurry photo step by step. dInfer makes this process much faster and more reliable, so these models are practical to use.
What questions does it try to answer?
The paper focuses on three simple questions:
- How can we make diffusion LLMs run faster without hurting the quality of their answers?
- How can we safely decode many tokens at once (in parallel) without the results getting worse?
- Can we build a standard, open-source framework so different diffusion models can be tested fairly and improved consistently?
How did the researchers approach it?
The authors built dInfer as a modular framework, like a set of Lego blocks you can snap together. Each block handles a key part of the model's inference (generation) process; a rough code sketch of how the blocks fit together appears after the list below.
The four building blocks in dInfer
- Model: The actual LLM (like LLaDA-MoE), which predicts the masked tokens at each denoising step.
- Diffusion iteration manager: The “round controller” that decides which parts of the sentence to improve in each step and passes information between steps.
- Decoding strategy: The rules for choosing which tokens to “lock in” (commit) during each round.
- KV-cache manager: A smart memory system that reuses past calculations where possible, so the model doesn’t redo expensive work every time.
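To make the decomposition concrete, here is a rough, hypothetical sketch of how the four pieces could fit together. All class, method, and constant names are illustrative assumptions, not dInfer's actual API.

```python
from typing import Protocol, Sequence

MASK_ID = -1  # placeholder id for a still-masked position (assumption)

class Model(Protocol):
    def forward(self, tokens: Sequence[int]) -> "list[list[float]]":
        """Return per-position logits for a partially masked sequence."""
        ...

class DecodingStrategy(Protocol):
    def select(self, logits, masked_positions: Sequence[int]) -> dict[int, int]:
        """Pick {position: token_id} to commit this iteration
        (threshold / hierarchical / credit decoding plug in here)."""
        ...

class KVCacheManager(Protocol):
    def refresh(self, tokens: Sequence[int], active_block: range) -> None:
        """Recompute only the K/V states near the active block (vicinity refresh)."""
        ...

def generate(model: Model, decoder: DecodingStrategy, cache: KVCacheManager,
             prompt: list[int], gen_len: int, block_size: int = 64) -> list[int]:
    """The iteration-manager role: run denoising rounds, block by block,
    until no position in the current block is masked."""
    tokens = prompt + [MASK_ID] * gen_len
    for start in range(len(prompt), len(tokens), block_size):
        block = range(start, min(start + block_size, len(tokens)))
        while any(tokens[i] == MASK_ID for i in block):
            cache.refresh(tokens, block)
            logits = model.forward(tokens)
            masked = [i for i in block if tokens[i] == MASK_ID]
            # strategies like hierarchical decoding guarantee at least one
            # commit per round, so this inner loop always terminates
            for pos, tok in decoder.select(logits, masked).items():
                tokens[pos] = tok
    return tokens
```

Because each piece sits behind a small interface, a new decoding strategy or cache policy can be swapped in without touching the rest of the loop.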
Smart decoding strategies (how dInfer chooses tokens to commit)
To safely decode many tokens at once, dInfer uses the following strategies (sketched in code after this list):
- Threshold decoding: Commit tokens only when the model is very confident (above a preset threshold).
- Hierarchical decoding (new): Break masked spans into smaller chunks and make sure at least one token per chunk is committed. This reduces local confusion and speeds things up.
- Credit decoding (new): Give “credit points” to tokens that keep showing up as likely across multiple rounds. Tokens with strong, consistent scores get committed sooner.
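Here is a hedged, simplified sketch of how these three strategies might be expressed over a per-position probability matrix. The thresholds, the credit rule, and all function names are illustrative assumptions, not dInfer's exact formulations.

```python
import torch

def threshold_commit(probs: torch.Tensor, masked: list[int],
                     tau: float = 0.9) -> dict[int, int]:
    """Threshold decoding: commit every masked position whose top
    probability exceeds tau (probs has shape [seq_len, vocab_size])."""
    conf, tok = probs[masked].max(dim=-1)
    return {p: int(t) for p, c, t in zip(masked, conf.tolist(), tok.tolist())
            if c >= tau}

def credit_commit(probs: torch.Tensor, masked: list[int],
                  credits: dict[int, tuple[int, float]],
                  tau: float = 0.9, credit_tau: float = 2.0) -> dict[int, int]:
    """Credit decoding (illustrative rule): a position earns credit each round
    its top token stays the same; enough accumulated credit commits the token
    even if no single round clears the confidence threshold."""
    out = {}
    conf, tok = probs[masked].max(dim=-1)
    for p, c, t in zip(masked, conf.tolist(), tok.tolist()):
        prev_tok, prev_credit = credits.get(p, (t, 0.0))
        credit = prev_credit + c if t == prev_tok else c  # reset if token flips
        credits[p] = (t, credit)
        if c >= tau or credit >= credit_tau:
            out[p] = t
    return out

def hierarchical_commit(probs: torch.Tensor, masked: list[int],
                        tau: float = 0.9) -> dict[int, int]:
    """Hierarchical decoding (illustrative): recursively split the masked span
    and force at least one commit per smallest sub-span, so every iteration
    makes progress even when no token clears the threshold."""
    if not masked:
        return {}
    if len(masked) <= 2:
        out = threshold_commit(probs, masked, tau)
        if not out:  # force progress: commit the single most confident token
            conf, tok = probs[masked].max(dim=-1)
            best = int(conf.argmax())
            out = {masked[best]: int(tok[best])}
        return out
    mid = len(masked) // 2
    return {**hierarchical_commit(probs, masked[:mid], tau),
            **hierarchical_commit(probs, masked[mid:], tau)}
```

The guaranteed per-span commit in the hierarchical variant is what keeps every iteration productive even when the model is uncertain everywhere.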
Smarter memory and iteration tricks
- Iteration smoothing (new): Instead of ignoring uncertain positions, dInfer reuses their probability distributions to create better embeddings (representations) for the next round. Think of it like carrying forward a “soft hint” so later steps start with a smarter guess.
- Vicinity KV-cache refresh (new): KV-cache is a kind of “sticky note” that remembers attention calculations. In diffusion models, the whole sentence can change each round, so you can’t reuse everything. dInfer refreshes only the cache around the current decoding block (nearby tokens), and updates the whole cache when a block is fully decoded. This balances speed and accuracy.
- Blockwise decoding: Split the sequence into fixed-size blocks and improve them step by step (a simplified sketch of this loop, combining the tricks above, follows this list).
- System-level speedups:
- Tensor parallelism: Split heavy math across multiple GPUs.
- Expert parallelism: For MoE models (mixture of experts), send work to different “specialist” networks and still run fast even with one request at a time.
- PyTorch compile + CUDA Graphs: Fuse many small GPU operations into bigger ones to reduce overhead.
- Loop unrolling: Keep GPU work flowing smoothly across iterations (fewer idle gaps).
- Early stopping: If the end-of-sequence token appears, stop immediately to avoid wasting time.
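To show how these pieces interact, here is an illustrative sketch of the outer decoding loop. The model interface (taking embeddings and a `refresh_window`), the smoothing weight `ALPHA`, and the placeholder token ids are assumptions for the sketch, not dInfer's actual code.

```python
import torch

MASK_ID, EOS_ID = 0, 1        # placeholder special-token ids (assumptions)
BLOCK, ALPHA = 64, 0.3        # block size as in the paper; ALPHA is illustrative

def decode(model, embed, prompt: torch.Tensor, gen_len: int, commit_fn):
    """Illustrative outer loop: blockwise decoding with iteration smoothing,
    vicinity KV-cache refresh, and EOS early termination. `model` is assumed
    to take input embeddings plus a refresh window and return logits of shape
    [seq_len, vocab]; `commit_fn` is one of the strategies sketched earlier."""
    tokens = torch.cat([prompt, torch.full((gen_len,), MASK_ID)])
    prev_probs = None                               # carried across iterations
    for start in range(len(prompt), len(tokens), BLOCK):
        block = slice(start, min(start + BLOCK, len(tokens)))
        while bool((tokens[block] == MASK_ID).any()):
            x = embed(tokens)                       # hard token embeddings
            if prev_probs is not None:              # iteration smoothing:
                soft = prev_probs @ embed.weight    # expected ("soft") embedding
                m = tokens == MASK_ID
                x[m] = (1 - ALPHA) * x[m] + ALPHA * soft[m]
            # vicinity refresh: only K/V states around `block` are recomputed;
            # a full cache update would follow once the block is finished.
            logits = model(x, refresh_window=block)
            probs = logits.softmax(dim=-1)
            masked = [p for p in range(block.start, block.stop)
                      if int(tokens[p]) == MASK_ID]
            for pos, tok in commit_fn(probs, masked).items():
                tokens[pos] = tok                   # lock in confident tokens
            prev_probs = probs
        if bool((tokens[block] == EOS_ID).any()):   # early stopping: EOS seen,
            break                                   # later blocks are redundant
    return tokens
```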
Training tweak to speed up decoding further
- Trajectory Distillation (TD): Teach the model to “jump” ahead by learning from its best past generation paths. This helps the model decode more tokens per step, increasing speed without changing the base architecture. The result is LLaDA-MoE-TD, a faster version of LLaDA-MoE.
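As a rough illustration of the idea (not the paper's exact recipe), trajectory distillation can be pictured as turning a recorded multi-step denoising run into "skip-ahead" training pairs:

```python
# Illustrative sketch only: the paper's exact loss, pairing scheme, and data
# pipeline for Trajectory Distillation are not reproduced here.
def make_skip_pairs(trajectory, jump=4):
    """trajectory: list of token sequences, one per denoising iteration,
    ending in a fully decoded, verified-good output. Each training pair
    asks the model to predict the state `jump` iterations ahead, teaching
    it to commit more tokens per forward pass."""
    pairs = []
    for t in range(len(trajectory) - 1):
        target_step = min(t + jump, len(trajectory) - 1)
        pairs.append((trajectory[t], trajectory[target_step]))
    return pairs

# Fine-tuning then rewards the model for predicting, from the earlier (more
# masked) state, the tokens that were only revealed several iterations later,
# so a single new iteration learns to cover several old ones.
```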
What did they find?
On strong hardware (8× NVIDIA H800 GPUs) and real benchmarks (code, math, and instruction-following), dInfer achieved:
- Over 1,100 tokens per second (TPS) on HumanEval (a code test).
- Over 800 TPS on average across six benchmarks at batch size 1 (one question at a time).
Compared to other systems:
- About 10× faster than Fast-dLLM, with similar accuracy.
- About 2–3× faster than a well-optimized autoregressive model (Qwen2.5-3B running on vLLM), while keeping quality comparable.
With Trajectory Distillation (LLaDA-MoE-TD):
- Average TPS increased to about 847, beating the autoregressive baseline by more than 3×.
- The model committed more tokens per iteration (higher TPF), meaning each round did more useful work.
These results show diffusion LLMs can be extremely fast in practice—sometimes even faster than traditional models—if you design the inference carefully.
Why does this matter?
- Faster apps: High-speed inference means quicker responses in coding assistants, math tutors, and chatbots.
- Lower costs: Doing more work per GPU second reduces energy and cloud bills.
- New possibilities: Reliable parallel decoding opens the door to new features that are hard with one-token-at-a-time models.
- Fair comparisons: A standard framework helps researchers measure and improve diffusion models consistently.
Key takeaways
- Diffusion LLMs don’t have to be slow. With smart decoding, memory reuse, and GPU tricks, they can be very fast—sometimes faster than traditional models—even when handling one request at a time.
- dInfer’s four-part design (model, iteration manager, decoder, cache manager) makes it easy to plug in new ideas and test them fairly.
- New algorithms like hierarchical decoding, credit decoding, iteration smoothing, and vicinity KV-cache refresh make parallel decoding both safe and efficient.
- Trajectory Distillation trains diffusion models to “take bigger jumps,” increasing tokens-per-step and overall speed.
- The framework is open-source, so others can build on it: https://github.com/inclusionAI/dInfer
In short, dInfer shows that diffusion-based LLMs are not just a cool idea—they can be practical, fast, and ready for real-world use.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper—each item is stated concretely to guide future work.
- Generality across dLLM architectures: dInfer is evaluated only on LLaDA variants; test portability and performance on other diffusion LMs (e.g., different attention schemes, non-MoE models, larger/smaller sizes).
- Batch-size scalability: results are limited to batch size 1; measure throughput, latency, and GPU utilization under larger batch sizes and multi-tenant workloads.
- Long-context behavior: generation length is fixed (1024) with block size 64; characterize performance, stability, and accuracy for longer contexts (e.g., 4k–32k), including scaling laws and block-size sensitivity.
- Hyperparameter sensitivity: thresholds, smoothing weights, and cache windows are hand-picked; provide systematic sensitivity analyses and practical tuning guidelines across tasks.
- Iteration smoothing validity: no formal analysis of distribution shift or convergence; quantify when logits-to-embedding fusion harms or helps, and derive principled schedules for α and decode-threshold decay.
- Early-commit risks in hierarchical/credit decoding: investigate failure modes where premature commitments introduce irreversible errors; design rollback/error-correction mechanisms and measure semantic consistency impacts.
- KV-cache vicinity refresh policy: window sizes are fixed; develop adaptive or confidence-based refresh strategies, and characterize accuracy–speed trade-offs with theoretical or empirical bounds on staleness error.
- EOS early termination reliability: assess false-EOS detection and its effect on incomplete outputs (multi-part code, long reasoning, multi-section instructions); propose safeguards and validation checks.
- System-level ablations: isolate and quantify the contribution of torch.compile, CUDA Graphs, loop unrolling, TP, and EP individually across benchmarks and hardware.
- Memory and energy efficiency: report peak VRAM, memory fragmentation, and energy/cost per token; evaluate feasibility on consumer GPUs and under constrained memory budgets.
- Broader task coverage: expand beyond code/math/instruction to general NLG (summarization, translation, open-ended chat), multilingual settings, and safety/bias/toxicity benchmarks.
- Latency metrics and streaming: provide time-to-first-token, time-to-last-token, tail-latency distributions, and streaming behavior analyses for interactive applications.
- Fair comparison protocols: comparisons use different architectures (MoE vs dense) and dev toolchains; establish matched-parameter baselines, cross-hardware replication, and uniform software stacks.
- Trajectory Distillation/Compression scope: quantify domain generalization beyond math-heavy datasets, training cost, data requirements, and potential overfitting or degradation on non-math tasks.
- Compatibility with quantization/sparsity: evaluate FP8/INT8/INT4 quantization, structured sparsity, and memory-saving techniques; determine their interplay with bidirectional attention and caching.
- Robustness to noisy/adversarial inputs: test stability of parallel decoding under perturbations, ambiguous prompts, and long multi-step reasoning; define robustness metrics and mitigation strategies.
- Multi-turn dialogue and tool use: specify mechanisms for maintaining conversation state, tool outputs, and external memory under bidirectional attention and cache refreshes; evaluate in agentic workflows.
- Standardizing TPS/TPF reporting: propose a reproducible benchmark suite (fixed seeds, prompts, lengths, hardware, software versions) with public logs to avoid apples-to-oranges comparisons (a minimal measurement sketch follows this list).
- Adaptive iteration scheduling: explore dynamic block sizing and step scheduling driven by confidence or uncertainty to optimize quality–speed trade-offs per input.
- Cross-node and multi-node scaling: evaluate performance on multi-node clusters, interconnect bottlenecks, and commodity hardware; characterize scaling efficiency and communication overheads.
- Theoretical grounding: develop formal criteria for decoding thresholds, smoothing schedules, and convergence guarantees in diffusion-based language generation.
- Cache-error detection and recovery: design mechanisms to detect staleness-induced errors (e.g., confidence drift) and trigger selective re-computation or targeted refreshes.
- Expert parallelism effects: analyze how EP impacts token quality and routing stability at batch size 1, including potential expert collapse or oscillations under rapid iterations.
- Safety and alignment: measure how aggressive parallel decoding affects jailbreak susceptibility and harmful outputs; integrate safety filters or alignment-aware decoding rules.
- Framework extensibility: document API stability and integration paths with RLHF/preference optimization pipelines and common inference servers (e.g., Triton, Ray Serve, Kubernetes autoscaling).
- Detailed ablations for decoding algorithms: beyond average metrics, report per-task breakdowns of error types, commit rates, and confidence dynamics to diagnose where each method helps or harms.
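For the standardized TPS/TPF reporting point above, a minimal sketch of how both metrics could be logged per sequence. The `generate_fn` return convention (tokens plus forward-pass count) is an assumption that such a benchmark suite would need to pin down explicitly.

```python
import time

def measure(generate_fn, prompt, gen_len):
    """Run one generation and report TPS (tokens per second) and
    TPF (tokens per forward, i.e. per diffusion iteration) per sequence."""
    start = time.perf_counter()
    tokens, num_forwards = generate_fn(prompt, gen_len)  # assumed return values
    elapsed = time.perf_counter() - start
    num_generated = len(tokens) - len(prompt)            # exclude the prompt
    return {
        "tps": num_generated / elapsed,
        "tpf": num_generated / max(num_forwards, 1),
        "elapsed_s": elapsed,
        "num_forwards": num_forwards,
    }
```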
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed now, organized by sector and supported by the paper’s concrete methods (hierarchical/credit decoding, iteration smoothing, vicinity KV-cache refresh, tensor/expert parallelism, CUDA Graphs, loop unrolling, EOS early-termination) and the open-source dInfer framework.
- Software/AI Infrastructure
- Low-latency dLLM serving at batch size 1
- Use case: Replace or complement AR LLM endpoints (e.g., chat, agents) with dLLM endpoints powered by dInfer to cut single-user latency and GPU costs.
- Tools/workflows: Integrate dInfer into existing serving stacks (vLLM backend, TorchServe/Triton, Ray); turn on TP+EP, torch.compile, CUDA Graphs; enable loop-unrolling and EOS early termination.
- Assumptions/Dependencies: NVIDIA GPUs (e.g., H800/A100/H100) with recent CUDA; PyTorch 2.9+; dLLM availability (e.g., LLaDA-MoE/-Instruct); quality parity validated for target tasks.
- Parallel-decoding A/B testing without retraining
- Use case: Swap threshold, hierarchical, and credit decoding per route to maximize tokens per forward on different workloads.
- Tools/workflows: dInfer’s modular APIs; CI pipelines that test TPS/TPF and task accuracy; telemetry dashboards for “TPS per sequence” reporting.
- Assumptions/Dependencies: Task-specific hyperparameter tuning for decode thresholds and IterSmooth weights; monitoring to detect quality regressions at large parallel spans.
- Developer Tools (Code Assistants, IDE Plugins)
- Faster code generation and live debugging
- Use case: Interactive code completion and test generation with reduced waiting time, especially for long outputs or multi-step solutions.
- Tools/products: dLLM-backed code assistant services; IDE plugins that call dInfer endpoints; EOS early-termination to stop once a solution is produced.
- Assumptions/Dependencies: dLLM trained or adapted for code; stable accuracy on datasets like HumanEval/MBPP; GPU inference for real-time use.
- Education
- Real-time math and reasoning tutors
- Use case: Deploy dLLM tutors for step-by-step math reasoning and long-form explanations with lower latency.
- Tools/workflows: dInfer with iteration smoothing to stabilize intermediate steps; hierarchical decoding to commit tokens reliably per span.
- Assumptions/Dependencies: Domain tuning and evaluation on GSM8K-like data; safeguards for correctness and pedagogical quality.
- Enterprise AI (General)
- Cost-efficient agent loops and tool-use orchestrations
- Use case: Internal copilots, RAG pipelines, and workflow agents that issue many short, sequential calls benefit from dInfer’s speed at batch size 1.
- Tools/workflows: Endpoint migration for dLLM-powered routes; KV vicinity refresh to balance speed and quality on bidirectional attention.
- Assumptions/Dependencies: Task fits dLLM strengths; observability for latency/accuracy trade-offs; GPU scheduling tuned for TP/EP.
- Academia
- Standardized dLLM inference benchmarking
- Use case: Adopt “TPS per sequence” and “TPF” metrics to fairly compare algorithms and systems; use dInfer as a reproducible baseline.
- Tools/workflows: Public benchmarks (HumanEval, GSM8K, etc.); ablation-ready configurations; shared scripts for caching/decoding variants.
- Assumptions/Dependencies: Community agreement on reporting conventions (hardware, batch size, lengths); dataset licensing.
- Policy and Governance
- Procurement and transparency standards for AI inference
- Use case: Require vendors to report TPS per sequence at batch size 1, energy-per-token, and accuracy on standardized suites when bidding for AI services.
- Tools/workflows: Internal policy templates referencing dInfer-style metrics and configurations; reproducibility checks.
- Assumptions/Dependencies: Access to comparable hardware; independent verification procedures; alignment with sustainability targets.
- Daily Life (Power Users, Local AI)
- Faster local assistants for long-form tasks
- Use case: Users with high-end GPUs run dLLM assistants (writing, math, coding) with less lag using dInfer.
- Tools/workflows: dInfer docker/conda setups; pre-configured profiles for “single-user, low-latency.”
- Assumptions/Dependencies: Local NVIDIA GPU; familiarity with dLLM models; quality expectations aligned with open-source dLLMs.
Long-Term Applications
These applications require further research, scaling, productization, or broader ecosystem adoption (models, hardware, standards).
- General-Purpose dLLM Inference Engines
- Mainstream integration into production inference stacks
- Use case: First-class support for dLLMs in major engines (Triton, vLLM, Ray Serve) with native TP+EP and CUDA Graph orchestration.
- Potential products: “dLLM-ready” serving runtimes; autoscaling schedulers optimized for iterative denoising loops.
- Assumptions/Dependencies: Widespread dLLM adoption; robust APIs across vendors; mature ops/tooling for cache refresh policies.
- Training Pipelines Using Trajectory Distillation (Trajectory Compression)
- Systematic reduction of diffusion steps across domains
- Use case: Fine-tune domain-specific dLLMs (math, code, legal) to “jump” multiple steps via golden trajectory distillation, improving TPF/TPS.
- Tools/workflows: Data collection harnesses with external verifiers (e.g., math_verify); curriculum for compressed transitions; offline evaluation loops.
- Assumptions/Dependencies: High-quality verifier availability; scalable data generation; guardrails to prevent overfitting or shortcut reasoning.
- Edge and Embedded Deployment
- Real-time language reasoning at the edge
- Use case: On-device assistants in robotics, AR/VR, and mobile with reduced latency leveraging parallel decoding and early termination.
- Tools/products: Lightweight dLLM variants; hardware-aware decoding schedules; memory-aware KV vicinity refresh implementations.
- Assumptions/Dependencies: Smaller, efficient dLLMs; hardware accelerators beyond high-end GPUs; careful power/thermal management.
- Sector-Specific Solutions
- Healthcare
- Use case: Low-latency clinical summarization and decision support where responsiveness matters during consultations.
- Tools/workflows: dLLM endpoints with confidence tracking (credit decoding) and audit logs of intermediate states (iteration smoothing artifacts).
- Assumptions/Dependencies: Regulatory clearance; rigorous validation; bias/safety monitoring; domain-verified trajectories.
- Finance
- Use case: Real-time compliance assistants and report drafting under strict SLAs.
- Tools/workflows: Parallel decoding tuned for long documents; EOS early termination to cut compute once policy criteria are met.
- Assumptions/Dependencies: Model robustness to domain-specific jargon; auditability; privacy and compliance constraints.
- Robotics
- Use case: Faster language-to-action planning in multi-step tasks via parallel refinements, reducing control-loop delays.
- Tools/workflows: Agent frameworks that exploit dLLM denoising iterations; integration with perception/action modules.
- Assumptions/Dependencies: Safety guarantees; deterministic control; on-device inference viability.
- Education
- Use case: Adaptive tutoring systems that generate multi-step solutions quickly while retaining accuracy via trajectory-tuned models.
- Tools/workflows: Distilled trajectories aligned with curricula; per-learner latency/quality knobs (decode thresholds, IterSmooth).
- Assumptions/Dependencies: Content quality assurance; avoidance of shortcut learning; fairness across learners.
- Hardware–Software Co-Design
- Accelerators tailored to iterative denoising and bidirectional attention
- Use case: New kernels and scheduling primitives that eliminate stream bubbles and maximize EP/TP utilization at small batch sizes.
- Tools/products: Vendor-optimized libraries for dLLMs; firmware/runtime features to support loop unrolling and cache refresh semantics.
- Assumptions/Dependencies: Industry interest; standardized dLLM primitives; cross-vendor portability.
- Benchmarking and Governance Standards
- Industry-wide adoption of fair, reproducible inference metrics
- Use case: TPS per sequence at standardized lengths/hardware; disclosure of caching/decoding configs; energy-per-token reporting.
- Tools/workflows: Public leaderboards; certification processes; audit trails with dInfer-like config manifests.
- Assumptions/Dependencies: Community consensus; third-party auditing; versioning of models and toolchains.
Cross-Cutting Assumptions and Dependencies
- Model availability and fitness: dLLMs (e.g., LLaDA-MoE/-Instruct/-TD) must match task accuracy requirements; large parallel spans may require careful tuning to avoid quality drops.
- Hardware and toolchains: Benefits rely on modern NVIDIA GPUs, CUDA Graphs, PyTorch compile, and expert/tensor parallelism; portability to other hardware stacks may need additional engineering.
- Task profiles: Gains are highest for long outputs and multi-step reasoning; short or highly constrained tasks may see smaller relative improvements.
- Operational readiness: Observability (TPS/TPF, accuracy, energy), guardrails, and fallback routes (AR models) are prudent for production deployments.
Glossary
- Autoregressive (AR) LLMs: LLMs that generate output sequentially, one token at a time, conditioning each token on all previous ones. "Diffusion-based LLMs (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism."
- Bidirectional attention: An attention mechanism where each token can attend to all others (both past and future positions), allowing mutual influence across the sequence. "dLLMs employ bidirectional attention, meaning that decoding a single token can affect the representations of all tokens in the sequence."
- Blockwise caching: A training-free KV-cache strategy that reuses attention key/value states for blocks of tokens to reduce recomputation. "Earlier approaches introduced training-free strategies such as blockwise caching and Dual Cache \citep{wu2025fast}, which reuse KV states for decoded tokens or suffixes of masked tokens."
- Blockwise decoding: A decoding procedure that processes the sequence in fixed-size blocks, potentially enabling early stopping once completion conditions are met. "We also introduce an early termination mechanism for blockwise decoding."
- Blockwise diffusion iteration: A diffusion-based decoding approach that denoises fixed-size spans in each iteration. "The blockwise diffusion iteration algorithm performs decoding in fixed-size spans and serves as a baseline (Algorithm~\ref{alg:blockwise})."
- Causal attention: An attention pattern where each token can only attend to previous tokens, enabling KV reuse in autoregressive models. "In AR models, causal attention allows KV states to be computed once and reused; in dLLMs, however, token representations evolve across denoising steps, making static reuse infeasible."
- Credit decoding: A parallel decoding algorithm that accumulates historical confidence as “credits” to commit consistently stable tokens earlier. "Credit decoding (ours): accumulates historical confidence scores as credits and preferentially commits tokens with consistently stable predictions, improving reliability across iterations."
- CUDA Graphs: A CUDA feature that captures and replays sequences of GPU operations to reduce launch overhead and improve performance. "dInfer uses PyTorch's just-in-time (JIT) compiler torch.compile to fuse CUDA kernels and execute them within NVIDIA CUDA Graphs, thereby eliminating PyTorch execution overhead."
- CUDA stream bubbles: Idle gaps between consecutive GPU kernel launches on CUDA streams that cause underutilization and throughput loss. "Diffusion iterations can suffer from CUDA stream bubbles--idle gaps of consecutive kernel launches between diffusion iterations--which waste GPU cycles and reduce throughput."
- Denoising-based generation: A generative process that iteratively refines noisy sequences toward clean outputs, enabling parallel token updates. "Diffusion-based LLMs (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism."
- Diffusion-based LLMs (dLLMs): LLMs that generate text by iteratively denoising entire sequences rather than producing tokens sequentially. "Diffusion-based LLMs (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism."
- Diffusion iteration manager: The controller of iterative denoising that selects regions to decode, runs model passes, and tracks history for context. "The diffusion iteration manager acts as the controller of the iterative denoising process, with three main responsibilities: 1) determining the next region of tokens to decode, 2) interacting with the model to obtain outputs such as logits and hidden states, 3) maintaining historical predictions to provide a richer context for future decoding."
- Dual Cache: A KV-cache reuse method that caches two sets of key/value states (e.g., for decoded and masked segments) to reduce computation. "Earlier approaches introduced training-free strategies such as blockwise caching and Dual Cache \citep{wu2025fast}, which reuse KV states for decoded tokens or suffixes of masked tokens."
- End-of-sequence (EOS) token: A special token indicating sequence termination, enabling early stopping in generation. "Once an end-of-sequence (EOS) token is generated within a block, subsequent decoding steps on the remaining blocks become redundant."
- Expert parallelism (EP): A distributed inference/training scheme that parallelizes computation across experts in Mixture-of-Experts models. "Expert parallelism is applied to the LLaDA-MoE model and is effective even at a batch size of 1--unlike in AR models, where expert parallelism typically requires large batch sizes."
- Hierarchical decoding: A divide-and-conquer parallel decoding method that recursively splits masked spans to reduce local dependencies and ensure progress. "Hierarchical decoding (ours): recursively partitions masked spans, ensuring at least one token is decoded per region, thereby reducing local dependencies and improving efficiency."
- Iteration smoothing: A technique that carries information across denoising steps by fusing previous iteration representations into current embeddings. "The iteration smoothing algorithm improves upon this by retaining token representations from the previous iteration and fusing them with the next iteration’s embeddings."
- K and V states: The keys (K) and values (V) computed in attention layers, often cached to avoid recomputation. "During denoising, K and V states are recomputed for both masked tokens and their immediate neighbors; once a block is fully decoded, a full cache update ensures global consistency."
- KV-cache: A cache of attention keys and values used to speed up Transformer inference by reusing intermediate states. "A central challenge in dLLM inference is KV-cache incompatibility."
- Logits: The unnormalized model outputs (pre-softmax) representing token scores before conversion to probabilities. "interacting with the model to obtain outputs such as logits and hidden states"
- Loop unrolling: An execution optimization that restructures loops to minimize overhead and keep GPU pipelines busy. "To address this, dInfer applies a loop unrolling strategy that allows Python to launch CUDA kernels continuously without being blocked by stream synchronization."
- Tensor parallelism (TP): Model-parallel technique that splits tensor computations across GPUs to accelerate large models. "Tensor parallelism is applied to the linear layers preceding attention modules, distributing dense computations efficiently across multiple GPUs."
- Threshold decoding: A parallel decoding heuristic that commits tokens whose confidence exceeds a preset threshold. "Threshold decoding (from Fast-dLLM \citep{wu2025fast}): commits tokens whose confidence exceeds a preset threshold."
- Tokens per forward (TPF): A throughput metric measuring how many tokens are committed per diffusion iteration per sequence. "To evaluate the efficiency of the dInfer framework, we use tokens per forward (TPF) per sequence to measure the parallel decoding capability within a single diffusion iteration, and tokens per second (TPS) per sequence to assess overall inference efficiency."
- Tokens per second (TPS): A throughput metric measuring the number of generated tokens per second per sequence. "To evaluate the efficiency of the dInfer framework, we use tokens per forward (TPF) per sequence to measure the parallel decoding capability within a single diffusion iteration, and tokens per second (TPS) per sequence to assess overall inference efficiency."
- torch.compile: PyTorch’s JIT compilation API that fuses and optimizes kernels for faster execution. "dInfer uses PyTorch's just-in-time (JIT) compiler torch.compile to fuse CUDA kernels and execute them within NVIDIA CUDA Graphs, thereby eliminating PyTorch execution overhead."
- Trajectory Compression: A fine-tuning method that trains the model to “jump” multiple denoising steps by learning compressed transitions along high-quality trajectories. "we propose a novel second-stage fine-tuning method, termed Trajectory Compression, to explicitly reduce the number of required sampling steps."
- Trajectory Distillation: A post-training technique that uses effective generation trajectories to improve parallel decoding efficiency. "Specifically, we introduce Trajectory Distillation, applied to LLaDA-MoE, yielding the enhanced LLaDA-MoE-TD variant (see Section~\ref{sec:TD})."
- vLLM: A high-performance LLM inference engine used as the backend for execution and parallelism. "dInfer builds on vLLM’s backend to exploit two complementary forms of parallelism."
- Vicinity KV-cache refresh: A cache update strategy that selectively recomputes K/V states for a local window near the current decoding block to reduce staleness. "To balance cost and performance, dInfer introduces vicinity KV-cache refresh."
