Fast-dLLM Caching Strategies

Updated 9 January 2026
  • Fast-dLLM Caching is a suite of caching techniques that combines blockwise, attention-guided, and hierarchical schemes to accelerate diffusion LLM inference.
  • It employs dynamic blockwise execution and adaptive cache eviction based on token saliency to reduce recomputation, yielding throughput improvements from 2.5× to 45×.
  • The strategies integrate both training-free and training-aware methods with hardware-adaptive multi-level storage across GPU, DRAM, and SSD for sustainable LLM deployment.

Fast-dLLM Caching is a set of algorithmic and architectural strategies designed to accelerate the inference of diffusion-based LLMs (dLLMs) through specialized key–value (KV) caching schemes, dynamic blockwise execution, attention-guided cache eviction, multi-level storage architectures, and hardware-adaptive policies. These methods address the fundamental challenge that conventional autoregressive caching analogues are inadequate for dLLMs due to bidirectional attention, blockwise parallel decoding, and distinctive semantic stability patterns. Fast-dLLM Caching encompasses both training-free and training-aware approaches, yielding throughput improvements from 2.5× to 45×, with minimal loss in generation quality, enabling practical, sustainable deployment of block-diffusion LLMs across diverse hardware and workload scenarios (Wu et al., 30 Sep 2025, Wu et al., 28 May 2025, Liu et al., 17 May 2025, Wei et al., 12 Jun 2025, Song et al., 4 Aug 2025, Kim et al., 24 Nov 2025, Huang et al., 10 Oct 2025, Bansal, 18 Dec 2025, Peng et al., 2024).

1. Blockwise Approximate KV Cache Foundations

Efficient dLLM inference relies on partitioning token sequences into contiguous blocks, where intra-block positions are decoded in parallel via discrete masked diffusion, but blocks are emitted left-to-right using an autoregressive-like schedule (Wu et al., 30 Sep 2025). Fast-dLLM Caching exploits the empirical near-invariance of K/V tensors across successive diffusion steps within each block, enabling blockwise reuse (Wu et al., 28 May 2025). For each block $k$, only local keys and values are recomputed; all other blocks' K/V states are cached from the last full pass. This yields the cache structure:

$$\tilde{K}_t^{(\ell)} = \operatorname{concat}_{i=1}^{K} \begin{cases} K_t^{(\ell),k} & i = k \\ K_{t_0}^{(\ell),i} & i \ne k \end{cases}, \qquad \tilde{V}_t^{(\ell)} \text{ analogously}$$

This achieves amortized step costs of $O(BNd)$ with $B \ll N$, yielding theoretical speedups of $N/B$ and empirical improvements of up to 27.6× (Wu et al., 28 May 2025, Wu et al., 30 Sep 2025).
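
A minimal sketch of this blockwise reuse pattern is given below; the tensor shapes, projection matrices, and block size are illustrative placeholders rather than the published implementation:

```python
import numpy as np

def blockwise_kv_step(x_block, k_cache, v_cache, block_idx, proj_k, proj_v):
    """One diffusion step for block `block_idx`: recompute K/V only for the
    active block and splice it into the cached K/V of all other blocks.

    k_cache, v_cache: lists of per-block K/V arrays frozen at the last full pass.
    proj_k, proj_v:   (d_model, d_head) projection matrices (illustrative).
    """
    # Recompute K/V for the active block only -- O(B * d) work per step.
    k_new = x_block @ proj_k
    v_new = x_block @ proj_v

    # Approximate global K/V: cached blocks plus the freshly recomputed block.
    k_full = np.concatenate(
        [k_new if i == block_idx else k for i, k in enumerate(k_cache)], axis=0)
    v_full = np.concatenate(
        [v_new if i == block_idx else v for i, v in enumerate(v_cache)], axis=0)
    return k_full, v_full, k_new, v_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, d_model, d_head, n_blocks = 32, 64, 16, 4          # toy sizes
    proj_k = rng.normal(size=(d_model, d_head))
    proj_v = rng.normal(size=(d_model, d_head))
    # Cached K/V for all blocks from the last full forward pass.
    k_cache = [rng.normal(size=(B, d_head)) for _ in range(n_blocks)]
    v_cache = [rng.normal(size=(B, d_head)) for _ in range(n_blocks)]
    x_block = rng.normal(size=(B, d_model))                # current-block activations
    k_full, v_full, _, _ = blockwise_kv_step(x_block, k_cache, v_cache, 2, proj_k, proj_v)
    print(k_full.shape)  # (n_blocks * B, d_head)
```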

2. Hierarchical and Sub-block Caching Strategies

The caching paradigm is further refined through hierarchical KV storage. Fast-dLLM v2 introduces a block-level cache, storing K/V activations from all finalized blocks, and a sub-block cache, enabling intra-block parallelism by caching stable K/V states for partially decoded sub-blocks (Wu et al., 30 Sep 2025). The block-level mechanism is formalized as:

$$\mathrm{cache\_K}[b,\ell] = K^{(b,\ell)}_{\text{cur}}, \qquad \mathrm{cache\_V}[b,\ell] = V^{(b,\ell)}_{\text{cur}}$$

During refinement, only changed sub-blocks are recomputed, yielding additional cost reductions. In aggregate, this hierarchical approach provides a 2.5× real-world speedup, with the sub-block cache contributing a further 10–20% performance gain at high concurrency (Wu et al., 30 Sep 2025).
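
The two-level structure can be pictured as stores keyed by (block, layer) and (block, sub-block, layer). The sketch below is an illustrative data layout under those assumptions, not Fast-dLLM v2's actual code:

```python
# Illustrative two-level cache: a block-level store for finalized blocks and a
# sub-block store for stable K/V of blocks that are still being refined.
class HierarchicalKVCache:
    def __init__(self):
        self.block_cache = {}      # (block, layer) -> (K, V) for finalized blocks
        self.subblock_cache = {}   # (block, sub_block, layer) -> (K, V) in progress

    def commit_block(self, block, layer, k, v):
        """Freeze a finalized block's K/V and drop its sub-block entries."""
        self.block_cache[(block, layer)] = (k, v)
        self.subblock_cache = {key: kv for key, kv in self.subblock_cache.items()
                               if not (key[0] == block and key[2] == layer)}

    def update_subblock(self, block, sub_block, layer, k, v):
        """Cache K/V for a sub-block whose tokens have stabilized."""
        self.subblock_cache[(block, sub_block, layer)] = (k, v)

    def lookup(self, block, sub_block, layer):
        """Prefer the finalized block entry; fall back to the sub-block entry."""
        return self.block_cache.get((block, layer)) \
            or self.subblock_cache.get((block, sub_block, layer))
```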

3. Attention-Guided, Dynamic, and Sparse Cache Eviction

Memory and compute budgets are protected through attention-aware and dynamic cache eviction. Sparse-dLLM and MaskKV use per-token attention saliency to identify and evict low-importance entries either in the prompt or in the active block (Song et al., 4 Aug 2025, Huang et al., 10 Oct 2025). The saliency metric for each token $j$ is:

$$s_t^j = \frac{1}{b} \sum_{i=1}^b \mathrm{Softmax}\left(\frac{Q_{b,i} K_j^T}{\sqrt{d_k}}\right)$$

Tokens below a dynamic retention threshold (e.g., outside the top-$r$ fraction) are evicted, shrinking cache size and computation to roughly a fraction $r$ of the full cache. MaskKV further introduces per-layer and per-head adaptive budgeting, distributing a fixed global budget via learned coefficients ($\alpha$, $\beta$) tuned for minimal performance loss even under 20× cache compression (Huang et al., 10 Oct 2025). Empirically, this yields 94% retention of full-cache scores, a 31× throughput improvement at 32k prompt length, and a 65% reduction in GPU memory (Huang et al., 10 Oct 2025).
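
A hedged sketch of saliency-based eviction following the formula above; the averaging over active-block queries and the `retention_ratio` default are illustrative choices, not the exact Sparse-dLLM or MaskKV rule:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def evict_by_saliency(q_block, k_cache, v_cache, retention_ratio=0.5):
    """Score cached tokens by average attention from the active block's queries
    and keep only the top `retention_ratio` fraction (illustrative thresholding)."""
    d_k = q_block.shape[-1]
    attn = softmax(q_block @ k_cache.T / np.sqrt(d_k), axis=-1)  # (b, N_cached)
    saliency = attn.mean(axis=0)                                  # s_t^j per cached token
    n_keep = max(1, int(round(retention_ratio * k_cache.shape[0])))
    keep = np.sort(np.argsort(saliency)[-n_keep:])                # preserve original order
    return k_cache[keep], v_cache[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q = rng.normal(size=(8, 16))       # b = 8 active-block queries
    K = rng.normal(size=(128, 16))     # 128 cached keys
    V = rng.normal(size=(128, 16))
    K_kept, V_kept, kept = evict_by_saliency(q, K, V, retention_ratio=0.5)
    print(K_kept.shape, kept[:5])
```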

4. Training-free and Consistency-aware Adaptive Caching

Training-free schemes (e.g., dLLM-Cache, AdaBlock-dLLM, Elastic-Cache) exploit structural properties of diffusion models without requiring retraining (Liu et al., 17 May 2025, Lu et al., 30 Sep 2025, Nguyen-Tri et al., 16 Oct 2025). dLLM-Cache partitions caching into a long-interval static prompt cache, refreshed every $K_p$ steps, and a short-interval adaptive response cache, in which only the fraction $\rho$ of most dynamic tokens, selected via cosine similarity, is refreshed per step (Liu et al., 17 May 2025). AdaBlock-dLLM dynamically adjusts block sizes at runtime by analyzing volatility bands in confidence scores, reducing cache and decoding overhead especially where semantic units are misaligned with fixed blocks (Lu et al., 30 Sep 2025).
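
The adaptive response-cache refresh can be sketched as follows, under assumed feature shapes and a simple selection rule: cosine similarity between current and cached token features picks the most-changed $\rho$ fraction for recomputation. This mirrors the dLLM-Cache idea but is not its exact implementation:

```python
import numpy as np

def select_tokens_to_refresh(features_new, features_cached, rho=0.25):
    """Pick the `rho` fraction of response tokens whose features drifted the most,
    measured by cosine similarity to the cached features (illustrative rule)."""
    num = (features_new * features_cached).sum(axis=-1)
    den = (np.linalg.norm(features_new, axis=-1) *
           np.linalg.norm(features_cached, axis=-1) + 1e-8)
    cos_sim = num / den                          # high = stable, low = dynamic
    n_refresh = max(1, int(round(rho * features_new.shape[0])))
    return np.argsort(cos_sim)[:n_refresh]       # least-similar tokens get recomputed

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    new = rng.normal(size=(64, 32))              # current-step token features
    cached = new + 0.1 * rng.normal(size=(64, 32))
    print(select_tokens_to_refresh(new, cached, rho=0.25))
```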

Training-aware approaches such as CDLM enforce block-wise causal masking during fine-tuning, making models compatible with KV caching and enabling multi-token jumps via a consistency loss (Kim et al., 24 Nov 2025). This shrinks the effective number of inference steps ($N_{\text{eff}} \ll L$), yielding 3.6×–14.5× lower latency at full accuracy, including on code and math tasks.
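
The block-wise causal masking that makes a fine-tuned model cache-compatible can be illustrated with a small mask construction; the fixed block size below is an assumption, and this is not the CDLM training recipe itself:

```python
import numpy as np

def block_causal_mask(seq_len, block_size):
    """Boolean attention mask: token i may attend to token j iff j's block is not
    later than i's block (full bidirectional attention inside each block)."""
    blocks = np.arange(seq_len) // block_size
    return blocks[:, None] >= blocks[None, :]

if __name__ == "__main__":
    mask = block_causal_mask(seq_len=8, block_size=4)
    print(mask.astype(int))
    # Rows 0-3 attend only within block 0; rows 4-7 attend to blocks 0 and 1.
```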

5. Hardware-Adaptive, Multi-level, and Sustainable Caching

Fast-dLLM Caching is operationalized across heterogeneous storage hierarchies. M²Cache demonstrates a disk-backed, mixed-precision adaptive cache spanning GPU HBM, host DRAM, and SSD (Peng et al., 2024). Neurons are ranked by importance via an offline score $S_i$, grouped by precision (FP16, INT8, INT4), and dynamically loaded into a per-layer LRU cache, as sketched after the list below. The system realizes:

  • Up to 7× token/s speedup on legacy GPUs
  • 5–10× reduction in DRAM-to-HBM traffic
  • 7× reduction in gCO₂ per request
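
As a rough illustration of the per-layer LRU loading described above, the toy cache below stands in for the fast tier, with a plain dictionary as the slower tier; the tier names, capacity, and precision grouping are placeholders rather than M²Cache's actual policy:

```python
from collections import OrderedDict

class TieredNeuronCache:
    """Toy per-layer LRU over fast slots; misses fall through to a slower tier.

    `capacity` stands in for the GPU-resident budget, `slow_store` for DRAM/SSD,
    and `precision_of` mimics grouping neurons by an offline importance score S_i.
    """
    def __init__(self, capacity, slow_store, precision_of):
        self.capacity = capacity
        self.fast = OrderedDict()          # neuron_id -> weights (LRU order)
        self.slow_store = slow_store       # neuron_id -> weights (backing tier)
        self.precision_of = precision_of   # neuron_id -> "fp16" | "int8" | "int4"

    def fetch(self, neuron_id):
        if neuron_id in self.fast:                       # hit: refresh recency
            self.fast.move_to_end(neuron_id)
            return self.fast[neuron_id]
        weights = self.slow_store[neuron_id]             # miss: pull from slower tier
        if len(self.fast) >= self.capacity:
            self.fast.popitem(last=False)                # evict least recently used
        self.fast[neuron_id] = weights
        return weights

# Usage: high-importance neurons kept at fp16, the long tail quantized lower.
store = {i: f"weights[{i}]" for i in range(10)}
prec = {i: "fp16" if i < 3 else ("int8" if i < 7 else "int4") for i in range(10)}
cache = TieredNeuronCache(capacity=4, slow_store=store, precision_of=prec)
print(cache.fetch(1), cache.fetch(9), prec[9])
```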

At the accelerator hardware/RTL level, DCO equips multi-core systems with predictive orchestration: dead-block prediction via tensor reuse counts, anti-thrashing priority scoring, and eviction-rate-tuned bypassing, enabling up to 45% LLC hit rates and 1.8× system speedup (Zhou et al., 8 Dec 2025).

6. Layer-wise and Semantic Caching for Transformer Variants

LLMCache generalizes Fast-dLLM concepts to transformers outside block diffusion by wrapping each layer with semantic fingerprinting and memoization banks (Bansal, 18 Dec 2025). Each input $X$ is summarized into a fingerprint $f_X$ and looked up via cosine similarity ($\mathrm{sim}(f_X, f') \ge \tau$); when a sufficiently similar entry exists, its cached hidden states are reused, accelerating repeated or near-duplicate queries. Adaptive staleness-aware and divergence-aware eviction mechanisms maintain output fidelity, and the framework consistently yields 2–3.1× decoding speedups at sub-0.5% accuracy loss.
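
A minimal sketch of fingerprint-based layer memoization, assuming a mean-pooled hidden-state fingerprint and a cosine threshold; LLMCache's actual fingerprinting and eviction rules may differ:

```python
import numpy as np

class LayerMemoBank:
    """Memoize a layer's output keyed by a semantic fingerprint of its input."""
    def __init__(self, tau=0.98):
        self.tau = tau
        self.fingerprints, self.outputs = [], []

    @staticmethod
    def fingerprint(x):
        f = x.mean(axis=0)                       # assumed: mean-pooled hidden states
        return f / (np.linalg.norm(f) + 1e-8)

    def lookup(self, x):
        f = self.fingerprint(x)
        for f_prev, out in zip(self.fingerprints, self.outputs):
            if float(f @ f_prev) >= self.tau:    # cosine similarity of unit vectors
                return out                       # reuse cached layer output
        return None

    def store(self, x, out):
        self.fingerprints.append(self.fingerprint(x))
        self.outputs.append(out)

def cached_layer(x, layer_fn, bank):
    out = bank.lookup(x)
    if out is None:
        out = layer_fn(x)
        bank.store(x, out)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    bank = LayerMemoBank(tau=0.98)
    layer = lambda x: np.tanh(x)                 # stand-in for a transformer layer
    x = rng.normal(size=(16, 8))
    y1 = cached_layer(x, layer, bank)            # miss: compute and store
    y2 = cached_layer(x + 1e-4, layer, bank)     # near-duplicate: served from the bank
    print(np.allclose(y1, y2))
```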

7. Empirical Performance and Implementation Guidelines

Integrated Fast-dLLM methods have been validated across GSM8K, LongBench, GPQA, HumanEval, HotpotQA, and code/data domains. Blockwise and hybrid caching routinely deliver 2.5–45× throughput increases, with memory budgets reduced by up to 20× (MaskKV) and dLLM inference latencies approaching autoregressive baselines (Wu et al., 30 Sep 2025, Wu et al., 28 May 2025, Huang et al., 10 Oct 2025, Liu et al., 17 May 2025).

Key deployment parameters:

| Parameter | Typical Range | Impact (Based on Data) |
|---|---|---|
| Block size $B$ | 16–64 | $N/B$ speedup; smaller $B$ yields better quality |
| Retention ratio $r$ | 0.5 | Halves memory/compute; empirically robust |
| Update ratio $\rho$ | 0.25 | 75% reduction in response update cost |
| Precision splits | FP16/INT8/INT4 | Up to 5–10× bandwidth reduction |
| Eviction policy | attention-guided / LRU | Attention-guided gives best quality at tight budgets |

For optimal results, practitioners should tune block size, cache retention ratio, adaptive budgeting parameters, and precision splits through context-specific validation. The hardware and storage hierarchy should be configured to minimize cache misses and to exploit dataflow-derived dead-block predictions.
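
One way to keep these knobs explicit is a single configuration object validated per deployment; the field names and defaults below simply mirror the table above and are not recommended values from any one paper:

```python
from dataclasses import dataclass

@dataclass
class DLLMCacheConfig:
    """Illustrative deployment knobs mirroring the parameter table above."""
    block_size: int = 32            # B: 16-64; smaller blocks favor quality
    retention_ratio: float = 0.5    # r: fraction of cached tokens kept after eviction
    update_ratio: float = 0.25      # rho: fraction of response tokens refreshed per step
    precision_splits: tuple = ("fp16", "int8", "int4")
    eviction_policy: str = "attn_guided"   # or "lru"

    def validate(self):
        assert self.block_size >= 1, "block size must be positive"
        assert 0.0 < self.retention_ratio <= 1.0
        assert 0.0 < self.update_ratio <= 1.0
        assert self.eviction_policy in {"attn_guided", "lru"}
        return self

cfg = DLLMCacheConfig(block_size=16, retention_ratio=0.5).validate()
print(cfg)
```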


In summary, Fast-dLLM Caching encompasses a spectrum of blockwise, attention-guided, hierarchical, and hardware-adaptive strategies for maximal acceleration of diffusion LLMs and transformer inference. It enables practical, sustainable, and scalable deployment of parallel language generation systems with modern memory subsystems, achieving state-of-the-art efficiency while preserving end-task quality (Wu et al., 30 Sep 2025, Wu et al., 28 May 2025, Liu et al., 17 May 2025, Peng et al., 2024, Huang et al., 10 Oct 2025, Bansal, 18 Dec 2025, Kim et al., 24 Nov 2025, Song et al., 4 Aug 2025, Zhou et al., 8 Dec 2025).
