Diffusion Caching for Fast Inference

Updated 28 November 2025
  • Diffusion caching is a method that accelerates inference by reusing cached intermediate representations, minimizing redundant neural network computations.
  • It employs static, dynamic, and hybrid strategies to achieve real-world speedups of 2–6× while controlling quality degradation.
  • The approach supports various architectures—including U-Net, Vision Transformer, and multimodal models—through techniques like error correction and forecasting.

Diffusion caching is a suite of inference-time acceleration strategies for diffusion-based generative models that systematically exploit redundant computations across reverse denoising steps by caching and reusing intermediate representations or latent states. Unlike model distillation or architectural modifications, diffusion caching operates without retraining and is compatible with the dominant diffusion architectures, including U-Net, Vision Transformer (DiT), and diffusion-based language or multimodal models. By storing intermediate feature maps, tokens, or hidden states and selectively reusing them in subsequent timesteps, these methods significantly reduce FLOPs and wall-clock latency, often with negligible or controllable degradation in output quality.

1. Theoretical Foundations and Taxonomy

Diffusion caching exploits the temporal and sometimes spatial (token/channel/frequency) redundancy intrinsic to the multi-step inference trajectory of a diffusion model. In canonical diffusion sampling, each denoising step applies a large neural network to predict the update (e.g., noise residual or velocity) for the latent state $x_t$; adjacent steps typically produce highly similar activations, especially at intermediate timesteps where the Markovian trajectory exhibits near-linear evolution in feature space (Liu et al., 22 Oct 2025). Caching carefully chosen intermediate states, operations, or features therefore reduces the effective number of expensive forward computations.
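This redundancy is straightforward to verify empirically. A minimal PyTorch-style sketch is shown below; `model`, `denoise_step`, and the chosen `block` are placeholders rather than names from any cited work. It records one block's output at every denoising step and reports the cosine similarity between consecutive steps:

```python
import torch

@torch.no_grad()
def adjacent_step_similarity(model, block, x_T, timesteps):
    """Run the sampler once, record `block`'s output at each step, and return
    cosine similarities between activations of consecutive steps."""
    feats = []
    handle = block.register_forward_hook(
        # assumes the block returns a single tensor
        lambda _module, _inputs, output: feats.append(output.detach().flatten())
    )
    x = x_T
    for t in timesteps:              # standard reverse-denoising loop
        eps = model(x, t)            # full forward pass; the hook records the feature
        x = denoise_step(x, eps, t)  # assumed scheduler update (e.g., a DDIM step)
    handle.remove()
    return [
        torch.nn.functional.cosine_similarity(a, b, dim=0).item()
        for a, b in zip(feats[:-1], feats[1:])
    ]                                # values near 1.0 indicate reusable features
```

Similarities close to 1.0 at intermediate timesteps are precisely the redundancy that the methods below convert into skipped computation.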

Caching designs can be classified along three axes (Liu et al., 22 Oct 2025); a small configuration sketch after the list makes them concrete:

  • Trigger Condition: Static (fixed-interval) vs. dynamic (similarity or error-threshold-based).
  • Reuse Granularity: Layer/block, token, feature-dimension, or frequency-band.
  • Update/Refresh Policy: Regular, adaptive, or predictive/forecasting (e.g., using Taylor or ODE solvers).
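To make these axes concrete, a caching scheme can be described by a small configuration object; the enum members below simply restate the categories above and are not taken from any specific implementation:

```python
from dataclasses import dataclass
from enum import Enum

class Trigger(Enum):
    STATIC = "fixed-interval"
    DYNAMIC = "similarity- or error-threshold-based"

class Granularity(Enum):
    LAYER_OR_BLOCK = "layer/block"
    TOKEN = "token"
    FEATURE_DIM = "feature-dimension"
    FREQUENCY_BAND = "frequency-band"

class RefreshPolicy(Enum):
    REGULAR = "regular"
    ADAPTIVE = "adaptive"
    PREDICTIVE = "predictive/forecasting"  # e.g., Taylor- or ODE-solver-based

@dataclass
class CacheConfig:
    trigger: Trigger
    granularity: Granularity
    refresh: RefreshPolicy

# Example: a FORA-style scheme is static, block-level, and regularly refreshed.
fora_like = CacheConfig(Trigger.STATIC, Granularity.LAYER_OR_BLOCK, RefreshPolicy.REGULAR)
```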

This has yielded a rich taxonomy: static reuse (Selvaraju et al., 1 Jul 2024), adaptive layer- or token-wise reuse (Ma et al., 3 Jun 2024, Zou et al., 5 Oct 2024), block-level dynamic policies (Cui et al., 17 Sep 2025), predictive caching using learned or ODE-based extrapolation (Zheng et al., 22 Aug 2025, Zheng et al., 5 Oct 2025), and hybrid approaches combining multiple scheduling or forecast strategies (Bu et al., 24 Aug 2025, Zheng et al., 5 Oct 2025).

2. Core Algorithms and Implementations

At a high level, most diffusion caching implementations operate as follows (a generic sketch follows this list):

  • For selected units (layers, blocks, tokens, frequencies), feature outputs or latent vectors are cached at "full compute" steps.
  • At subsequent timesteps, the inference pass consults the cache and—according to a schedule, error threshold, or adaptive policy—either (a) reuses features directly, (b) performs a cheap correction or projection, (c) performs a forecast via multistep integration (e.g., Adams–Bashforth, Hermite, ODE solvers), or (d) reverts to full computation and refreshes the cache (Yu et al., 13 Apr 2025, Chen et al., 9 May 2025, Zheng et al., 22 Aug 2025, Liu et al., 9 Oct 2025).
  • Some methods, such as Learning-to-Cache (L2C) (Ma et al., 3 Jun 2024), train a per-step, per-layer schedule to minimize total approximation error subject to a budget, resulting in a static or input-invariant routing mask.
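A generic sketch of this control flow is given below, assuming a caller-supplied refresh schedule and a simple first-order forecast; the names are illustrative and none of the cited methods is reproduced verbatim:

```python
class LayerCache:
    """Caches one layer's output and decides, per step, between full compute,
    direct reuse, and a cheap linear forecast from the last two full computes."""

    def __init__(self):
        self.prev = None        # output at the most recent full-compute step
        self.prev_prev = None   # output at the full-compute step before that

    def __call__(self, layer_fn, x, step, should_refresh, forecast=False):
        if self.prev is None or should_refresh(step):
            out = layer_fn(x)                       # (d) full compute, refresh the cache
            self.prev_prev, self.prev = self.prev, out
            return out
        if forecast and self.prev_prev is not None:
            return 2 * self.prev - self.prev_prev   # (c) two-point linear extrapolation
        return self.prev                            # (a) direct reuse of the cached output
```

A static-interval policy corresponds to `should_refresh = lambda step: step % N == 0`; the adaptive policies of Section 3 replace this test with similarity or error measurements, and option (b) above would add a lightweight correction on top of the reused value.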

Recent advances explore token-wise (Zou et al., 5 Oct 2024), cluster-wise (Zheng et al., 12 Sep 2025), or frequency-aware (Liu et al., 9 Oct 2025) caching to further exploit redundancy. For instance, ToCa performs adaptive token masking based on self-/cross-attention and local frequency of prior cache use, while ClusCa performs k-means clustering at each full-compute step and computes only one representative per cluster, propagating the results efficiently to all cluster members.
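A simplified reading of the cluster-wise idea is sketched below; the helper names are hypothetical, and the clustering `assign` is assumed to come from the most recent full-compute step:

```python
import torch

def kmeans_assign(tokens, k, iters=5):
    """Tiny k-means over token features; returns a cluster id per token."""
    centers = tokens[torch.randperm(tokens.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for c in range(k):
            members = tokens[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(dim=0)
    return assign

def cluster_cached_block(block_fn, tokens, assign, k):
    """At a cached step, run the block on one representative token per cluster and
    propagate each representative's residual to its cluster members.
    Assumes every cluster is non-empty."""
    rep_idx = torch.stack([(assign == c).nonzero()[0, 0] for c in range(k)])
    reps = tokens[rep_idx]                 # (k, d): only k tokens are computed
    delta = block_fn(reps) - reps          # change the block induces on the representatives
    return tokens + delta[assign]          # broadcast each residual to the whole cluster
```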

Typical cache policies can be symbolized as

$$F^l(x_{t-k}) \approx \mathcal{C}^l_t \quad \text{if } k < N,$$

or, in ODE-motivated settings,

$$F_{k+1}^{l,\mathrm{cal}} = F^{l}_k + \frac{h}{2}\left(g_\theta(F^{l}_k, t_k) + g_\theta(\hat{F}^{l}_{k+1}, t_{k+1})\right),$$

where $\mathcal{C}^l_t$ is the cache and $g_\theta$ denotes the intrinsic temporal derivative of the feature-ODE over the hidden representation (Zheng et al., 22 Aug 2025).
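The forecast $\hat{F}^{l}_{k+1}$ in the expression above is itself obtained by a predictor step. A generic Heun-style predictor–corrector consistent with that formula is sketched below; `g_theta` stands in for whatever derivative estimate a given method uses (e.g., finite differences of recently cached features), and this is not the exact recipe of any single paper:

```python
def forecast_and_calibrate(F_k, t_k, t_k1, g_theta):
    """Heun-style predictor-corrector applied to a cached feature F_k."""
    h = t_k1 - t_k
    d_k = g_theta(F_k, t_k)              # derivative estimate at the cached step
    F_pred = F_k + h * d_k               # explicit Euler predictor -> \hat{F}_{k+1}
    d_k1 = g_theta(F_pred, t_k1)         # derivative estimate at the target step
    return F_k + 0.5 * h * (d_k + d_k1)  # trapezoidal (Heun) corrector
```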

3. Adaptive and Hybrid Caching Strategies

As static-interval caching (e.g., FORA, DeepCache (Selvaraju et al., 1 Jul 2024)) is limited by error accumulation at long intervals, recent work proposes several adaptive and hybrid methods:

  • Similarity-Thresholded Caching: Block or layer outputs are reused from the cache only when the relative L1/L2 deviation between successive activations falls below a threshold (Cui et al., 17 Sep 2025). The threshold can be time-dependent, allowing tighter intervals in early/late steps (see the sketch after this list).
  • Probe-Driven or Self-Adaptive Caching: Online shallow-layer probes measure feature change at each step and adapt cache schedules dynamically (e.g., DiCache (Bu et al., 24 Aug 2025)).
  • Forecast-and-Calibrate: Layer activations are viewed as evolving under an ODE, so multistep linear predictors or Hermite/BDF2 extrapolators can forecast reused features, with periodic calibration (Yu et al., 13 Apr 2025, Zheng et al., 22 Aug 2025, Zheng et al., 5 Oct 2025, Liu et al., 9 Oct 2025).
  • Hybrid or Dimensionwise Solvers: Different feature dimensions or clusters are assigned the most appropriate forecast method (Euler, high-order BDF, Adams–Bashforth, etc.) by offline profiling and clustering, yielding substantial stability at large cache intervals (Zheng et al., 5 Oct 2025).
  • Error Correction and Residual Alignment: Many approaches now attach lightweight linear or affine corrections (e.g., channel-wise scaling, decoupled error correction (Liu et al., 3 Mar 2025, Chen et al., 9 May 2025)) or blend multi-point cache trajectories to minimize local error (Bu et al., 24 Aug 2025).
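A minimal sketch of the similarity-thresholded variant is given below. The reuse decision is made from the change in the block's input, a cheap proxy for how much its output would change; the class name and the specific relative-L1 criterion are illustrative rather than drawn from any single paper:

```python
import torch

class DynamicBlockCache:
    """Reuses a block's cached output while its input has drifted less than a
    (possibly time-dependent) relative-L1 threshold since the last full compute."""

    def __init__(self, threshold):
        self.threshold = threshold     # callable: timestep -> float
        self.ref_input = None          # block input at the last full-compute step
        self.cached_out = None

    def __call__(self, block_fn, x, t):
        if self.ref_input is not None:
            rel_change = (x - self.ref_input).abs().mean() / (self.ref_input.abs().mean() + 1e-8)
            if rel_change < self.threshold(t):
                return self.cached_out              # reuse: skip the expensive block
        out = block_fn(x)                           # recompute and refresh
        self.ref_input, self.cached_out = x.detach(), out.detach()
        return out
```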

The table below summarizes exemplar approaches:

| Method | Main Granularity | Adaptive? | Correction/Forecast | Notable Results |
| --- | --- | --- | --- | --- |
| FORA | Layer (static-cycle) | No | None | 2–4.5× speedup, +0.4 FID (Selvaraju et al., 1 Jul 2024) |
| L2C | Layer, learned mask | Partly | Differentiable router | 1.74×, <0.01 FID (Ma et al., 3 Jun 2024) |
| ToCa | Token-wise, layerwise | Yes | Selection scores | 2.36×, near-zero ΔFID (Zou et al., 5 Oct 2024) |
| ClusCa | Cluster (spatial) | Yes | K-means propagation | 4.96× FLUX, +0.51% IR (Zheng et al., 12 Sep 2025) |
| FoCa | ODE-based (feature) | Yes | BDF2+Heun | 5.5–6.45×, ≤0.01 ImageReward (Zheng et al., 22 Aug 2025) |
| FreqCa | Frequency domain | Yes | Hermite interp./reuse | 6–7×, <2% ΔIR, –99% memory (Liu et al., 9 Oct 2025) |
| DiCache | Layer, probe-adaptive | Yes | Multi-residual align | 3.2×, +0.13 SSIM (Bu et al., 24 Aug 2025) |
| AB-Cache | Layer, ODE-multistep | No | Adams–Bashforth | 3×, –2.3 FID, video+image (Yu et al., 13 Apr 2025) |

4. Performance Evaluation and Trade-offs

Quantitative evaluation on benchmarks such as ImageNet (DiT-XL/2), FLUX.1-dev, Qwen-Image, PixArt-α, and multimodal video/image datasets demonstrates that diffusion caching can consistently achieve real-world speedups of 2–6× with minimal loss in output metrics (e.g., FID, LPIPS, SSIM, ImageReward). Dynamic and ODE-based/predictive methods (e.g., FoCa, HyCa, FreqCa) can sustain up to ~80% per-step reuse (~5–6× acceleration) with <1% quality degradation, whereas static uniform approaches (FORA, DeepCache) usually break down beyond 2–3× (Liu et al., 22 Oct 2025, Cui et al., 17 Sep 2025, Zheng et al., 5 Oct 2025).

Trade-offs are governed by the choice of cache interval, granularity, adaptive policy, and correction type:

  • Longer intervals or larger cache ratios yield higher acceleration but may lead to feature drift or blur, unless controlled by trajectory-aware or error-adaptive strategies (Zheng et al., 22 Aug 2025, Yu et al., 13 Apr 2025).
  • Fine-grained token, cluster, or frequency splitting yields further redundancy extraction, but increases scheduling and selection overhead (Zheng et al., 12 Sep 2025, Liu et al., 9 Oct 2025).
  • Combined approaches (caching + quantization + error correction) can further compound gains but require careful error tracking (see CacheQuant (Liu et al., 3 Mar 2025)).

5. Practical Implementation and System Integration

Diffusion caching is implemented with minimal framework or model changes. Feature caches typically store activations for selected layers (attention, MLP, or block output) and are indexed by timestep, layer, and, if relevant, token or cluster (Selvaraju et al., 1 Jul 2024, Zou et al., 5 Oct 2024, Zheng et al., 12 Sep 2025). Cache refresh and scheduling logic can be statically precomputed (as in L2C, ICC (Ma et al., 3 Jun 2024, Chen et al., 9 May 2025)) or dynamically updated via online probes and similarity checks (DiCache, BWCache (Bu et al., 24 Aug 2025, Cui et al., 17 Sep 2025)). Some approaches require a short pre-calibration or offline profiling phase to determine optimal solvers (HyCa, FreqCa) or cache tables (EB-Cache (Zou et al., 10 Mar 2025)).
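A bare-bones version of such a cache, keyed by layer, timestep, and an optional token or cluster identifier, is sketched below; it is purely illustrative, and real systems attach it to the model's block calls and add the scheduling logic described above:

```python
class FeatureCache:
    """Stores detached activations keyed by (layer_id, timestep, token_or_cluster)."""

    def __init__(self):
        self.store = {}

    def put(self, layer_id, t, feat, token_key=None):
        self.store[(layer_id, t, token_key)] = feat.detach()

    def get(self, layer_id, t, token_key=None):
        return self.store.get((layer_id, t, token_key))

    def nbytes(self):
        """Rough memory footprint of the cached activations, in bytes."""
        return sum(v.element_size() * v.nelement() for v in self.store.values())
```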

Memory overhead is modest, typically limited to one or two feature maps per block or (in compact schemes) a single cumulative residual per timestep (Liu et al., 9 Oct 2025). With aggressive cache granularity (token/cluster/frequency), total memory overhead can be reduced by >99% relative to naive per-layer caching.

System-level deployment (e.g., Nirvana (Agarwal et al., 2023)) benefits from LRU-style cache eviction, prompt-conditional retrieval, and per-prompt cache hit probability modeling.

6. Security, Privacy, and Vulnerabilities

Approximate or cross-prompt caching introduces new attack surfaces in cloud or multi-user diffusion serving (Sun et al., 28 Aug 2025). Confirmed exploits include:

  • Prompt Exfiltration: Attackers can probe latency and image similarities to recover cached prompt embeddings, reconstructing the prompt text or image.
  • Cache Poisoning: Adversaries can inject prompts that poison cache content, causing subsequent users sharing similar prompt embeddings to receive manipulated outputs (e.g., with attacker logos).
  • Covert Channels: By coordinated insertion/probing, two parties can transmit information via cache state and generation latency.

Empirical attacks on FLUX/SD3 achieved 97.8% covert-channel bit accuracy, prompt-reconstruction CLIP similarity of 0.75–0.81, and render rates of up to 60% for simple logo poisoning (Sun et al., 28 Aug 2025). Defense mechanisms include randomized cache selection, stricter content filters, per-user/session isolation, and cache noise injection.

7. Limitations, Open Challenges, and Research Directions

Current challenges include:

  • Memory/Compute Overhead: Full layer/token caching can add significant RAM requirements; recent methods (FreqCa (Liu et al., 9 Oct 2025)) address this via cumulative/frequency domain caching.
  • Quality Degradation at High Skip Ratios: Aggressive reuse or forecasting without adaptive error control leads to drift or perceptual artifacts.
  • Unified Theoretical Guarantees: Most systems rely on heuristics; deriving diffusion-theoretic error bounds for arbitrary caching/forecasting under varying noise schedules remains open (Liu et al., 22 Oct 2025).
  • Cross-Model Scheduling: Generalization of profiling-based or clusterwise solver assignment across tasks, data modalities, or architectures.
  • Integration with Other Accelerators: Combining caching with quantization, pruning, or fast samplers (e.g., DPM-Solver) can yield multiplicative speedups but demands advanced cost/error management (Liu et al., 3 Mar 2025, Liu et al., 22 Oct 2025).

Future research is focusing on adaptive cache intervals informed by online error signals, learning-to-cache or meta-cache schedulers, global memory-efficient buffer management, and extending diffusion caching to real-time multimodal, interactive, and large-scale distributed deployments. Systematic security and privacy assessments will also be increasingly critical as more commercial and open-source diffusion serving platforms adopt approximate or cross-user caching (Sun et al., 28 Aug 2025).
