
X-Slim: Cache Accelerator for Diffusion Models

Updated 21 December 2025
  • X-Slim is a unified, training-free accelerator that leverages caching of redundant computations across time, structure, and space in diffusion models.
  • It uses a dual-threshold controller to skip entire denoising steps and selectively refresh blocks and tokens while maintaining high visual fidelity.
  • Empirical evaluations demonstrate speedups up to 4.97× with negligible quality loss across tasks, establishing a new Pareto frontier in diffusion acceleration.

X-Slim (eXtreme-Slimming Caching) is a unified, training-free, cache-based accelerator designed for diffusion model inference, which systematically exploits redundant computation across temporal, structural, and spatial axes. By dynamically controlling cache usage via context-aware indicators and a dual-threshold strategy, X-Slim achieves substantial speedups for generative models with negligible perceptual loss, effectively advancing the achievable speed–quality frontier in large-scale diffusion-based synthesis tasks (Wen et al., 14 Dec 2025).

1. Motivation and Background

Diffusion models involve iterative denoising over $T$ timesteps, each executing a deep stack of $L$ transformer or U-Net blocks on $N$ input tokens. Inference computation and latency scale as $\mathcal{O}(T \cdot L \cdot N)$. Primary cost drivers include: (a) temporal steps ($T \sim 50$ in contemporary settings); (b) structural depth ($L$ up to hundreds for high-fidelity tasks); and (c) spatial extent ($N \gg 1\mathrm{k}$ for high-resolution synthesis), with spatial FLOPs increasing quadratically in $N$ due to self-attention.

Prior acceleration approaches target individual axes, such as step-level skipping (aggressive, prone to quality loss), block-level selection (safer, but with smaller savings), or token-level refreshing. These realize only local optima, leaving significant redundancy untapped.

X-Slim’s key premise is that diffusion model features exhibit strong similarity, not only across adjacent timesteps but also within certain blocks and for spatial tokens (especially background regions). Exploiting these redundancies multidimensionally enables more aggressive acceleration while minimizing the risk of error accumulation and perceptual degradation (Wen et al., 14 Dec 2025).

2. Core Architecture and Mechanisms

X-Slim is structured around a dual-threshold controller implementing a “push-then-polish” principle:

  • Reuse is “pushed” aggressively at the timestep level until error reaches an early-warning boundary.
  • Thereafter, lightweight block- and token-level refresh policies “polish” remaining accumulation.
  • Upon crossing the critical error threshold, full inference is triggered to refresh all cached features, resetting error.

At each step $t$, cumulative reuse error is tracked via:

$$E(t) = \sum_{k = t_{\mathrm{calc}} + 1}^{t} \frac{\lVert \Delta_k - \Delta_{k-1} \rVert_1}{\lVert \Delta_{k-1} \rVert_1}$$

where $\Delta_k = O_k - I_k$ (the true feature change at step $k$), and $t_{\mathrm{calc}}$ denotes the last step with full computation. Thresholds $\delta_1$ ("early-warning") and $\delta_2$ ("critical") govern the transition between coarse skipping, partial refresh, and full inference.
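A minimal sketch of this error accumulator (illustrative names and a NumPy implementation; the paper's actual bookkeeping may differ):

```python
import numpy as np

def relative_change(delta_k, delta_prev, eps=1e-8):
    """Relative L1 change between consecutive feature deltas."""
    num = np.abs(delta_k - delta_prev).sum()
    den = np.abs(delta_prev).sum() + eps  # eps guards against division by zero
    return num / den

class ReuseErrorTracker:
    """Accumulates E(t) since the last fully computed step t_calc."""
    def __init__(self):
        self.E = 0.0

    def update(self, delta_k, delta_prev):
        self.E += relative_change(delta_k, delta_prev)
        return self.E

    def reset(self):
        """Called after a full-inference step refreshes all caches."""
        self.E = 0.0
```

The tracker is reset whenever $E(t)$ crosses the critical threshold and a full forward pass is executed.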

Temporal Level: Skipping entire denoising steps when $E(t) < \delta_1$ by reusing $\Delta_{t-1}$.

Structural (Block) Level: For block $l$, compute the per-block relative change,

$$e_t^{(l)} = \frac{\lVert O_t^{(l)} - I_t^{(l)} \rVert_1}{\lVert I_t^{(l)} \rVert_1}$$

and accumulate $C_t^{(l)}$ over skipped blocks. A block-level refresh is triggered when $C_t^{(l)} > \delta_\mathrm{blk}$.
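The block-level indicator can be sketched as follows (a hypothetical helper based on the description above; the accumulation scheme is an assumption):

```python
import numpy as np

def block_relative_change(O_l, I_l, eps=1e-8):
    """e_t^(l): relative L1 change of block l's output vs. its input."""
    return np.abs(O_l - I_l).sum() / (np.abs(I_l).sum() + eps)

def should_refresh_block(C_l, e_l, delta_blk):
    """Accumulate per-block change over skipped steps; refresh once the
    running total C_l exceeds the block threshold delta_blk."""
    C_l = C_l + e_l
    return C_l > delta_blk, C_l
```

A refreshed block recomputes its output and resets its accumulator, while all other blocks continue to reuse cached features.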

Spatial (Token) Level: For token $i$, compute

$$\mathrm{diff}_i(t) = \frac{1}{D} \sum_{d=1}^D \left| x_{i,t,d} - x_{i,t_{\mathrm{calc}},d} \right|$$

and recompute only the subset of tokens (top-$k$ by change) exceeding a threshold.
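Token selection per the formula above might look like this (a sketch; how the paper combines the threshold and the top-$k$ cut is an assumption):

```python
import numpy as np

def select_tokens_to_refresh(x_t, x_calc, k, threshold=0.0):
    """Return indices of up to k tokens whose mean absolute per-dimension
    change since t_calc exceeds `threshold`. x_t, x_calc have shape (N, D)."""
    diff = np.abs(x_t - x_calc).mean(axis=1)    # diff_i(t), shape (N,)
    candidates = np.where(diff > threshold)[0]  # tokens above the threshold
    if len(candidates) <= k:
        return candidates
    order = np.argsort(diff[candidates])[::-1]  # sort by change, descending
    return candidates[order[:k]]
```

Only the returned indices are fed through the network; the remaining (typically background) tokens keep their cached features.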

3. Mathematical Description and Threshold Calibration

Thresholds are set as $\tau_1 = \delta_1 = E_\mathrm{mid}$ and $\tau_2 = \delta_2 = E_\mathrm{max}$, where $E_\mathrm{mid}$ corresponds to the plateau of the U-curve of $\lVert \Delta_k - \Delta_{k-1} \rVert_1 / \lVert \Delta_{k-1} \rVert_1$ (empirically calibrated per model and task) and $\delta_2$ is the maximal tolerable error (see the ablation in the supplementary materials).

The relative ratio $r = \tau_1/\tau_2$ is tuned empirically; $r \approx 0.9$ yields a stable speed–quality trade-off. Error accumulates until $E(t) \geq \tau_2$, at which point full inference is run and the cache is reset.

Expected per-step cost is given by:

$$C_\mathrm{avg} = (1 - R_\mathrm{time})\, C_\mathrm{full} + R_\mathrm{time} \left[ (1 - R_\mathrm{block})\,\alpha C_0 + (1 - R_\mathrm{token})\,\beta C_0 \right]$$

where $R_\mathrm{time}$, $R_\mathrm{block}$, and $R_\mathrm{token}$ are the respective reuse ratios and $\alpha$, $\beta$ are per-unit costs for block and token refreshes. Speedup is $S \approx C_0 / C_\mathrm{avg}$.
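As a worked instance of this cost model (illustrative ratios, not numbers from the paper, assuming the full-step cost is normalized so that $C_\mathrm{full} = C_0 = 1$):

```python
def avg_step_cost(R_time, R_block, R_token, alpha, beta, C_full=1.0):
    """Expected per-step cost C_avg relative to a full step (C_0 = C_full)."""
    refresh_cost = (1 - R_block) * alpha * C_full + (1 - R_token) * beta * C_full
    return (1 - R_time) * C_full + R_time * refresh_cost

# Hypothetical ratios: 80% of steps reused; within those steps 70% of
# blocks and 80% of tokens reused, with unit refresh costs alpha = beta = 0.5.
C_avg = avg_step_cost(R_time=0.8, R_block=0.7, R_token=0.8, alpha=0.5, beta=0.5)
speedup = 1.0 / C_avg  # ~2.5x under these assumed ratios
```

Higher reuse ratios shrink $C_\mathrm{avg}$ and raise $S$, but only within the error corridor policed by $\tau_1$ and $\tau_2$.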

4. Algorithmic Realization

At each timestep:

  1. Accumulate error: $E \leftarrow E + \lVert \Delta_t - \Delta_{t-1} \rVert_1 / \lVert \Delta_{t-1} \rVert_1$.
  2. If $E < \tau_1$ (skip): $O_t \leftarrow I_t + \Delta_{t-1}$.
  3. If $E \geq \tau_2$ (full inference): $O_t \leftarrow \mathrm{Net}(I_t)$; reset all deltas and set $E \leftarrow 0$.
  4. Else (refresh): recompute selected blocks ($C_t^{(l)} > \delta_\mathrm{blk}$) and top-$k$ high-change tokens; reuse cache for all other units.
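The steps above can be sketched as a single control function (hypothetical interfaces; `net`, `refresh_fn`, the `state` dictionary, and how the error indicator is estimated without a full forward pass are all placeholders, not the paper's implementation):

```python
def xslim_step(net, I_t, state, tau1, tau2, refresh_fn):
    """One denoising step under the push-then-polish dual-threshold controller.
    state holds E (accumulated error), an error indicator for this step, and
    delta_prev (the last cached feature change)."""
    state["E"] += state["indicator"]       # step 1: accumulate reuse error
    if state["E"] < tau1:                  # step 2: skip via cache reuse
        O_t = I_t + state["delta_prev"]
    elif state["E"] >= tau2:               # step 3: full inference, reset
        O_t = net(I_t)
        state["delta_prev"] = O_t - I_t
        state["E"] = 0.0
    else:                                  # step 4: partial refresh ("polish")
        O_t = refresh_fn(I_t, state)       # recompute hot blocks/tokens only
        state["delta_prev"] = O_t - I_t
    return O_t
```

Repeated calls walk through the three regimes in order: skip while $E < \tau_1$, polish in the corridor $[\tau_1, \tau_2)$, and fully recompute once $E \geq \tau_2$.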

Below is a workflow summary:

| Mode | Trigger Condition | Action |
| --- | --- | --- |
| Skip Step | $E(t) < \tau_1$ | Step-level cache reuse |
| Refresh Step | $\tau_1 \leq E(t) < \tau_2$ | Recompute selected blocks/tokens |
| Full Inference | $E(t) \geq \tau_2$ | Run all layers, reset error/caches |

5. Empirical Results and Comparative Performance

Empirical evaluation across diverse generators:

  • On FLUX.1-dev (T = 50), X-Slim(C2F)-fast attains a 4.97× speedup at ImageReward = 0.9806 (Δ = –0.0080), surpassing TeaCache (3.24×) and TaylorSeer (2.35×).
  • For HunyuanVideo (video, 50 steps / 81 frames), it achieves 3.52× acceleration, a VBench score of 81.69%, and LPIPS = 0.1638.
  • DiT-XL/2 (ImageNet class→image): 3.13× acceleration, FID = 2.42, sFID = 4.59.

Observed effects:

  • X-Slim strictly dominates alternative caching methods across the latency–quality curve, establishing a new Pareto frontier.
  • Failure manifests as structural artifacts or blurring when full refresh is too infrequent, or when aggressive reuse is employed outside the robust error corridor (before reaching $\tau_1$ or after exceeding $\tau_2$).

6. Integration Guidelines and Usage Considerations

Key steps for effective deployment:

  • Calibrate thresholds using U-curve analysis of relative feature changes on a small validation set.
  • Set $\tau_2$ so that full computation occurs every 8–15 skipped steps; then select $\tau_1 = r \cdot \tau_2$ with $r \in [0.8, 0.95]$.
  • During inference, insert the dual-threshold controller into the sampling loop; maintain caches at block and token granularity.
  • For refresh steps, set $\delta_\mathrm{blk}$ and the token refresh ratio to target roughly 80% cache reuse.
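A minimal calibration sketch following these guidelines (the function name, the mean-change heuristic, and the defaults are assumptions; in practice the per-step relative changes would come from a profiling pass on a small validation set):

```python
def calibrate_thresholds(rel_changes, steps_per_full=10, r=0.9):
    """Derive (tau1, tau2) from profiled per-step relative changes.
    rel_changes: list of ||D_k - D_{k-1}||_1 / ||D_{k-1}||_1 values.
    tau2 is set so error accumulates for ~steps_per_full skipped steps
    before a full refresh; tau1 = r * tau2 opens the polish corridor."""
    mean_change = sum(rel_changes) / len(rel_changes)
    tau2 = steps_per_full * mean_change
    tau1 = r * tau2
    return tau1, tau2
```

With `steps_per_full` in the recommended 8–15 range and `r` in [0.8, 0.95], the resulting thresholds place full refreshes at the stated cadence.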

The X-Slim procedure is plug-and-play and requires no model retraining or fine-tuning.

7. Context and Significance

X-Slim introduces the first framework to exploit cacheable redundancy jointly across time, structure, and space in diffusion models. By combining aggressive timestep skipping with targeted structural and spatial refreshes, and by providing practical, empirical guidance for parameter selection, X-Slim delivers up to 4.97× acceleration for image generation, 3.52× for video, and 3.13× in class→image tasks, all while maintaining high perceptual fidelity (Wen et al., 14 Dec 2025). This approach represents a rigorous advancement of the diffusion acceleration landscape, with broad applicability to transformer-based generative architectures.
