X-Slim: Cache Accelerator for Diffusion Models

Updated 21 December 2025

X-Slim is a unified, training-free accelerator that leverages caching of redundant computations across time, structure, and space in diffusion models.
It uses a dual-threshold controller to skip entire denoising steps and selectively refresh blocks and tokens while maintaining high visual fidelity.
Empirical evaluations demonstrate speedups up to 4.97× with negligible quality loss across tasks, establishing a new Pareto frontier in diffusion acceleration.

X-Slim (eXtreme-Slimming Caching) is a unified, training-free, cache-based accelerator designed for diffusion model inference, which systematically exploits redundant computation across temporal, structural, and spatial axes. By dynamically controlling cache usage via context-aware indicators and a dual-threshold strategy, X-Slim achieves substantial speedups for generative models with negligible perceptual loss, effectively advancing the achievable speed–quality frontier in large-scale diffusion-based synthesis tasks (Wen et al., 14 Dec 2025).

1. Motivation and Background

Diffusion models involve iterative denoising over $T$ timesteps, each executing a deep stack of $L$ transformer or U-Net blocks on $N$ input tokens. Inference computation and latency scale as $\mathcal{O}(T \cdot L \cdot N)$ . Primary cost drivers include: (a) temporal steps ( $T \sim 50$ in contemporary settings); (b) structural depth ( $L$ up to hundreds for high-fidelity tasks); and (c) spatial extent ( $N \gg 1$ k for high-resolution synthesis), with spatial FLOPs increasing quadratically due to self-attention.

Prior acceleration approaches target singular axes, such as step-level skipping (aggressive, prone to quality loss), block-level selection (safer, less savings), or token-level refreshing. These realize only local optima, leaving significant redundancy untapped.

X-Slim’s key premise is that diffusion model features exhibit strong similarity, not only across adjacent timesteps but also within certain blocks and for spatial tokens (especially background regions). Exploiting these redundancies multidimensionally enables more aggressive acceleration while minimizing the risk of error accumulation and perceptual degradation (Wen et al., 14 Dec 2025).

2. Core Architecture and Mechanisms

X-Slim is structured around a dual-threshold controller implementing a “push-then-polish” principle:

Reuse is “pushed” aggressively at the timestep level until error reaches an early-warning boundary.
Thereafter, lightweight block- and token-level refresh policies “polish” remaining accumulation.
Upon crossing the critical error threshold, full inference is triggered to refresh all cached features, resetting error.

At each step $t$ , cumulative reuse error is tracked via:

$E(t) = \sum_{k = t_{\mathrm{calc}} + 1}^{t} \frac{\lVert \Delta_k - \Delta_{k-1} \rVert_1}{\lVert \Delta_{k-1} \rVert_1}$

where $\Delta_k = O_k - I_k$ (true feature change at step $k$ ), and $t_{\mathrm{calc}}$ denotes the last step with full computation. Thresholds $\delta_1$ (“early-warning”) and $\delta_2$ (“critical”) govern the transition between coarse skipping, partial refresh, and full inference.

Temporal Level: Skipping entire denoising steps when $E(t) < \delta_1$ by reusing $\Delta_{t-1}$ .

Structural (Block) Level: For block $l$ , compute per-block relative change,

$e_t^{(l)} = \frac{\lVert O_t^{(l)} - I_t^{(l)} \rVert_1}{\lVert I_t^{(l)} \rVert_1}$

and accumulate $C_t^{(l)}$ over skipped blocks. Block-level refresh is triggered when $C_t^{(l)} > \delta_\text{blk}$ .

Spatial (Token) Level: For token $i$ , compute

$\mathrm{diff}_i(t) = \frac{1}{D} \sum_{d=1}^D |x_{i, t, d} - x_{i, t_{\mathrm{calc}}, d}|$

and only recompute the subset of tokens (top- $k$ by change) exceeding a threshold.

3. Mathematical Description and Threshold Calibration

Thresholds are set as $\tau_1 = \delta_1 = E_\mathrm{mid}$ and $\tau_2 = \delta_2 = E_\mathrm{max}$ , where $E_\mathrm{mid}$ corresponds to the plateau of the U-curve for $\lVert \Delta_k - \Delta_{k-1} \rVert_1 / \lVert \Delta_{k-1} \rVert_1$ (empirically calibrated per model/task) and $\delta_2$ is the maximal tolerable error (see ablation in supplementary materials).

The relative ratio $r = \tau_1/\tau_2$ is tuned, empirically $r \approx 0.9$ yields stable speed–quality trade-off. Error is accumulated until $E(t) \geq \tau_2$ , at which point full inference is run and the cache is reset.

Expected per-step cost is given by:

$C_\mathrm{avg} = (1 - R_\mathrm{time}) C_\mathrm{full} + R_\mathrm{time} \left[ (1 - R_\mathrm{block})\alpha C_0 + (1 - R_\mathrm{token})\beta C_0 \right]$

where $R_\mathrm{time}$ , $R_\mathrm{block}$ , $R_\mathrm{token}$ are respective reuse ratios and $\alpha$ , $\beta$ are per-unit costs for block and token refreshes. Speedup is $S \approx C_0 / C_\mathrm{avg}$ .

4. Algorithmic Realization

At each timestep:

Accumulate error $E \leftarrow E + \lVert \Delta_t - \Delta_{t-1} \rVert_1 / \lVert \Delta_{t-1} \rVert_1$ .
If $E < \tau_1$ (skip): $O_t \leftarrow I_t + \Delta_{t-1}$ .
If $E \geq \tau_2$ (full inference): $O_t \leftarrow \text{Net}(I_t)$ ; reset all deltas and set $E \leftarrow 0$ .
Else (refresh): recompute select blocks ( $C_t^{(l)} > \delta_\mathrm{blk}$ ) and top- $k$ high-change tokens; reuse cache for all other units.

Below is a workflow summary:

Mode	Trigger Condition	Action
Skip Step	$E(t) < \tau_1$	Step-level cache reuse
Refresh Step	$\tau_1 \leq E(t)<\tau_2$	Recompute selected blocks/tokens
Full Inference	$E(t) \geq \tau_2$	Run all layers, reset error/caches

5. Empirical Results and Comparative Performance

Empirical evaluation across diverse generators:

On FLUX.1-dev (T=50), X-Slim(C2F)-fast attains a 4.97 $\times$ speedup at ImageReward $=$ 0.9806 (Δ $=$ –0.0080), surpassing TeaCache (3.24 $\times$ ) and TaylorSeer (2.35 $\times$ ).
For HunyuanVideo (video, 50 steps/81 frames), achieves 3.52 $\times$ acceleration, VBench score of 81.69%, LPIPS $=$ 0.1638.
DiT-XL/2 (ImageNet class $\rightarrow$ image): 3.13 $\times$ acceleration, FID $=$ 2.42, sFID $=$ 4.59.

Observed effects:

X-Slim strictly dominates alternative caching methods across the latency–quality curve, establishing a new Pareto frontier.
Failure manifests as structural artifacts or blurring when full refresh is too infrequent, or when aggressive reuse is employed outside the robust error corridor (before $\tau_1$ or after $\tau_2$ ).

6. Integration Guidelines and Usage Considerations

Key steps for effective deployment:

Calibrate thresholds using U-curve analysis of relative feature changes on a small validation set.
Set $\tau_2$ for full computation every 8–15 skipped steps; then select $\tau_1 = r \cdot \tau_2$ with $r\in[0.8, 0.95]$ .
During inference, insert the dual-threshold controller into the sampling loop; maintain caches at block and token granularity.
For refresh steps, set $\delta_\mathrm{blk}$ and token refresh ratio to target $\sim$ 80% cache reuse.

The X-Slim procedure is plug-and-play and requires no model retraining or fine-tuning.

7. Context and Significance

X-Slim introduces the first framework to exploit cacheable redundancy jointly across time, structure, and space in diffusion models. By combining aggressive timestep skipping with targeted structural and spatial refreshes, and by providing practical, empirical guidance for parameter selection, X-Slim delivers up to 4.97 $\times$ acceleration for image generation, 3.52 $\times$ for video, and 3.13 $\times$ in class $\rightarrow$ image tasks, all while maintaining high perceptual fidelity (Wen et al., 14 Dec 2025). This approach represents a rigorous advancement of the diffusion acceleration landscape, with broad applicability to transformer-based generative architectures.

PDF Markdown Chat (Pro)

References (1)

No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to X-Slim (eXtreme-Slimming Caching).