X-Slim: Cache Accelerator for Diffusion Models
- X-Slim is a unified, training-free accelerator that leverages caching of redundant computations across time, structure, and space in diffusion models.
- It uses a dual-threshold controller to skip entire denoising steps and selectively refresh blocks and tokens while maintaining high visual fidelity.
- Empirical evaluations demonstrate speedups up to 4.97× with negligible quality loss across tasks, establishing a new Pareto frontier in diffusion acceleration.
X-Slim (eXtreme-Slimming Caching) is a unified, training-free, cache-based accelerator designed for diffusion model inference, which systematically exploits redundant computation across temporal, structural, and spatial axes. By dynamically controlling cache usage via context-aware indicators and a dual-threshold strategy, X-Slim achieves substantial speedups for generative models with negligible perceptual loss, effectively advancing the achievable speed–quality frontier in large-scale diffusion-based synthesis tasks (Wen et al., 14 Dec 2025).
1. Motivation and Background
Diffusion models involve iterative denoising over $T$ timesteps, each executing a deep stack of $L$ transformer or U-Net blocks on $N$ input tokens. Inference computation and latency therefore scale as $O(T \cdot L \cdot N^2)$. Primary cost drivers include: (a) temporal steps ($T$, e.g., 50 in contemporary settings); (b) structural depth ($L$, up to hundreds for high-fidelity tasks); and (c) spatial extent ($N$, reaching tens of thousands of tokens for high-resolution synthesis), with spatial FLOPs increasing quadratically in $N$ due to self-attention.
Prior acceleration approaches target a single axis: step-level skipping (aggressive, prone to quality loss), block-level selection (safer, smaller savings), or token-level refreshing. Each realizes only a local optimum, leaving significant redundancy untapped.
X-Slim’s key premise is that diffusion model features exhibit strong similarity, not only across adjacent timesteps but also within certain blocks and for spatial tokens (especially background regions). Exploiting these redundancies multidimensionally enables more aggressive acceleration while minimizing the risk of error accumulation and perceptual degradation (Wen et al., 14 Dec 2025).
2. Core Architecture and Mechanisms
X-Slim is structured around a dual-threshold controller implementing a “push-then-polish” principle:
- Reuse is “pushed” aggressively at the timestep level until error reaches an early-warning boundary.
- Thereafter, lightweight block- and token-level refresh policies “polish” remaining accumulation.
- Upon crossing the critical error threshold, full inference is triggered to refresh all cached features, resetting error.
At each step $t$, cumulative reuse error is tracked via:

$$E_t = \sum_{\tau = t_0 + 1}^{t} \delta_\tau,$$

where $\delta_\tau = \lVert F_\tau - F_{\tau-1} \rVert_1 / \lVert F_{\tau-1} \rVert_1$ (relative change of the true feature $F_\tau$ at step $\tau$), and $t_0$ denotes the last step with full computation. Thresholds $\epsilon_1$ ("early-warning") and $\epsilon_2$ ("critical"), with $\epsilon_1 < \epsilon_2$, govern the transition between coarse skipping, partial refresh, and full inference.
Temporal Level: Entire denoising steps are skipped while $E_t < \epsilon_1$, reusing the cached features $F_{t_0}$.
Structural (Block) Level: For block $l$, compute the per-block relative change

$$\delta_t^{(l)} = \frac{\lVert F_t^{(l)} - F_{t_0}^{(l)} \rVert_1}{\lVert F_{t_0}^{(l)} \rVert_1},$$

and accumulate it over skipped blocks. Once $E_t \ge \epsilon_1$, block-level refresh is triggered for the blocks with the largest accumulated change.
Spatial (Token) Level: For token $i$, compute

$$\delta_t^{(i)} = \frac{\lVert x_t^{(i)} - x_{t_0}^{(i)} \rVert_1}{\lVert x_{t_0}^{(i)} \rVert_1},$$

and recompute only the subset of tokens (top-$k$ by change) exceeding a threshold; all remaining tokens reuse cached values.
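The block- and token-level indicators can be sketched as follows; the function names, the L1 metric, and the `ratio` parameter are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def block_change(feat, cached, eps=1e-8):
    # per-block relative change delta^(l) against the cached block feature
    return np.abs(feat - cached).sum() / (np.abs(cached).sum() + eps)

def tokens_to_refresh(tokens, cached_tokens, ratio=0.2, eps=1e-8):
    """Return indices of the top-`ratio` fraction of tokens by relative change.

    tokens, cached_tokens: (N, D) arrays of current vs. cached token features.
    """
    change = np.abs(tokens - cached_tokens).sum(axis=1) / (
        np.abs(cached_tokens).sum(axis=1) + eps)
    k = max(1, int(ratio * len(change)))
    return np.argsort(change)[-k:]  # indices of the k largest changes
```

Only the returned indices are recomputed; all other tokens keep their cached values, which is where the spatial savings come from.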
3. Mathematical Description and Threshold Calibration
The early-warning threshold $\epsilon_1$ is placed at the plateau of the U-curve of reuse error versus threshold (empirically calibrated per model/task), and the critical threshold $\epsilon_2$ at the maximal tolerable error (see ablation in supplementary materials).
The ratio $\epsilon_2 / \epsilon_1$ is tuned empirically to yield a stable speed–quality trade-off. Error is accumulated until $E_t \ge \epsilon_2$, at which point full inference is run and the cache is reset.
Expected per-step cost is given by:

$$\mathbb{E}[C] = (1 - r_s)\, C_{\mathrm{full}} + r_s \big[ (1 - r_b)\, c_b + (1 - r_t)\, c_t \big],$$

where $r_s$, $r_b$, $r_t$ are the respective temporal, block, and token reuse ratios and $c_b$, $c_t$ are per-unit costs for block and token refreshes. Speedup is $S = C_{\mathrm{full}} / \mathbb{E}[C]$.
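Under this cost model, the attainable speedup can be estimated numerically; the following is a sketch with an illustrative parameterization (the reuse ratios and per-unit costs are placeholders, not reported values):

```python
def expected_speedup(r_s, r_b, r_t, c_b, c_t, c_full=1.0):
    """Expected speedup under the per-step cost model.

    r_s: fraction of steps served from cache (temporal reuse ratio)
    r_b, r_t: block- and token-level reuse ratios on cached steps
    c_b, c_t: per-unit refresh costs as fractions of a full step c_full
    """
    cost = (1 - r_s) * c_full + r_s * ((1 - r_b) * c_b + (1 - r_t) * c_t)
    return c_full / cost
```

For example, with 80% reuse on every axis and refresh units costing half a full step, the model predicts a speedup of roughly 2.8×.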
4. Algorithmic Realization
At each timestep:
- Accumulate error $E_t \leftarrow E_{t-1} + \delta_t$.
- If $E_t < \epsilon_1$ (skip): reuse cached features, $F_t \leftarrow F_{t_0}$.
- If $E_t \ge \epsilon_2$ (full inference): run all blocks on all tokens; reset all deltas and set $t_0 \leftarrow t$.
- Else (refresh): recompute the selected blocks (largest $\delta_t^{(l)}$) and the top-$k$ high-change tokens; reuse cache for all other units.
Below is a workflow summary:
| Mode | Trigger Condition | Action |
|---|---|---|
| Skip Step | $E_t < \epsilon_1$ | Step-level cache reuse |
| Refresh Step | $\epsilon_1 \le E_t < \epsilon_2$ | Recompute selected blocks/tokens |
| Full Inference | $E_t \ge \epsilon_2$ | Run all layers, reset error/caches |
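The three-mode decision above can be sketched as a single controller step; the three compute paths are passed in as callables, a hypothetical interface for illustration rather than the paper's API:

```python
def xslim_step(E, eps1, eps2, delta, full_compute, partial_refresh, reuse_cache):
    """One dual-threshold controller decision ("push-then-polish").

    E: accumulated reuse error before this step; delta: current relative change.
    full_compute / partial_refresh / reuse_cache: callables supplied by the
    host sampling loop (illustrative interface).
    Returns (output, updated error).
    """
    E += delta
    if E >= eps2:            # critical threshold: full inference, reset error
        out = full_compute()
        E = 0.0
    elif E >= eps1:          # early-warning corridor: polish blocks/tokens
        out = partial_refresh()
    else:                    # below early warning: skip, reuse cached features
        out = reuse_cache()
    return out, E
```

Keeping the controller stateless apart from `E` makes it easy to insert into an existing sampling loop without touching the model itself.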
5. Empirical Results and Comparative Performance
Empirical evaluation across diverse generators:
- On FLUX.1-dev ($T=50$), X-Slim(C2F)-fast attains a 4.97× speedup at ImageReward 0.9806 (Δ = –0.0080), surpassing TeaCache (3.24×) and TaylorSeer (2.35×).
- For HunyuanVideo (video, 50 steps / 81 frames), X-Slim achieves 3.52× acceleration with a VBench score of 81.69% and LPIPS of 0.1638.
- On DiT-XL/2 (ImageNet class-to-image): 3.13× acceleration, FID 2.42, sFID 4.59.
Observed effects:
- X-Slim strictly dominates alternative caching methods across the latency–quality curve, establishing a new Pareto frontier.
- Failure manifests as structural artifacts or blurring when full refresh is too infrequent, or when aggressive reuse is employed outside the robust error corridor defined by $\epsilon_1$ and $\epsilon_2$.
6. Integration Guidelines and Usage Considerations
Key steps for effective deployment:
- Calibrate thresholds using U-curve analysis of relative feature changes on a small validation set.
- Set $\epsilon_2$ so that full computation is triggered roughly every 8–15 skipped steps; then select $\epsilon_1$ as a fraction of $\epsilon_2$.
- During inference, insert the dual-threshold controller into the sampling loop; maintain caches at block and token granularity.
- For refresh steps, set the block selection and token refresh ratio to target approximately 80% cache reuse.
The X-Slim procedure is plug-and-play and requires no model retraining or fine-tuning.
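A simple calibration heuristic consistent with the guidance above can be sketched as follows; the mean-based estimate and the `ratio` hyperparameter are assumptions for illustration, not the paper's calibration procedure:

```python
import numpy as np

def calibrate_thresholds(deltas_per_step, skip_interval=10, ratio=2.0):
    """Pick (eps1, eps2) from relative changes measured on a validation set.

    deltas_per_step: per-step relative feature changes recorded with full
    computation on a small validation set.
    skip_interval: desired steps between full refreshes (8-15 per the text).
    ratio: assumed eps2/eps1 ratio (tunable hyperparameter).
    """
    d = np.asarray(deltas_per_step)
    # eps2 chosen so accumulated error crosses it about every skip_interval steps
    eps2 = float(d.mean() * skip_interval)
    eps1 = eps2 / ratio
    return eps1, eps2
```

In practice one would sweep `skip_interval` and `ratio` on the validation set and pick the point on the speed–quality curve that matches the deployment budget.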
7. Context and Significance
X-Slim introduces the first framework to exploit cacheable redundancy jointly across time, structure, and space in diffusion models. By combining aggressive timestep skipping with targeted structural and spatial refreshes, and by providing practical, empirical guidance for parameter selection, X-Slim delivers up to 4.97× acceleration for image generation, 3.52× for video, and 3.13× in class-to-image tasks, all while maintaining high perceptual fidelity (Wen et al., 14 Dec 2025). This approach represents a rigorous advancement of the diffusion acceleration landscape, with broad applicability to transformer-based generative architectures.