EasyCache: Adaptive Caching for Diffusion Models
- EasyCache is a training-free, runtime-adaptive caching framework that accelerates video diffusion model inference by reusing stable transformation vectors observed during denoising.
- It combines a caching module, a controller with cumulative error monitoring, and direct integration with the denoising loop to balance computational cost and output fidelity.
- Benchmarks demonstrate up to 3.3× speedup with improved PSNR, SSIM, and LPIPS, and competitive performance across diverse pipelines such as OpenSora, Wan2.1, and HunyuanVideo.
EasyCache is a training-free, runtime-adaptive caching framework designed to accelerate inference in video (and image) diffusion models, specifically targeting DiT-based architectures. By reusing transformation vectors during denoising steps that exhibit empirical stability, EasyCache substantially reduces computational load without requiring offline profiling, retraining, or architectural modification. The framework achieves 2.1–3.3× speedups over unaccelerated baselines while preserving or improving output fidelity, as measured by PSNR, SSIM, and LPIPS. It is model-agnostic and compatible with widely used pipelines such as OpenSora, Wan2.1, and HunyuanVideo (Zhou et al., 3 Jul 2025).
1. Core Architectural Components
EasyCache operates as an auxiliary module positioned between the diffusion scheduler (which generates the latent trajectory $\{x_t\}$) and the DiT denoiser $f_\theta$. Its three primary components are as follows:
- Caching Module: Stores the most recently computed transformation vector $\Delta_{\text{cache}} = v_{t^{*}} - x_{t^{*}}$, where $v_{t^{*}} = f_\theta(x_{t^{*}}, t^{*})$ is the denoiser output at the last fully computed step $t^{*}$, and retrieves it for subsequent reuse.
- Controller: Maintains step-wise stability indicators $E_t$ and their cumulative sum $S_t$, and determines, via a threshold $\tau$, whether to execute a full Transformer pass or reuse the cached vector.
- Integration with Denoising Loop: Implements logic whereby, at each timestep $t$, either the DiT is fully queried (updating the cache and resetting $S_t \leftarrow 0$) or the cached $\Delta_{\text{cache}}$ is applied to the current latent ($v_t \approx x_t + \Delta_{\text{cache}}$).
The first $R$ steps serve as a warm-up phase with full inference; caching is only enabled once the transformation rate empirically stabilizes.
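The runtime state these components maintain can be pictured with a minimal sketch (PyTorch-style; the class and field names, as well as the default values, are illustrative and not taken from the reference implementation):

```python
# Minimal state container mirroring the cache/controller described above.
# Illustrative only; the paper does not prescribe a particular implementation.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class EasyCacheState:
    tau: float = 0.05                            # error tolerance threshold (tau)
    warmup: int = 10                             # warm-up steps R with full inference
    delta_cache: Optional[torch.Tensor] = None   # cached transformation vector v - x
    err_sum: float = 0.0                         # cumulative error S since last full pass
    k_ref: Optional[float] = None                # transformation rate at last full pass

    def refresh(self, delta: torch.Tensor, rate: Optional[float]) -> None:
        """After a full DiT pass: store the new transformation vector and reset the error."""
        self.delta_cache = delta
        self.k_ref = rate
        self.err_sum = 0.0
```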
2. Runtime-Adaptive Caching Mechanism
At its core, EasyCache exploits the observation that, following an initial non-linear phase, the relative transformation rate

$$k_t = \frac{\lVert v_t - v_{t+1} \rVert}{\lVert x_t - x_{t+1} \rVert}, \qquad v_t = f_\theta(x_t, t),$$

remains nearly constant across denoising steps. This stability permits reuse of the previously computed transformation vector at every step where the cumulative estimated error remains below the threshold $\tau$.

The local stability indicator estimates the relative output change induced by cache reuse at step $t$ from the observed input change, under the stable-rate assumption:

$$E_t = \frac{k_{t^{*}} \,\lVert x_t - x_{t+1} \rVert}{\lVert v_{t^{*}} \rVert},$$

where $t^{*}$ denotes the most recent fully computed step. The cumulative error sum is then

$$S_t = \sum_{s=t}^{t^{*}-1} E_s.$$

Caching is performed when $S_t < \tau$. Otherwise, a full pass through $f_\theta$ is executed, the cache is refreshed, and $S_t$ is reset to zero.
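These quantities translate directly into a few small helpers (a sketch under the formulation above; the function names are ours, and the tensors stand for the flattened latent and model output):

```python
import torch

def transformation_rate(v_t: torch.Tensor, v_prev: torch.Tensor,
                        x_t: torch.Tensor, x_prev: torch.Tensor,
                        eps: float = 1e-8) -> float:
    """k_t: ratio of the output change to the input change between consecutive steps."""
    return ((v_t - v_prev).norm() / ((x_t - x_prev).norm() + eps)).item()

def local_error(k_ref: float, x_t: torch.Tensor, x_prev: torch.Tensor,
                v_ref: torch.Tensor, eps: float = 1e-8) -> float:
    """E_t: estimated relative output change at a reuse step, from the observed input change."""
    return k_ref * ((x_t - x_prev).norm() / (v_ref.norm() + eps)).item()

def should_reuse(err_sum: float, e_t: float, tau: float) -> tuple[bool, float]:
    """Accumulate E_t into S_t and test the caching criterion S_t < tau."""
    err_sum += e_t
    return err_sum < tau, err_sum
```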
3. Algorithmic Workflow
The high-level pseudocode for EasyCache's runtime loop is as follows (symbols as defined above):

```
Input:  diffusion model f_θ, total steps T, tolerance τ, warm-up length R, prompt c
Initialize S ← 0, Δ_cache ← ∅
Sample x_T ~ N(0, I)
For t = T down to 1:
    Record the input change ‖x_t − x_{t+1}‖ (for t < T)
    If (t > T − R) or (Δ_cache = ∅) or (S ≥ τ):
        v_t ← f_θ(x_t, t, c)                         // full pass
        Δ_cache ← v_t − x_t,  S ← 0,  (v*, x*) ← (v_t, x_t)
    Else:
        v_t ← x_t + Δ_cache                          // cache reuse
        estimate E_t ≈ k · ‖x_t − x_{t+1}‖ / ‖v*‖
        S ← S + E_t
    Update x_{t−1} via the scheduler step with v_t
    Recompute the transformation-rate estimate k when consecutive outputs are available
Return final video frames
```
This scheme minimizes the number of expensive Transformer passes, with cache reuse adaptively governed by an online estimate of the accumulated approximation error.
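A PyTorch-style sketch of such a loop is shown below. The `denoise_fn` and `scheduler` interfaces are placeholders, the error estimator follows the accumulation rule from Section 2, and none of the names are taken from the authors' released code:

```python
import torch

def easycache_sampling(denoise_fn, scheduler, x_T, num_steps, tau=0.05, warmup=10):
    """Runtime-adaptive caching loop (illustrative sketch, not the authors' code).

    Assumed interfaces:
      denoise_fn(x, t) -> model output v_t        (full DiT forward pass)
      scheduler.timesteps                          (iterable of timesteps, high to low)
      scheduler.step(v, t, x) -> next latent       (placeholder scheduler update)
    """
    x = x_T
    delta_cache = None            # cached transformation vector v - x
    k_ref, v_ref = None, None     # transformation rate / output at last full pass
    err_sum = 0.0                 # cumulative error S since the last full pass
    prev_x, prev_v = None, None   # previous step's input/output, for the rate

    for i, t in enumerate(scheduler.timesteps[:num_steps]):
        need_full = (i < warmup) or (delta_cache is None) or (err_sum >= tau)
        if need_full:
            v = denoise_fn(x, t)                       # full Transformer pass
            if prev_x is not None:                     # update the transformation rate
                k_ref = ((v - prev_v).norm() / ((x - prev_x).norm() + 1e-8)).item()
            delta_cache, v_ref, err_sum = v - x, v, 0.0
        else:
            v = x + delta_cache                        # cache reuse: apply stored vector
            # E_t estimated from the observed input change under the stable-rate assumption
            e_t = (k_ref or 0.0) * ((x - prev_x).norm() / (v_ref.norm() + 1e-8)).item()
            err_sum += e_t
        prev_x, prev_v = x, v
        x = scheduler.step(v, t, x)                    # scheduler update to the next latent
    return x
```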
4. Quantitative Performance and Benchmarking
Empirical evaluation on multiple DiT-based video generation pipelines demonstrates significant reductions in latency and improvements in visual quality metrics compared to both unoptimized and prior-caching baselines. The following summarizes selected results:
| Model/Method | Latency (s) ↓ | Speedup ↑ | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|
| Open-Sora (T=30) | 44.90 | 1.00 | — | — | — |
| + TeaCache | 28.92 | 1.55 | 23.56 | 0.8433 | 0.1318 |
| + EasyCache | 21.21 | 2.12 | 23.95 | 0.8556 | 0.1235 |
| Wan2.1-1.3B (T=50) | 175.35 | 1.00 | — | — | — |
| + TeaCache | 87.77 | 2.00 | 22.57 | 0.8057 | 0.1277 |
| + EasyCache | 69.11 | 2.54 | 25.24 | 0.8337 | 0.0952 |
| HunyuanVideo (T=50) | 1124.3 | 1.00 | — | — | — |
| + TeaCache | 674.04 | 1.67 | 23.85 | 0.8185 | 0.1730 |
| + SVG (sparse attn.) | 802.70 | 1.40 | 26.57 | 0.8596 | 0.1368 |
| + EasyCache | 507.97 | 2.21 | 32.66 | 0.9313 | 0.0533 |
On FLUX.1-dev text-to-image pipelines (50 steps), EasyCache achieves a 4.64× speedup (vs. TeaCache’s 3.27×) with better FID and CLIP Score.
Across experiments, EasyCache consistently outperforms training-free baselines such as TeaCache, PAB, step-reduction, and static cache methods in both efficiency and fidelity metrics (Zhou et al., 3 Jul 2025).
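These end-to-end figures can be related to the fraction of retained full passes with a simple cost model (a back-of-envelope illustration, not a statistic reported in the paper): if a fraction $p$ of the $T$ steps still requires a full DiT pass of cost $C$ and cache-reuse steps are essentially free, then

$$\text{speedup} \;\approx\; \frac{T\,C}{p\,T\,C} \;=\; \frac{1}{p},$$

so, under this idealization, a 2.21× speedup on HunyuanVideo corresponds to roughly 45% of the steps being computed in full.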
5. Hyperparameter Sensitivity and Ablation Findings
Extensive ablations on Wan2.1-1.3B reveal the sensitivity to key hyperparameters and design variants:
- Tolerance Threshold $\tau$: Lower $\tau$ yields higher fidelity but less acceleration (e.g., a small threshold gives a 1.61× speedup at PSNR = 30.73 dB, while a larger one enables 3.09× but with PSNR = 21.67 dB).
- Warm-up Duration $R$: Optimal performance is achieved with a brief warm-up of full-inference steps; an excessively short warm-up degrades fidelity, while an excessively long one sacrifices speedup.
- Caching Criteria: Output-relative and probabilistic reuse achieve marginally worse fidelity-speed tradeoffs compared to EasyCache's default error-accumulation strategy.
- Transformation Rate Update: Local updates provide optimal tradeoff; alternatives with global averaging or exponential moving average (EMA) underperform either in speed or fidelity.
These results confirm that runtime-adaptive control via the cumulative error $S_t$, a brief warm-up for stabilization, and interval-wise computation yield the best balance between speedup and result quality (Zhou et al., 3 Jul 2025).
6. Implementation Considerations and Extensions
EasyCache is model-agnostic and integrates by wrapping DiT denoiser invocations within the diffusion loop. The required memory overhead is minimal: one latent-size vector plus a small number of scalars per sequence. Computational overhead is negligible, since the required operations are limited to vector norms, additions, and scalar summation.
No retraining, offline profiling, or architectural changes are necessary. Tuning is limited to the tolerance threshold $\tau$ (typically 2–10%) and the warm-up length $R$ (5–15 steps). The method can be composed with other per-step speedups such as SVG (sparse attention); on HunyuanVideo, SVG and EasyCache combined yield up to 3.33× overall speedup with only a 1.1% PSNR drop.
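One way to picture this drop-in integration is as a wrapper around the pipeline's denoiser callable, leaving the sampling loop and scheduler untouched (an assumed interface, not the authors' API; `denoise_fn` stands for whatever callable performs the DiT forward pass, and one call per sampling step is assumed):

```python
import torch

def wrap_with_easycache(denoise_fn, tau=0.05, warmup=10):
    """Wrap a denoiser callable so every call goes through the EasyCache decision logic.
    Sketch only; state is tracked per generated sequence."""
    state = {"step": 0, "delta": None, "err": 0.0, "k_ref": None,
             "prev_x": None, "prev_v": None, "v_ref": None}

    def cached_denoise(x: torch.Tensor, t) -> torch.Tensor:
        s = state
        if s["step"] < warmup or s["delta"] is None or s["err"] >= tau:
            v = denoise_fn(x, t)                       # full DiT pass
            if s["prev_x"] is not None:                # refresh the transformation rate
                s["k_ref"] = ((v - s["prev_v"]).norm()
                              / ((x - s["prev_x"]).norm() + 1e-8)).item()
            s["delta"], s["v_ref"], s["err"] = v - x, v, 0.0
        else:
            v = x + s["delta"]                         # reuse the cached transformation
            s["err"] += (s["k_ref"] or 0.0) * ((x - s["prev_x"]).norm()
                                               / (s["v_ref"].norm() + 1e-8)).item()
        s["prev_x"], s["prev_v"], s["step"] = x, v, s["step"] + 1
        return v

    return cached_denoise
```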
By making adaptive caching decisions on the basis of runtime inference dynamics and omitting static heuristics or offline profiling, EasyCache establishes a new state of the art for training-free diffusion model acceleration (Zhou et al., 3 Jul 2025).