TaylorSeer: Fast Diffusion Transformer Acceleration
- TaylorSeer is a training-free acceleration algorithm that uses truncated Taylor series expansions to predict intermediate diffusion model features.
- It employs finite-difference derivative estimation and last-block forecasting to achieve up to 5× speedups while reducing memory overhead.
- The method integrates prediction confidence gating to adaptively control error accumulation, balancing computational speed with model fidelity.
TaylorSeer is a class of training-free acceleration algorithms for diffusion models, specifically optimized for Transformer-based architectures such as Diffusion Transformers (DiTs). The fundamental innovation of TaylorSeer is to replace traditional feature caching—where intermediate features are stored and blindly reused across denoising timesteps—with predictive feature forecasting via truncated Taylor series expansions. TaylorSeer harnesses the smoothness and continuity of feature trajectories as a function of the diffusion timestep, allowing efficient and accurate prediction of features at future timesteps based on finite differences of previously computed features. This approach provides substantial reductions in inference time with minimal loss in sample quality, facilitating practical deployment of high-fidelity generative models in latency-sensitive scenarios.
1. Problem Setting and Motivation
Diffusion Transformers have set benchmarks in visual generation tasks but suffer from prohibitively high inference latency, which severely limits their usability in real-time and low-resource applications. Classical feature caching accelerates inference by storing and reusing features calculated at earlier timesteps. However, feature similarity decays exponentially with the interval between cached and reused timesteps, resulting in substantial prediction error and perceptual degradation (Liu et al., 10 Mar 2025, Guan et al., 4 Aug 2025). This motivates TaylorSeer's distinctive strategy: rather than naive reuse, employ a principled numerical method—Taylor expansion with finite-difference derivative estimation—to accurately forecast neural features at intermediate steps.
2. Mathematical Foundations of TaylorSeer
Let $\mathcal{F}^{(l)}(t)$ denote the output feature of layer $l$ at diffusion timestep $t$, with $t \in \{T, T-1, \dots, 0\}$. Assuming $\mathcal{F}^{(l)}$ is $m$-times differentiable in $t$, the Taylor series for $\mathcal{F}^{(l)}$ at timestep $t-k$ is:

$$\mathcal{F}^{(l)}(t-k) = \mathcal{F}^{(l)}(t) + \sum_{i=1}^{m} \frac{(-k)^i}{i!}\,\frac{d^i \mathcal{F}^{(l)}(t)}{dt^i} + R_m(t-k),$$

where $R_m$ is the remainder. To make this expansion practical, TaylorSeer estimates the derivatives using forward finite differences over cached timesteps spaced by a reuse interval $N$:

$$\frac{d^i \mathcal{F}^{(l)}(t)}{dt^i} \approx \frac{\Delta^i \mathcal{F}^{(l)}(t)}{N^i}.$$

Each $\Delta^i \mathcal{F}^{(l)}(t)$ is built recursively from features already computed at timesteps $t, t+N, \dots, t+iN$, with $\Delta^0 \mathcal{F}^{(l)}(t) = \mathcal{F}^{(l)}(t)$ and $\Delta^i \mathcal{F}^{(l)}(t) = \Delta^{i-1} \mathcal{F}^{(l)}(t+N) - \Delta^{i-1} \mathcal{F}^{(l)}(t)$. The $m$-th order TaylorSeer forecast of $\mathcal{F}^{(l)}(t-k)$ is:

$$\mathcal{F}^{(l)}_{\mathrm{pred},\,m}(t-k) = \sum_{i=0}^{m} \frac{\Delta^i \mathcal{F}^{(l)}(t)}{i!\,N^i}\,(-k)^i.$$
This expansion allows prediction of intermediate features at arbitrary steps between forced full computations, replacing costly forward passes with cheap linear algebraic operations (Liu et al., 10 Mar 2025).
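As a concrete illustration, the following minimal sketch (not from the original papers; it assumes PyTorch tensors and uses a hypothetical helper name `taylor_forecast`) builds the forward finite differences from features cached at timesteps $t, t+N, \dots, t+mN$ and evaluates the $m$-th order forecast at offset $k$:

```python
import math
import torch

def taylor_forecast(cached: list[torch.Tensor], k: int, N: int) -> torch.Tensor:
    """Forecast the feature at timestep t - k from cached features.

    `cached[i]` holds the feature computed at timestep t + i*N, so `cached[0]`
    is the most recent full computation. Implements
        F_pred(t - k) = sum_{i=0}^{m} Delta^i F(t) * (-k)^i / (i! * N^i).
    """
    m = len(cached) - 1
    # Forward finite differences: diffs[i] = Delta^i F(t).
    diffs, level = [cached[0]], list(cached)
    for _ in range(m):
        level = [level[j + 1] - level[j] for j in range(len(level) - 1)]
        diffs.append(level[0])
    # Truncated Taylor expansion evaluated at offset -k from timestep t.
    pred = torch.zeros_like(cached[0])
    for i, d in enumerate(diffs):
        pred = pred + d * ((-k) ** i) / (math.factorial(i) * N ** i)
    return pred
```

For example, with $m=2$ and $N=4$, `cached = [F_t, F_{t+4}, F_{t+8}]` and `taylor_forecast(cached, k=2, N=4)` estimates the feature two steps beyond the last full computation without running the network.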
3. Last-Block Forecasting and Module-Level Caching
Early TaylorSeer implementations performed module-level caching and prediction, requiring storage and access of features and their finite differences for every Transformer block and internal submodule. The memory overhead therefore scales with $M \cdot B$, where $B$ is the number of blocks and $M$ the number of cached submodules per block. To address this, recent work proposed last-block-only Taylor prediction, which caches and forecasts only the final block's output features and their finite differences, reducing the cache size by a factor of $3B$ in the typical case of $M = 3$ submodules per block (Guan et al., 4 Aug 2025). Empirical evaluation shows a substantial reduction in GPU memory usage without meaningful loss of prediction fidelity.
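A rough back-of-the-envelope sketch (hypothetical sizes, not figures from the papers) makes the scaling concrete: module-level caching stores the feature plus its order-$m$ differences at every submodule of every block, while last-block forecasting stores them once.

```python
def cache_tensor_count(num_blocks: int, modules_per_block: int, order: int,
                       last_block_only: bool) -> int:
    """Number of cached tensors: the feature itself plus finite differences
    up to `order`, at each cached site."""
    per_site = order + 1
    sites = 1 if last_block_only else num_blocks * modules_per_block
    return sites * per_site

# Illustrative DiT-like configuration: B = 28 blocks, 3 submodules per block, m = 2.
full = cache_tensor_count(28, 3, 2, last_block_only=False)  # 252 tensors
last = cache_tensor_count(28, 3, 2, last_block_only=True)   # 3 tensors
print(full, last, full // last)                             # reduction factor = 3 * B = 84
```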
4. Prediction Confidence Gating (PCG)
Fixed-interval Taylor forecasting risks error accumulation during unstable or nonlinear phases of the denoising trajectory. To mitigate this, TaylorSeer adopts a dynamic gating mechanism based on prediction confidence: at each predicted step, the output of the first Transformer block is approximated both analytically via Taylor expansion and directly via forward computation. Their relative norm error

$$e_t = \frac{\left\|\mathcal{F}^{(1)}_{\mathrm{pred}}(t) - \mathcal{F}^{(1)}(t)\right\|_2}{\left\|\mathcal{F}^{(1)}(t)\right\|_2}$$

is compared to a threshold $\delta$. If $e_t < \delta$, TaylorSeer skips direct computation of the subsequent blocks and uses the forecast; otherwise, full computation is triggered and the cache is reset (Guan et al., 4 Aug 2025). This adaptive error gating stabilizes quality at high acceleration ratios and minimizes perceptual degradation.
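A minimal sketch of the gate itself (assuming PyTorch; the function name and the $10^{-12}$ stabilizer are illustrative choices, not part of the published method):

```python
import torch

def confidence_gate_accepts(pred_first_block: torch.Tensor,
                            computed_first_block: torch.Tensor,
                            delta: float) -> bool:
    """Accept the Taylor forecast when the relative L2 error of the predicted
    first-block output against its directly computed counterpart is below delta."""
    err = torch.norm(pred_first_block - computed_first_block) / (
        torch.norm(computed_first_block) + 1e-12)
    return bool(err < delta)
```

Only the first block is ever recomputed to evaluate the gate; when it accepts, the remaining blocks are skipped for that timestep, so the gating overhead stays small relative to a full forward pass.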
5. Algorithmic Workflow and Pseudocode
Sampling proceeds as follows:
- At every $N$-th timestep (a forced-compute step): run the full network, cache the final block's output features, and compute and cache finite differences up to order $m$.
- For intermediate timesteps: predict last-block features using Taylor expansion with cached finite differences. Compute first-block error for PCG; skip direct block computation if error criterion satisfied.
- Update latent variable using predicted or computed denoising output.
Pseudocode excerpt (Liu et al., 10 Mar 2025, Guan et al., 4 Aug 2025):
```
for t = T down to 1:
    if forced_step(t):                        # every N-th timestep
        compute full features, update cache
    else:
        predict features by Taylor expansion
        if confidence_gate_accepts():
            skip block computation, use forecast
        else:
            run full computation, reset cache
    update latent x_{t-1}
```
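Putting the pieces together, here is a self-contained toy version of the loop (a hedged sketch, not the authors' implementation): a smooth synthetic trajectory stands in for the final-block features, the gate compares the forecast against the exact value because the toy has no separate first block, and the constants $T$, $N$, $m$, $\delta$ are illustrative.

```python
import math
import torch

T, N, m, delta = 50, 6, 2, 0.05   # illustrative settings

def features(t: int) -> torch.Tensor:
    """Stand-in for a full forward pass at timestep t (smooth in t)."""
    s = torch.tensor(t / T)
    return torch.stack((torch.sin(2 * s), torch.cos(3 * s), s * s))

def taylor_forecast(cached: list[torch.Tensor], k: int, N: int) -> torch.Tensor:
    """Same forecast as the Section 2 sketch, repeated so this snippet runs standalone."""
    diffs, level = [cached[0]], list(cached)
    for _ in range(len(cached) - 1):
        level = [level[j + 1] - level[j] for j in range(len(level) - 1)]
        diffs.append(level[0])
    return sum(d * ((-k) ** i) / (math.factorial(i) * N ** i) for i, d in enumerate(diffs))

cache, n_full, n_pred = [], 0, 0
for t in range(T, 0, -1):
    k = (T - t) % N                       # offset since the last forced-compute step
    if k == 0:                            # forced full computation: refresh the cache
        cache = [features(t)] + cache[:m]
        out, n_full = cache[0], n_full + 1
    elif len(cache) < m + 1:              # warm-up: compute fully, keep cache spacing = N
        out, n_full = features(t), n_full + 1
    else:
        pred = taylor_forecast(cache, k, N)
        ref = features(t)                 # toy gate reference (real method: first block only)
        if torch.norm(pred - ref) / torch.norm(ref) < delta:
            out, n_pred = pred, n_pred + 1            # accept forecast, skip "blocks"
        else:
            cache, out, n_full = [], ref, n_full + 1  # reject: recompute, reset cache
    # ...the denoiser / scheduler update to x_{t-1} would consume `out` here...

print(f"full computations: {n_full}, Taylor-predicted: {n_pred}")
```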
6. Experimental Results and Empirical Analysis
TaylorSeer has demonstrated state-of-the-art acceleration on multiple benchmarks:
- On FLUX (image generation, 50 DDIM steps): 4.99× speedup with improved ImageReward ($1.0039$ vs. $0.9898$) and comparable CLIP score ($19.427$ vs. $19.604$) [(Liu et al., 10 Mar 2025), Table 1].
- On HunyuanVideo (video synthesis): 5.00× speedup with a VBench score of $79.93\%$, on par with the non-accelerated baseline.
- On DiT-XL/2 (class-conditional ImageNet): 4.53× speedup with FID $2.65$, below the FID reported by previous best acceleration approaches at comparable speedups [(Liu et al., 10 Mar 2025), Table 3].
- Confidence-gated TaylorSeer achieves 3.17× acceleration on FLUX with negligible ImageReward and SSIM drops, 2.36× on DiT-XL/2 (FID $2.34$ vs. $2.32$ baseline), and 4.14× on Wan Video (Guan et al., 4 Aug 2025).
Ablative analysis reveals:
- Higher-order Taylor expansion ($m = 2$ or $3$) markedly reduces quality loss for large intervals $N$.
- Confidence gating trades off speed and stability, with the threshold $\delta$ controlling the adaptive regime.
- Memory and computational overhead both grow linearly with the Taylor order $m$ and the number of cached features.
| Benchmark | Speedup | Quality Metrics | TaylorSeer Setting |
|---|---|---|---|
| FLUX | 4.99× | ImageReward 1.0039 / CLIP 19.427 | — |
| DiT-XL/2 | 4.53× | FID 2.65 | — |
| HunyuanVideo | 5.00× | VBench 79.93% | — |
| Wan Video | 4.14× | VBench 79.0% / SSIM 0.5495 | Confidence-gated |
7. Limitations, Trade-Offs, and Future Directions
TaylorSeer's fidelity hinges on feature smoothness and the adequacy of the Taylor order $m$. For highly nonlinear features or very large intervals $N$, a higher $m$ may be required, at the cost of additional memory and arithmetic operations. Failure modes include accumulated remainder error when confidence gating is too permissive or in out-of-distribution regions. Memory overhead, while reduced by last-block caching, remains a concern in extremely resource-constrained deployments (Guan et al., 4 Aug 2025).
Potential extensions include learned error predictors for gating, adaptive per-block gating, and dynamic interval/order adjustment based on error history. These directions suggest TaylorSeer is an extensible paradigm for principled, training-free model acceleration under strict latency and memory constraints.
TaylorSeer thus represents a mathematically grounded, empirically validated method for fast and accurate diffusion model inference, achieving substantial speedups with negligible quality drop across a range of high-fidelity generative tasks (Liu et al., 10 Mar 2025, Guan et al., 4 Aug 2025).