Feature Caching in Generative Models
- Feature caching methods are techniques that reuse and forecast intermediate representations across timesteps to speed up inference in generative models.
- Subspace-aware and cluster-driven approaches like SVD-Cache leverage low-dimensional dynamics to achieve up to 6x speedups with minimal impact on output fidelity.
- Advanced implementations dynamically adjust cache strategies based on token sensitivity and error correction mechanisms, balancing efficiency with high sample quality.
Feature Caching Methods
Feature caching is a set of techniques designed to accelerate the inference phase of large neural architectures by reusing or forecasting intermediate representations, significantly reducing computational costs without retraining. These methods have become foundational in the practical deployment of generative diffusion models, especially in image, video, and molecular synthesis, where iterative sampling mechanisms require repeated forward passes through deep transformer or autoregressive networks (Chen et al., 12 Jan 2026). Caching leverages both temporal and spatial redundancies that emerge during the sampling trajectory, but sophisticated approaches must account for the heterogeneous evolution of different feature components to maintain sample fidelity at high acceleration ratios.
1. Principles of Feature Caching in Iterative Generative Models
Feature caching arises from the observation that the intermediate activations computed during adjacent timesteps of iterative samplers—such as those found in diffusion transformers (DiTs) and autoregressive generative models—are often highly similar due to the gradual noise reduction (denoising) process. At each timestep $t$, the model evaluates a set of feature maps at various layers or blocks over the noisy latent $x_t$. The naive baseline for inference, which recomputes every block at every timestep, incurs a cost proportional to the total number of steps $T$. Caching methods instead attempt to bypass this cost by storing intermediate features at reference timesteps, and reusing or predicting those features at subsequent steps (Chen et al., 12 Jan 2026, Liu et al., 15 Sep 2025, Zou et al., 2024).
The simplest form involves either verbatim reuse of cached features ("cache-then-reuse") or temporal extrapolation ("cache-then-forecast") of features based on previous activations. Some recent approaches demonstrate that certain subspaces or token clusters can be selectively reused or forecasted, leading to further cost reduction with minimal impact on output quality (Zheng et al., 5 Oct 2025, Zheng et al., 12 Sep 2025).
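To make these two baseline strategies concrete, the sketch below shows a generic sampling loop with either verbatim reuse or first-order forecasting of cached block activations. The hooks `block_features` and `finish_step` are hypothetical stand-ins for the expensive and cheap parts of a real model, not any particular library's API.

```python
import torch


def sample_with_cache(model, x, timesteps, cache_interval=3, mode="reuse"):
    """Generic cache-then-reuse / cache-then-forecast loop (illustrative only)."""
    cache, prev_cache = None, None
    for i, t in enumerate(timesteps):
        if cache is None or i % cache_interval == 0:
            # Reference step: run the expensive blocks and refresh the cache.
            prev_cache, cache = cache, model.block_features(x, t)
            feats = cache
        elif mode == "reuse" or prev_cache is None:
            # Cache-then-reuse: replay the stored activations verbatim.
            feats = cache
        else:
            # Cache-then-forecast: first-order extrapolation from the two
            # most recent reference activations.
            feats = cache + (cache - prev_cache)
        x = model.finish_step(x, t, feats)  # cheap remainder of the network
    return x
```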
2. Subspace-Aware and Dimension-Wise Feature Caching
Uniform caching—reuse or global prediction across all feature dimensions—frequently leads to error accumulation due to the non-homogeneous evolution of high-dimensional activations. Empirical studies reveal that diffusion feature spaces possess low-dimensional principal subspaces exhibiting smooth, predictable dynamics, and high-dimensional residual subspaces characterized by low energy and volatile oscillations (Chen et al., 12 Jan 2026).
SVD-Cache (Chen et al., 12 Jan 2026) introduces a subspace-aware framework (a minimal sketch follows the list):
- Apply singular value decomposition (SVD) to the reshaped feature tensor, splitting it into principal and residual components.
- The principal (low-rank) subspace, which captures the bulk of the total energy, is forecast via exponential moving average (EMA).
- The residual subspace is directly reused, bypassing costly prediction steps.
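A minimal sketch of the subspace split is shown below, assuming per-layer features flattened to a (tokens × dim) matrix. The rank choice, EMA decay, and exact forecast rule are simplifying assumptions for illustration and may differ from SVD-Cache's actual update.

```python
import torch


def split_subspaces(feat, rank):
    """Split a (tokens x dim) feature matrix into principal and residual parts via SVD."""
    U, S, Vh = torch.linalg.svd(feat, full_matrices=False)
    principal = (U[:, :rank] * S[:rank]) @ Vh[:rank]
    residual = feat - principal
    return principal, residual


class SubspaceCache:
    """EMA-forecast the principal subspace, reuse the residual (simplified sketch)."""

    def __init__(self, rank=32, ema_decay=0.7):
        self.rank, self.ema_decay = rank, ema_decay
        self.principal_ema = None
        self.residual = None

    def refresh(self, feat):
        # Full compute step: decompose the fresh features and update the EMA.
        principal, self.residual = split_subspaces(feat, self.rank)
        if self.principal_ema is None:
            self.principal_ema = principal
        else:
            self.principal_ema = (self.ema_decay * self.principal_ema
                                  + (1.0 - self.ema_decay) * principal)

    def forecast(self):
        # Cached step: forecast the smooth principal part, reuse the residual verbatim.
        return self.principal_ema + self.residual
```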
Dimension-wise or cluster-wise approaches (e.g., HyCa (Zheng et al., 5 Oct 2025)) further partition feature channels according to local temporal dynamics and assign a numerically appropriate ODE solver—explicit for smooth clusters, implicit for stiff clusters. This hybrid strategy effectively balances accuracy and acceleration by modeling feature evolutions as a set of coupled ODEs and selecting solvers via offline profiling and k-means clustering.
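An offline profiling step in this spirit might cluster channels by simple dynamics statistics, as sketched below. The choice of statistics, cluster count, and the solver assignment rule are illustrative assumptions rather than HyCa's actual recipe.

```python
import numpy as np
from sklearn.cluster import KMeans


def profile_channel_clusters(feature_traj, n_clusters=4):
    """Cluster feature channels by their temporal dynamics (offline profiling sketch).

    feature_traj: array of shape (timesteps, channels), e.g. per-channel means
    recorded over a small calibration run.
    """
    diffs = np.diff(feature_traj, axis=0)                               # per-channel temporal differences
    stats = np.stack([diffs.mean(axis=0), diffs.std(axis=0)], axis=1)   # (channels, 2) dynamics summary
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(stats)
    # Downstream, smooth (low-variance) clusters would get a cheap explicit update,
    # while stiff (high-variance) clusters fall back to reuse or an implicit-style refresh.
    return labels
```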
3. Spatiotemporal and Token/Cluster-Level Caching
Standard caching exploits only temporal coherence, ignoring spatial redundancies. Cluster-driven caching methods, notably ClusCa (Zheng et al., 12 Sep 2025), apply k-means clustering to token features at each timestep, computing only one representative per spatial cluster. Non-representative tokens are updated from their cluster representative, substantially reducing per-step computation.
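The core idea can be illustrated as follows: cluster the tokens, run the heavy block only on one representative per cluster, and broadcast the representative's update to the remaining cluster members. The crude Lloyd iteration and the `compute_block` callable are assumptions made to keep the sketch self-contained.

```python
import torch


def cluster_and_propagate(tokens, compute_block, n_clusters=64):
    """Cluster-level caching sketch: full compute on representatives only."""
    n, d = tokens.shape
    centers = tokens[torch.randperm(n)[:n_clusters]].clone()
    for _ in range(5):                                   # crude k-means (Lloyd) iterations
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for k in range(n_clusters):
            mask = assign == k
            if mask.any():
                centers[k] = tokens[mask].mean(dim=0)
    # The token closest to each center serves as the cluster representative.
    rep_idx = torch.cdist(centers, tokens).argmin(dim=1)
    rep_out = compute_block(tokens[rep_idx])             # expensive block on representatives only
    delta = rep_out - tokens[rep_idx]                    # representative update per cluster
    return tokens + delta[assign]                        # broadcast update to cluster members
```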
Token-wise selection (ToCa (Zou et al., 2024), DaTo (Zhang et al., 2024)) adapts cache ratios by token, layer depth, and structure type. Caching sensitivity scores (e.g., attention influence, cross-attention entropy, reuse-frequency freshness, spatial uniformity) allow dynamic skipping of only low-impact tokens. These methods systematically reduce computational complexity while maintaining detail in critical spatial regions.
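A toy version of score-based token selection is shown below: rank tokens by how much attention mass they receive and recompute only the top fraction, reusing cached features for the rest. The single attention-mass score is a placeholder; ToCa and DaTo combine several signals of this kind.

```python
import torch


def select_tokens_to_recompute(attn_weights, cache_ratio=0.7):
    """Pick the most influential tokens for recomputation (illustrative score).

    attn_weights: (heads, queries, keys) attention map from the previous step.
    Returns indices of tokens to recompute; all other tokens reuse cached features.
    """
    influence = attn_weights.mean(dim=(0, 1))            # per-key attention mass
    n_recompute = max(1, int((1 - cache_ratio) * influence.numel()))
    recompute_idx = influence.topk(n_recompute).indices  # high-impact tokens
    return recompute_idx
```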
Some frameworks (X-Slim (Wen et al., 14 Dec 2025)) integrate temporal, structural, and spatial caching under dual-threshold error controllers, pushing step-level reuse until a warning threshold, then "polishing" errors via selective block/token refreshes before a critical reset.
4. Forecasting with Advanced Numerical Schemes
Feature forecasting extends beyond naive extrapolation. HiCache (Feng et al., 23 Aug 2025) leverages the empirical Gaussianity of DiT feature derivatives and performs cache extrapolation using scaled Hermite polynomial bases, which are theoretically optimal for Gaussian-correlated processes (by the Karhunen–Loève theorem). This dual-scaling mechanism—scaling both input and polynomial coefficients—prevents numerical instability and sharply reduces error relative to Taylor basis methods.
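A rough sketch of Hermite-basis extrapolation over a cached feature trajectory is given below. It rescales only the timestep axis before fitting, which is a simplification of the dual-scaling mechanism; the basis degree and scaling constant are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import hermite_e as He


def hermite_forecast(cached_feats, cached_ts, target_t, degree=2, scale=None):
    """Extrapolate cached features to target_t in a (probabilists') Hermite basis.

    cached_feats: (k, dim) features stored at reference timesteps cached_ts,
    with k >= degree + 1. The timestep axis is rescaled before fitting to
    keep the fit numerically stable.
    """
    ts = np.asarray(cached_ts, dtype=np.float64)
    feats = np.asarray(cached_feats, dtype=np.float64)
    scale = scale or (ts.std() + 1e-8)
    u = (ts - ts.mean()) / scale                       # scaled inputs
    coeffs = He.hermefit(u, feats, degree)             # per-dimension Hermite coefficients
    u_target = (target_t - ts.mean()) / scale
    return He.hermeval(u_target, coeffs)               # forecast at the target timestep
```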
FoCa (Zheng et al., 22 Aug 2025) frames the caching process as numerically solving an ODE in hidden-feature space, combining a multistep backward-difference predictor (BDF2) with a Heun (trapezoidal) corrector to robustly integrate feature trajectories, effectively curbing forecast error accumulation at large skip intervals.
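The predictor-corrector structure can be illustrated on a generic ODE right-hand side f, as below; how FoCa actually obtains derivative estimates from cached hidden features is more involved, and the uniform step assumption is a simplification.

```python
def bdf2_heun_step(f, t_n, y_prev, y_n, h):
    """One predictor-corrector step on y' = f(t, y), assuming uniform step size h.

    Predictor: BDF2-style multistep formula with the implicit slope approximated
    at the current state. Corrector: Heun (trapezoidal) slope averaging.
    """
    y_pred = (4.0 * y_n - y_prev) / 3.0 + (2.0 * h / 3.0) * f(t_n + h, y_n)
    y_corr = y_n + (h / 2.0) * (f(t_n, y_n) + f(t_n + h, y_pred))
    return y_corr
```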
Speculative caching (SpeCa (Liu et al., 15 Sep 2025)) employs draft prediction followed by parameter-free verification at deep layers, enabling real-time acceptance or rejection of speculative features per sample, with dynamic adaptive computation allocation.
5. Error Accumulation, Exposure Bias, and Correction Mechanisms
Aggressive caching over long intervals may lead to severe error propagation or exposure bias—systematic deviation between denoiser predictions at inference versus training (Zou et al., 10 Mar 2025). EB-Cache addresses this with adaptive cache-table generation: an offline grid search constructs per-timestep caching strategies tuned to the local severity of exposure bias, coupled with noise scaling to partially restore alignment between inference trajectories and training-time behavior.
Gradient-Optimized Cache (GOC (Qiu et al., 7 Mar 2025)) propagates finite-difference approximations of the loss gradient and applies inflection-aware correction: filters identify trajectory inflection points where correction could inject conflicting updates, so gradient adjustments are applied only in safe regions.
Constraint-aware frameworks (ProCache (Cao et al., 19 Dec 2025)) construct non-uniform schedules (binary vectors specifying full compute vs. cache at each step) via offline constrained sampling and empirical FID evaluation, combined with selective partial updates to deep blocks and high-attention tokens to constrain error drift at minimal overhead.
6. Applications and Evaluations Across Domains
Feature caching extends beyond image and video models. In molecular geometry generation, predictive caching operates on SE(3)-equivariant backbones by forecasting costly last-layer outputs via finite-difference or Adams–Bashforth schemes, directly compatible with pretrained models and orthogonal to training-based accelerations (Sommer et al., 6 Oct 2025).
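An Adams-Bashforth-style forecast on finite-difference derivative estimates of a cached last-layer output might look like the sketch below; the actual order, step handling, and choice of which outputs to forecast are specific to the cited work.

```python
import torch


def ab2_forecast(y_older, y_old, y_curr, h_prev, h_next):
    """Second-order Adams-Bashforth-style forecast from three cached outputs.

    Derivatives are estimated by finite differences of the cached outputs
    (assuming roughly uniform spacing h_prev between them); the forecast
    extrapolates one cache interval h_next beyond the latest output.
    """
    f_prev = (y_old - y_older) / h_prev      # older derivative estimate
    f_curr = (y_curr - y_old) / h_prev       # latest derivative estimate
    return y_curr + h_next * (1.5 * f_curr - 0.5 * f_prev)
```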
In personalized generation (DreamCache (Aiello et al., 2024)), single-step feature extraction from reference images suffices to inject high-quality multi-resolution features into a frozen backbone via lightweight cross-attention adapters during sampling.
Frequency-aware methods (FreqCa (Liu et al., 9 Oct 2025)) separate low-frequency (structural) and high-frequency (detail) bands—reusing low-frequency features by similarity and forecasting high-frequency components by Hermite interpolation—combined with cumulative residual feature (CRF) caching to cut memory usage by 99% while preserving sample fidelity.
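A toy frequency split of a feature map is sketched below: the low band (structure) would be reused when similar across steps, while the high band (detail) is forecast separately. The circular cutoff mask is an illustrative choice, not FreqCa's actual band definition.

```python
import torch


def split_frequency_bands(feat_map, cutoff=0.25):
    """Split a (C, H, W) feature map into low- and high-frequency bands via FFT."""
    C, H, W = feat_map.shape
    freq = torch.fft.fftshift(torch.fft.fft2(feat_map), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    mask = (dist <= cutoff * min(H, W)).to(freq.dtype)   # circular low-pass mask
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    high = feat_map - low
    return low, high
```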
Masked autoregressive models (LazyMAR (Yan et al., 16 Mar 2025)) apply token redundancy and condition redundancy to both self-attention and conditional branches, selectively recomputing only a small token subset per step and caching difference vectors in classifier-free guidance, delivering speedups at near-baseline FID.
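One way to exploit condition redundancy is to cache the conditional/unconditional difference and skip the unconditional branch on cached steps. The sketch below uses the standard classifier-free guidance identity; it is a generic illustration of the idea, not LazyMAR's exact mechanism.

```python
def guided_prediction(eps_cond, guidance_scale, cached_delta):
    """Classifier-free guidance with a cached cond-uncond difference vector.

    Standard CFG: eps = eps_uncond + s * (eps_cond - eps_uncond).
    With delta = eps_cond - eps_uncond cached from a reference step, this is
    eps = eps_cond + (s - 1) * delta, so the unconditional forward pass can be
    skipped while the cached difference is still considered fresh.
    """
    return eps_cond + (guidance_scale - 1.0) * cached_delta
```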
7. Performance, Trade-Offs, and Limitations
Recent empirical results demonstrate near-lossless acceleration across diverse models:
- SVD-Cache achieves up to 6x speedup on FLUX and HunyuanVideo while retaining or surpassing ImageReward and CLIP scores (Chen et al., 12 Jan 2026).
- Cluster-driven and token-wise methods (ClusCa (Zheng et al., 12 Sep 2025), ToCa (Zou et al., 2024), DaTo (Zhang et al., 2024)) achieve multi-fold acceleration with minimal or even reduced FID.
- HiCache (Feng et al., 23 Aug 2025) and FoCa (Zheng et al., 22 Aug 2025) consistently outperform TaylorSeer, preserving sharper generated detail at long forecast intervals.
Fundamental trade-offs remain:
- Uniform caching rapidly accumulates error, especially in high-variance directions.
- Subspace-aware, token-wise, and cluster/pattern-scheduled caching introduce overhead in offline profiling, online scoring, or occasional full recomputation, but these are amortized by acceleration gains.
- Aggressive token/block skipping can degrade fidelity in fine detail regions, mitigated by adaptive correction, profile-guided block selection, or dual-threshold controllers.
Feature caching frameworks have generally proven robust across sampling schemes, architectures (DiT, DiffU-Net, MAR), and domains (video, molecules, personalization). Limitations appear primarily in settings with degenerate input distributions (variance-shifting prompts), extreme acceleration ratios, or where feature evolution statistics vary sharply across content.
Overall, feature caching methods constitute an essential part of the modern generative model acceleration toolkit, and ongoing research seeks further refinement in adaptive scheduling, subspace identification, error correction, and integration with advanced attention or quantization techniques.