Modality-Aware Step Caching
- Modality-Aware Step Caching is a set of inference acceleration strategies that dynamically adapts cache reuse based on modality-specific characteristics.
- It employs layer/step-wise similarity measures and per-modality token scoring to reduce redundant computations without retraining models.
- Empirical results demonstrate up to 1.6× speedup and 80% memory reduction across tasks, maintaining near-lossless quality.
Modality-Aware Step Caching refers to a class of inference-time acceleration strategies in neural sequence models where the storage and reuse of intermediate states (“caches”) is dynamically adapted per step and per modality. These approaches leverage the distinct statistical and computational structures of modalities (e.g., text, vision, audio) to balance cache memory usage, speed of computation, and task-specific accuracy. Initially emerging within Transformer-based architectures for diffusion, vision-language, and multimodal models, modality-aware step caching generalizes standard step-wise KV cache reuse by introducing fine-grained, per-layer, per-step controls, often guided by empirical analyses of cross-timestep similarity, modal sparsity, or layer/head sensitivity. These controls enable substantial compute savings during inference across modalities without retraining or architectural modification.
1. Foundations and Motivating Observations
Modality-aware step caching is motivated by two core empirical findings. First, in iterative or autoregressive models (e.g., diffusion Transformers, autoregressive vision-LLMs), intermediate representations at adjacent steps are often highly similar, owing to smooth denoising updates or the locality of multimodal dependencies. Second, the relative importance of tokens and layers varies significantly across modalities (vision, language, audio) and inference phases (prefill, decoding) (Liu et al., 2024, Qin et al., 15 Dec 2025, Tu et al., 2024, Li et al., 6 Jun 2025). Exploiting these patterns enables dynamic, targeted reuse of cached data to accelerate inference.
For example, in diffusion models, activations at a given timestep are nearly identical to those at the preceding timestep, with high cosine similarity for most layers and timesteps; this motivates skipping full recomputation whenever the per-layer “drift” falls below a calibrated threshold (Liu et al., 2024). In vision-LLMs, attention patterns show acute differences between visual and text tokens, and per-modality sparsity and importance can be directly measured and exploited (Tu et al., 2024, Qin et al., 15 Dec 2025, Li et al., 6 Jun 2025).
2. Core Algorithms and Mathematical Framework
At its core, modality-aware step caching is instantiated through three interrelated algorithmic mechanisms:
2.1 Layer/Step-wise Similarity and Cache Reuse
For diffusion transformers, SmoothCache formalizes caching as follows. At each step t, the layer-l activation x_l(t) (post-residual, pre-layernorm) is compared to its predecessor x_l(t−1), tracking both cosine similarity and the relative representation error e_l(t) = ‖x_l(t) − x_l(t−1)‖ / ‖x_l(t−1)‖.
If e_l(t) ≤ ε_l (a per-layer threshold obtained from calibration), caching is triggered and the layer's computation at step t is skipped in favor of the cached output (Liu et al., 2024).
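The reuse rule above can be sketched as a small wrapper around a layer's forward pass. This is an illustrative sketch, not the SmoothCache implementation: the threshold value, the placeholder layer function, and the choice to measure drift on layer inputs are all assumptions.

```python
import numpy as np

def relative_error(curr, prev):
    """Relative representation error between adjacent-step activations."""
    return np.linalg.norm(curr - prev) / (np.linalg.norm(prev) + 1e-8)

class StepCache:
    """Per-layer step cache: skip recomputation when drift is small.

    `eps` plays the role of the per-layer threshold from calibration;
    drift is measured against the input of the last *recomputed* step.
    """
    def __init__(self, eps):
        self.eps = eps        # per-layer reuse threshold
        self.prev_in = None   # input at the last recomputed step
        self.prev_out = None  # cached layer output

    def __call__(self, layer_fn, x):
        if self.prev_in is not None and relative_error(x, self.prev_in) <= self.eps:
            return self.prev_out              # drift below threshold: reuse
        out = layer_fn(x)                     # recompute and refresh cache
        self.prev_in, self.prev_out = x, out
        return out
```

A per-layer list of such caches, each with its own calibrated threshold, reproduces the layer/step-wise reuse pattern described above.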
2.2 Modality-Dependent Budgeting and Token Selection
In vision-LLMs, cache compression leverages per-layer/step token importance, sparsity, and modality assignment, employing mechanisms such as:
- Sparsity-aware budgeting: Allocate per-layer budgets b_l proportional to the local attention density d_l, caching only a total fraction ρ of keys/values: b_l = ρ · N · d_l / Σ_j d_j, where N is the total token count.
- Modality-aware token scoring: Accumulate attention over “post-vision” queries for language tokens, or aggregate proxy-attention for each modality, yielding dynamic importance scores s_j = Σ_{i∈Q} A_{ij}, where A is the attention matrix and Q the set of relevant queries.
- Per-head, per-modality adaptation: In MadaKV, each attention head h tracks a dynamic modality-preference score p_h(m) by aggregating proxy attention over the tokens of each modality m, updated with an exponential moving average (Li et al., 6 Jun 2025).
- Hierarchical compensation: When token retention at a given layer (or head-modality pair) deviates from its preallocated budget due to importance, adjustments are propagated to subsequent layers.
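The budgeting and scoring mechanisms above can be illustrated with a minimal sketch. The function names, the proportional-allocation rule, and the proxy-query convention are assumptions for illustration, not the VL-Cache or MadaKV implementations.

```python
import numpy as np

def allocate_budgets(densities, total_budget):
    """Split a global KV budget across layers in proportion to each
    layer's attention density (denser layers get larger shares)."""
    d = np.asarray(densities, dtype=float)
    return np.round(total_budget * d / d.sum()).astype(int)

def modality_scores(attn, query_idx, modality_of):
    """Score each key token by attention mass received from a proxy
    query set (e.g. a "post-vision" window), grouped per modality.

    attn:        (num_queries, num_keys) attention matrix
    query_idx:   indices of proxy queries
    modality_of: per-key modality labels, e.g. "vision" / "text"
    """
    token_scores = attn[query_idx].sum(axis=0)   # per-token importance
    per_modality = {}
    for j, m in enumerate(modality_of):
        per_modality[m] = per_modality.get(m, 0.0) + token_scores[j]
    return token_scores, per_modality

def keep_top_k(token_scores, k):
    """Indices of the k highest-scoring tokens (those kept in cache)."""
    return np.argsort(token_scores)[::-1][:k]
```

Combining the three: allocate a per-layer budget, score tokens from proxy queries, then retain only the top-scoring tokens within that budget and evict the rest.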
2.3 Stepwise Dynamic Recompute Policies
VLCache introduces a dynamic, layer-aware strategy, formalizing error propagation and sensitivity profiling:
- For token i at layer l, the cumulative reuse error combines the self-error introduced by reusing that token's cached state with the propagation error inherited from reused tokens at earlier layers.
- Sensitivity, measured as the mean squared error on outputs when recomputing only a fraction r_l of the tokens at layer l, is profiled per layer, and a global optimization minimizes total output error subject to a global compute budget and monotonicity constraints.
The recompute fractions r_l are greedily increased to maximize coverage subject to a target recompute budget (Qin et al., 15 Dec 2025).
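The greedy allocation step can be sketched as follows, assuming per-layer sensitivity curves are available as callables (estimated output error as a function of the recomputed fraction). The increment size and stopping rule here are illustrative choices, not VLCache's actual procedure.

```python
def greedy_recompute_allocation(sensitivity, step=0.05, budget=1.0):
    """Greedily raise per-layer recompute fractions r_l to minimize
    total estimated error under a global budget sum(r_l) <= budget.

    sensitivity: list of callables; sensitivity[l](r) estimates the
    output error at layer l when only a fraction r of tokens is
    recomputed (monotonically non-increasing in r).
    """
    n = len(sensitivity)
    r = [0.0] * n
    spent = 0.0
    while spent + step <= budget + 1e-12:
        # pick the layer where one more increment reduces error most
        gains = [
            sensitivity[l](r[l]) - sensitivity[l](min(r[l] + step, 1.0))
            for l in range(n)
        ]
        best = max(range(n), key=lambda l: gains[l])
        if gains[best] <= 0 or r[best] >= 1.0:
            break                  # no layer benefits from more budget
        r[best] = min(r[best] + step, 1.0)
        spent += step
    return r
```

Under monotone sensitivity curves, this marginal-gain rule spends the recompute budget on the layers where reuse error would hurt the output most.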
3. Calibration and Integration Procedures
The effectiveness of modality-aware step caching requires explicit calibration and integration into existing inference pipelines.
- Calibration: For SmoothCache, a one-shot calibration pass collects empirical error statistics across layers/timesteps over a small sample set (up to roughly 50 samples), from which per-layer thresholds are derived subject to a global expected-error budget.
- Budget allocation: In VL-Cache, attention-derived sparsity is computed post-prefill; per-layer budgets are then allocated in closed form for the desired total cache ratio.
- Token scoring and eviction: Token importance scores are computed from attention matrices, and only the top-scoring tokens per layer or per modality are retained up to the allocated budget, with hard eviction of the remainder (Tu et al., 2024, Li et al., 6 Jun 2025).
- Dynamic update: For step caching in multimodal LLMs, per-step dynamic adaptation of retention criteria and inter-layer compensation is implemented using an exponential moving average and real-time budget drift monitoring.
Integration is generally orthogonal to solver choice (e.g., DDIM, DPM-Solver++, Rectified Flow) and remains compatible with quantization or pruning techniques.
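A one-shot calibration pass of the kind described above might look like the following sketch. The rule for turning drift statistics into per-layer thresholds (largest observed drift within the global budget) is an assumption for illustration, not the procedure from the cited papers.

```python
import numpy as np

def calibrate_thresholds(error_traces, global_budget):
    """Derive per-layer reuse thresholds from recorded drift statistics.

    error_traces:  dict layer -> list of relative errors observed
                   between adjacent steps on a small calibration set.
    global_budget: target expected relative error when reuse triggers.

    Illustrative rule: set each layer's threshold to the largest
    observed error not exceeding the budget, so reuse only fires on
    steps whose calibrated drift stayed within budget.
    """
    thresholds = {}
    for layer, errs in error_traces.items():
        errs = np.asarray(errs)
        ok = errs[errs <= global_budget]
        # layers that never drift slowly get a zero (never-reuse) threshold
        thresholds[layer] = float(ok.max()) if ok.size else 0.0
    return thresholds
```

The resulting per-layer thresholds can then drive a step cache like the one sketched in section 2.1.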
4. Empirical Results Across Modalities
The practical impact of modality-aware step caching has been demonstrated across image, video, audio, and vision-language tasks.
| Model/Task | Compute Reduction / Speedup | Quality Degradation | Reference |
|---|---|---|---|
| DiT-XL (Image, DDIM 50) | 8.7% latency ↓ (threshold 0.08), up to 42% ↓ (threshold 0.18) | FID matches baseline up to 24% MAC ↓; rises to 2.65 at 42% MAC ↓ | (Liu et al., 2024) |
| Open-Sora (Video, Rectified Flow 30) | 8% speedup | VBench: 79.36% → 78.10% | (Liu et al., 2024) |
| Stable Audio Open (Audio, DPM-Solver++ 100) | 19–34% speedup (threshold 0.15–0.30) | CLAP/FD(KL) within ±2% | (Liu et al., 2024) |
| Qwen3-VL-8B (Vision-Language) | 1.7–1.9× TTFT speedup (static), 1.5–1.9× (dynamic) | MeanAcc unchanged (74.04% → 74.42%) | (Qin et al., 15 Dec 2025) |
| LLaVA-based VLMs (VL-Cache) | 7.08× decoding speedup (cache ratio 0.10) | ≥98% accuracy retained | (Tu et al., 2024) |
| Multimodal LLMs (MadaKV, η=0.2) | 80% memory ↓; 1.42–1.62× decoding speedup | Full accuracy on MileBench; +5–6% over baselines | (Li et al., 6 Jun 2025) |
Across these studies, reducing cache size by 80–95% typically keeps quality losses well within 2% in robust settings, and yields direct latency speedups ranging from roughly 1.2× to 7×, especially in long-context or vision-heavy input scenarios.
5. Practical Design Guidelines and Hyperparameter Selection
Several empirical findings inform parameterization and deployment:
- Calibration set size: A small calibration set is often sufficient for stable error curves; going beyond roughly 50 samples gives diminishing returns (Liu et al., 2024).
- Error/accuracy trade-off: Starting with a reuse threshold of at most roughly 0.10 (SmoothCache) and a compression parameter of at most roughly 0.3 (KV caching) achieves near-lossless performance; larger values enable more aggressive compression at the expense of quality (Liu et al., 2024, Li et al., 6 Jun 2025).
- Per-step skip distance: Most benefits accrue when the cached step is one to two steps behind the current step; if the measured error remains low, larger skip distances may be used (Liu et al., 2024).
- Budget granularity: Enforcing separate thresholds per block type or head can prevent catastrophic information loss (Liu et al., 2024, Li et al., 6 Jun 2025).
- Adaptive dynamics: Online adaptation of per-modality or per-head quotas via EMA (β ≈ 0.1–0.2) is recommended for nonstationary or heterogeneous inputs (Li et al., 6 Jun 2025).
- Attention proxy for importance: Use aggregate attention from a proxy token set—either last few text tokens, or the “post-vision” window—to score retention value (Li et al., 6 Jun 2025, Tu et al., 2024).
- Eviction triggers and overhead: Hard eviction occurs when cache size exceeds threshold; kernel overhead for measuring sparsity and scoring remains under 6% of prefill (Tu et al., 2024).
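The EMA-based adaptive quota update recommended above can be sketched in a few lines. The dictionary representation and the proportional quota split are illustrative assumptions, not the MadaKV implementation.

```python
def ema_update(pref, observed, beta=0.15):
    """Exponential-moving-average update of per-modality preference
    scores; beta in roughly [0.1, 0.2] per the guideline above.

    pref, observed: dict modality -> score.
    """
    return {
        m: beta * observed.get(m, 0.0) + (1.0 - beta) * pref.get(m, 0.0)
        for m in set(pref) | set(observed)
    }

def split_quota(pref, total):
    """Allocate a total cache quota across modalities in proportion
    to the current preference scores."""
    s = sum(pref.values()) or 1.0
    return {m: int(round(total * v / s)) for m, v in pref.items()}
```

Running `ema_update` once per step keeps quotas tracking nonstationary inputs while smoothing over transient attention spikes.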
6. Theoretical Guarantees and Generalization
Modality-aware step caching frameworks provide formal bounds on error propagation and efficiency:
- Error bound: In VLCache, the maximum per-token reuse error under partial recomputation is bounded in terms of self and propagation errors, remaining below tolerance for a sufficient fraction of recomputed tokens (Qin et al., 15 Dec 2025).
- Optimality: Layer sensitivity profiling and budget-constrained optimization deliver a provably Pareto-efficient trade-off between compute reduction and output degradation.
- Generality: The blueprint extends beyond vision-text to any modality-paired encoder/decoder or hybrid fusion system, provided modality hashes/caches are tracked separately (Qin et al., 15 Dec 2025).
- Non-invasiveness: All methods are inference-time only, require no retraining, and are compatible with existing decoders, solvers, and batch schedulers.
7. Limitations and Open Issues
While modality-aware step caching achieves strong empirical and theoretical support, several limitations persist:
- Prompt structure dependence: Effectiveness for cache compression depends on identifiable post-modality segments (e.g., for VL-Cache, well-defined “post-vision” windows). For complex or interleaved input prompts, window selection heuristics may need adaptation (Tu et al., 2024).
- Prefill latency and memory: In systems where prefill dominates memory footprint, overall batching remains bottlenecked on uncompressed cache (Tu et al., 2024).
- Lossy edge cases: Fine-grained tasks that depend on rare or highly local cross-modal dependencies can incur >2% degradation at aggressive settings. Selection of thresholds, cache budgets, and adaptation rates is task- and budget-dependent (Tu et al., 2024, Li et al., 6 Jun 2025).
- Streaming/continuous inference: Existing frameworks primarily target fixed-prompt batched inference. Extending modality-aware caching to streaming or real-time chunked input remains an area for further work.
Modality-aware step caching, as formalized in SmoothCache, VLCache, VL-Cache, and MadaKV, now constitutes a critical component for practical scaling of large multimodal and generative models, delivering robust memory and latency gains by tightly coupling modality-specific analysis with dynamic step-wise cache reuse and adaptation (Liu et al., 2024, Qin et al., 15 Dec 2025, Tu et al., 2024, Li et al., 6 Jun 2025).