Selective Token Caching Explained
- Selective token caching is an inference strategy that dynamically identifies and retains critical tokens while pruning less informative ones to optimize transformer efficiency.
- It leverages attention scores, learned predictors, and clustering methods to achieve significant speedups and memory savings across language, vision, and diffusion models.
- Practical implementations balance computational savings with accuracy, reporting up to 23.8× acceleration and substantial memory reduction without retraining.
Selective token caching refers to any inference-time strategy that reduces the memory and computation footprint of transformer-based neural models by caching, pruning, skipping, or re-computing only a dynamically chosen subset of tokens’ intermediate activations or key-value (KV) cache entries across layers, timesteps, or requests. Unlike naive block-based or full-sequence caching, selective token caching employs analytical, learned, or rule-based methods to identify—at run time—which tokens are most critical for prediction, attention, or downstream reasoning, and aggressively reuses, compresses, or discards the remainder. This paradigm is central to scaling language, vision, diffusion, and world simulation models in resource-constrained environments and ultra-long-context scenarios, and encompasses a broad range of recent algorithmic developments in both language modeling and generative modeling.
1. Fundamental Principles and Objectives
The main objective of selective token caching is to reduce the asymptotic memory and computation complexity inherent in transformer architectures, which otherwise scale linearly (memory) or quadratically (attention FLOPs) with context length or number of input tokens. In canonical autoregressive LLMs and diffusion transformers, all prior tokens’ KV states or feature maps are preserved for each new inference step, incurring costs that rapidly become prohibitive as context grows.
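The linear memory growth described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes` and the model shape are assumptions, not values from any cited system. It compares the KV-cache footprint of a full long context with that of a fixed budget of retained tokens.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold cached keys and values for `num_tokens` tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Hypothetical model shape, chosen only for illustration (fp16 elements).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

full = kv_cache_bytes(num_tokens=128_000, **cfg)  # every prior token kept
kept = kv_cache_bytes(num_tokens=4_096, **cfg)    # fixed retained-token budget

print(f"full cache: {full / 2**30:.2f} GiB")
print(f"selective:  {kept / 2**30:.2f} GiB")
```

Under these assumed dimensions the cache costs 128 KiB per token, so capping the number of retained tokens caps memory regardless of how long the context grows.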
Selective token caching strategies aim to:
- Identify non-uniform token-level contributions to model outputs, exploiting empirical sparsity or locality in the attention, hidden, or feature dynamics.
- Retain or recompute only tokens deemed “critical” by task-specific or model-specific metrics—such as attention weight, contextual saliency, predictive dynamics, or learned importance scores.
- Support adaptive cache eviction, feature reuse, or surrogate prediction on non-critical (“redundant” or “static”) tokens to cap memory usage and accelerate inference.
- Control accuracy-speed-memory trade-offs with user-settable hyperparameters or online policies.
This approach is deployed both in long-context LLM inference via selective KV cache management (Lou et al., 2024, Wu et al., 2024, Akhauri et al., 10 Mar 2025, Goel et al., 18 Apr 2025, Guo et al., 26 Jan 2026, Wang et al., 18 Dec 2025), and in vision and diffusion models via feature token selection, prediction, or clustering (Zou et al., 2024, Zhang et al., 2024, Liu et al., 26 May 2025, Qin et al., 26 Dec 2025, Cao et al., 19 Dec 2025, Zheng et al., 12 Sep 2025, Feng et al., 6 Mar 2026, Zuo et al., 28 Jan 2026).
2. Mechanisms and Algorithms for Token Selection
Attention-Derived Importance
- Score-based selection: Methods such as TokenSelect (Wu et al., 2024) compute per-token importance by aggregating per-head Query-Key dot products, using a head-wise softmax-aggregation to avoid dominance by single heads. The top-$k$ tokens thus identified are retained for attention, and the rest are omitted, yielding up to 23.8× speedup in attention computation with minimal degradation.
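- Eviction error minimization: CAOTE (Goel et al., 18 Apr 2025) defines the precise deviation in the attention output induced by evicting token $j$:
$$e_j = \frac{\alpha_j}{1-\alpha_j}\,\lVert o - v_j \rVert_2, \qquad o = \sum_i \alpha_i v_i,$$
with $\alpha_j$ the attention score and $v_j$ the value vector of token $j$, and $o$ the attention output before eviction (the remaining scores are renormalized to sum to one after eviction). The token with minimal $e_j$ is evicted, guaranteeing close control of attention output drift.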
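Both attention-derived criteria can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the TokenSelect or CAOTE implementation: the function names and tensor layouts are hypothetical. The eviction error assumes plain softmax attention whose remaining weights are renormalized after a token is removed, in which case evicting token j shifts the output o by a vector of norm α_j / (1 − α_j) · ‖o − v_j‖.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_topk_tokens(q, K, k):
    """q: (heads, d) current query; K: (heads, tokens, d) cached keys.
    Returns the sorted indices of the k tokens with the highest aggregated score."""
    logits = np.einsum("hd,htd->ht", q, K) / np.sqrt(q.shape[-1])
    per_head = softmax(logits, axis=-1)  # normalize within each head first
    scores = per_head.sum(axis=0)        # then aggregate across heads
    return np.sort(np.argsort(scores)[-k:])

def eviction_errors(alpha, V):
    """alpha: (tokens,) attention weights summing to 1; V: (tokens, d) values.
    Returns, per token, the norm of the output change if that token is evicted.
    Assumes no single weight equals 1 (otherwise the ratio is undefined)."""
    out = alpha @ V
    return alpha / (1.0 - alpha) * np.linalg.norm(out - V, axis=-1)

rng = np.random.default_rng(0)
alpha = softmax(rng.normal(size=6))
V = rng.normal(size=(6, 4))
errs = eviction_errors(alpha, V)
evict = int(np.argmin(errs))  # token whose removal perturbs the output least
```

Normalizing per head before summing means that a head with unusually large logits cannot dominate the vote, which is the motivation the text gives for head-wise aggregation.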
Learned or Predictive Importance
- Predictor-based selection: TokenButler (Akhauri et al., 10 Mar 2025) and DynTS (Guo et al., 26 Jan 2026) attach auxiliary predictors (small MLPs) to estimate the future influence of each token, either on next-token generation quality or on final answer accuracy (reasoning tasks). DynTS, for example, retains only the top-k scored “decision-critical” tokens from long reasoning traces, maintaining full performance at 1.6–1.9× speedup and >3× memory reduction.
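A rough sketch of the predictor-based pattern follows. It is not the TokenButler or DynTS architecture: the two-layer MLP, its dimensions, and the function names are hypothetical, and random weights stand in for a trained predictor. The point is only the control flow, in which a small network scores each token's hidden state and the cache keeps the top-k.

```python
import numpy as np

def predictor_scores(H, W1, b1, W2, b2):
    """H: (tokens, d) hidden states -> (tokens,) predicted importance."""
    hidden = np.maximum(H @ W1 + b1, 0.0)  # ReLU layer
    return (hidden @ W2 + b2).ravel()

def retain_topk(H, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    keep = np.sort(np.argsort(scores)[-k:])
    return keep, H[keep]

rng = np.random.default_rng(0)
d, hidden_dim, tokens = 16, 8, 32

# Random weights are placeholders for a trained importance predictor.
W1, b1 = rng.normal(size=(d, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, 1)), np.zeros(1)

H = rng.normal(size=(tokens, d))
scores = predictor_scores(H, W1, b1, W2, b2)
keep, H_kept = retain_topk(H, scores, k=8)
```

Sorting the kept indices preserves token order, which matters when positional information is tied to the cache layout.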
Feature and Dynamics Metrics
- Temporal token “dynamics”: DaTo (Zhang et al., 2024) and FastCache (Liu et al., 26 May 2025) prune tokens whose feature change is below a threshold (“static” tokens) and only process high-dynamic tokens. DaTo achieves up to 9× acceleration with improved or maintained sample quality.
- Curvature and multidimensional dynamics: WorldCache (Feng et al., 6 Mar 2026) computes per-token “curvature” from discrete accelerations. Stable tokens are zero-order cached, linear tokens linearly extrapolated, and chaotic tokens are handled with Hermite-blended velocity predictors. Adaptive skipping is triggered only when dimensionless, normalized drift exceeds a threshold, sustaining up to 3.7× acceleration at >98% quality, even in multi-modal world models.
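The dynamics-based criteria above can be sketched with finite differences over three consecutive feature snapshots. This is a simplified illustration, not the DaTo, FastCache, or WorldCache algorithm: the thresholds `static_tol` and `curve_tol`, the normalization by feature norm, and the three-way labeling are assumptions made for the example.

```python
import numpy as np

def token_dynamics(x_prev2, x_prev, x_curr, eps=1e-8):
    """Each input: (tokens, d) features at three consecutive steps.
    Returns per-token relative velocity and curvature magnitudes."""
    vel = x_curr - x_prev                  # first difference
    acc = x_curr - 2.0 * x_prev + x_prev2  # second difference (discrete acceleration)
    scale = np.linalg.norm(x_curr, axis=-1) + eps
    return (np.linalg.norm(vel, axis=-1) / scale,
            np.linalg.norm(acc, axis=-1) / scale)

def classify_tokens(velocity, curvature, static_tol=0.01, curve_tol=0.01):
    """0 = stable (reuse cache), 1 = linear (extrapolate), 2 = chaotic."""
    labels = np.full(velocity.shape, 2)
    labels[curvature < curve_tol] = 1
    labels[(velocity < static_tol) & (curvature < curve_tol)] = 0
    return labels

# Toy features: token 0 is constant, token 1 moves linearly, token 2 oscillates.
x0 = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
x1 = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]])
x2 = np.array([[1.0, 1.0], [3.0, 0.0], [1.0, 0.0]])

velocity, curvature = token_dynamics(x0, x1, x2)
labels = classify_tokens(velocity, curvature)

# Linear tokens are extrapolated one step ahead; stable tokens reuse x2.
# Chaotic tokens keep x2 here only as a placeholder and would need a
# stronger predictor or a real recomputation.
x3_guess = np.where((labels == 1)[:, None], x2 + (x2 - x1), x2)
```

Dividing by the feature norm makes the thresholds dimensionless, so a single tolerance can be shared across layers whose activations differ in scale.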
Clustering and Spatial Grouping
- Cluster-driven feature caching: ClusCa (Zheng et al., 12 Sep 2025) clusters spatial tokens (e.g. via k-means), recomputes only one token per cluster, and broadcasts updates by propagating representative token features through the cluster. This yields 90% reduction in per-layer token computation with negligible or positive impacts on image/video generation quality.
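A minimal sketch of the cluster-then-broadcast idea is given below. It should not be read as ClusCa's implementation: the plain Lloyd's k-means, the choice of the first member as representative, the blending weight `gamma`, and the stand-in `layer_fn` are all assumptions made to keep the example self-contained.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave empty clusters where they are
                centroids[c] = X[labels == c].mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1), centroids

def cluster_cached_step(tokens, cached_out, labels, layer_fn, gamma=0.5):
    """Recompute one representative per cluster and blend its fresh output
    into the cached outputs of the remaining members."""
    out = cached_out.copy()
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        rep = members[0]                          # arbitrary representative
        fresh = layer_fn(tokens[rep:rep + 1])[0]  # the only real computation
        out[members] = (1.0 - gamma) * cached_out[members] + gamma * fresh
        out[rep] = fresh                          # representative is exact
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 8))
W = rng.normal(size=(8, 8))
layer_fn = lambda x: np.tanh(x @ W)  # hypothetical stand-in for a model layer

labels, _ = kmeans(tokens, k=8)
cached_out = layer_fn(tokens)        # pretend these were cached one step earlier
new_out = cluster_cached_step(tokens, cached_out, labels, layer_fn)
```

With 64 tokens and at most 8 clusters, the layer runs on at most 8 rows per step instead of 64; the saving grows with the ratio of tokens to clusters.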
3. Temporal, Spatial, and Architectural Axes of Caching
Selective token caching is applied across several orthogonal axes:
- Temporal locality: Caches or surrogates are reused across multiple timesteps, with policies ranging from uniform round-robin to constraint-optimized, non-uniform schedules (ProCache (Cao et al., 19 Dec 2025), SpotEdit (Qin et al., 26 Dec 2025), Window-Diffusion (Zuo et al., 28 Jan 2026)).
- Spatial granularity: Pruning, grouping, or caching is performed at (1) fine token level (as in ToCa (Zou et al., 2024), DaTo (Zhang et al., 2024)), (2) clusters/patches (ClusCa (Zheng et al., 12 Sep 2025)), or (3) chunk/block regions (MEPIC (Wang et al., 18 Dec 2025)).
- Hierarchical model depth: Advanced methods (ToCa (Zou et al., 2024), TokenCache (Lou et al., 2024), ProCache (Cao et al., 19 Dec 2025)) allow independent caching ratios per layer, model depth, or block type (e.g., more aggressive caching in deeper or text-unconditioned layers).
- Modality-aware selection: WorldCache (Feng et al., 6 Mar 2026) and LAC (Wei et al., 31 Jan 2026) adapt the policy to heterogeneous visual, depth, or language actions in multi-modal or vision-language-action models.
4. Practical Implementations and Quantitative Impact
A variety of implementations combine analytic scores, learned predictions, and scheduler logic:
- Paged/KV chunk caching: For long-context LLMs, paged KV-caches and block-aligned chunk layouts (MEPIC (Wang et al., 18 Dec 2025)) allow cross-request, position-independent cache reuse with single-block recomputation and RoPE-fused position reconstruction, reducing HBM usage by up to 5× at identical accuracy.
- Cache predictors and superposed forward passes: TokenCache (Lou et al., 2024) and ToCa (Zou et al., 2024) integrate lightweight networks into the inference loop to provide real-time, differentiable token importance or “caching scores.”
- Statistical and hypothesis tests: FastCache (Liu et al., 26 May 2025) applies per-block chi-squared tests of feature change to limit error accumulation, combining learned linear surrogate updates with explicit statistical guarantees on the approximation error.
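The statistical gating idea in the last item can be illustrated with a simple chi-squared test on the change in a block's features. This is a sketch under loud assumptions, not FastCache's actual statistic: it treats per-element changes under the null hypothesis as independent Gaussians with a known variance `sigma2`, and it uses the Wilson-Hilferty approximation for the chi-squared quantile so that only NumPy and the standard library are needed. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * np.sqrt(a)) ** 3

def can_reuse_cache(h_prev, h_curr, sigma2, alpha=0.05):
    """Reuse the cached block output unless the feature change is
    significantly larger than noise of variance `sigma2` would explain."""
    diff = h_curr - h_prev
    stat = float(np.sum(diff ** 2)) / sigma2  # ~ chi2(diff.size) under the null
    return bool(stat <= chi2_quantile(1.0 - alpha, diff.size))

rng = np.random.default_rng(0)
h_prev = rng.normal(size=(16, 8))
small_change = h_prev + 0.01 * rng.normal(size=h_prev.shape)
large_change = h_prev + 1.0

reuse_small = can_reuse_cache(h_prev, small_change, sigma2=0.01)
reuse_large = can_reuse_cache(h_prev, large_change, sigma2=0.01)
```

The significance level `alpha` plays the role of the accuracy-speed knob: a smaller value makes the test more tolerant of change and therefore reuses the cache more often.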
Table: Selective Token Caching—Empirical Results
| Method | Domain | Acceleration | Quality Penalty |
|---|---|---|---|
| TokenSelect (Wu et al., 2024) | LLMs (KV) | up to 23.8× attn | negligible, <1% |
| CAOTE (Goel et al., 18 Apr 2025) | LLMs (KV) | 2× cache | uniform ↑ accuracy |
| DaTo (Zhang et al., 2024) | SD diffusion | 9× SD, 2.32× SDXL | ΔFID −0.33…−2.17 |
| ClusCa (Zheng et al., 12 Sep 2025) | DiT, video | 4.96× FLUX, >4× IM | IR +0.51%, VBench −0.7 |
| ProCache (Cao et al., 19 Dec 2025) | DiT, PixArt | 2.9× DiT | FID +0.53 (vs. vanilla) |
| WorldCache (Feng et al., 6 Mar 2026) | World Models | 3.7× | >98% WorldScore |
| MEPIC (Wang et al., 18 Dec 2025) | LLM (serving) | 2–5× HBM save | ±0 |
These empirical gains are realized without retraining, preserve or even improve generation quality under moderate to high compression, and scale to hundreds of thousands or millions of tokens.
5. Limitations, Failure Modes, and Trade-offs
Selective token caching introduces nontrivial trade-offs:
- Aggressive pruning or stale caches: Excessive pruning (e.g., >70% tokens) or infrequent cache refreshes may cause feature drift, loss of critical information, or uncorrectable errors in downstream layers (ToCa (Zou et al., 2024), DaTo (Zhang et al., 2024)).
- Dynamic context, rare tokens, or highly entangled patterns: In highly dynamic queries or tasks requiring dense, global context (retrieval, co-reference, reasoning), simple importance proxies may underselect critical tokens; learned predictors help, but domain adaptation issues arise (DynTS (Guo et al., 26 Jan 2026)).
- Overhead of dynamic scheduling: Scheduling, clustering, or dynamic masking imposes additional CPU/GPU latency and complexity; however, empirical cost is typically offset by the compute savings at large scale (ClusCa (Zheng et al., 12 Sep 2025), TokenSelect (Wu et al., 2024)).
- Opaqueness and deployability: Predictive or attention-based selectors expose new axes for interpretability challenges; fine-tuning and integration with serving stacks (e.g., paged storage, position alignment) require nontrivial engineering (MEPIC (Wang et al., 18 Dec 2025)).
6. Connections to Related Paradigms and Future Directions
Selective token caching represents a convergence of ideas from structured memory management, attention sparsification, and compressive transformer architectures:
- It is closely related to attention-pruning, memory compaction, prefix-chunk sharing, efficient prefill/decoding, and adaptive computation graphs.
- Emerging directions include end-to-end jointly trained token selectors (LAC (Wei et al., 31 Jan 2026)), reinforcement learning over cache scheduling (suggested in DynTS (Guo et al., 26 Jan 2026)), cross-modality context retention rules (WorldCache (Feng et al., 6 Mar 2026)), and chunk or region-level edit propagation in diffusion editing (Qin et al., 26 Dec 2025).
- Research continues on correlation between attention-derived token importance and model robustness or generalization, as well as extensions to encoder-decoder, transducer, or multi-modal architectures.
Selective token caching has become a core technique for scalable inference in state-of-the-art language, vision, and simulation models, offering principled and empirically validated tools for cost-effective deployment at scale.