
Selective Token Caching Explained

Updated 10 March 2026
  • Selective token caching is an inference strategy that dynamically identifies and retains critical tokens while pruning less informative ones to optimize transformer efficiency.
  • It leverages attention scores, learned predictors, and clustering methods to achieve significant speedups and memory savings across language, vision, and diffusion models.
  • Practical implementations balance computational savings with accuracy, reporting up to 23.8× acceleration and substantial memory reduction without retraining.

Selective token caching refers to any inference-time strategy that reduces the memory and computation footprint of transformer-based neural models by caching, pruning, skipping, or re-computing only a dynamically chosen subset of tokens’ intermediate activations or key-value (KV) cache entries across layers, timesteps, or requests. Unlike naive block-based or full-sequence caching, selective token caching employs analytical, learned, or rule-based methods to identify—at run time—which tokens are most critical for prediction, attention, or downstream reasoning, and aggressively reuses, compresses, or discards the remainder. This paradigm is central to scaling language, vision, diffusion, and world simulation models in resource-constrained environments and ultra-long-context scenarios, and encompasses a broad range of recent algorithmic developments in both language modeling and generative modeling.

1. Fundamental Principles and Objectives

The main objective of selective token caching is to reduce the asymptotic memory and computation complexity inherent in transformer architectures, which otherwise scale linearly (memory) or quadratically (attention FLOPs) with context length or number of input tokens. In canonical autoregressive LLMs and diffusion transformers, all prior tokens’ KV states or feature maps are preserved for each new inference step, incurring costs that rapidly become prohibitive as context grows.
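
To make the scaling argument concrete, the following back-of-the-envelope sketch computes KV cache size and attention-score FLOPs. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements) and the keep-one-in-eight retention ratio are illustrative assumptions, not figures from any cited paper, and the FLOP estimate ignores causal masking.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for seq_len tokens."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def attention_score_flops(seq_len, n_layers, n_heads, head_dim):
    """Approximate FLOPs for the QK^T score matrix over a full prefill."""
    # Each of the seq_len x seq_len scores is a head_dim-long dot product
    # (one multiply and one add per element); causal masking is ignored.
    return 2 * n_layers * n_heads * head_dim * seq_len * seq_len


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
per_token = kv_cache_bytes(1, **cfg)
full = kv_cache_bytes(131072, **cfg)
kept = kv_cache_bytes(131072 // 8, **cfg)  # retain 1 token in 8

print(f"per token: {per_token / 2**20:.2f} MiB")
print(f"full 128K context: {full / 2**30:.0f} GiB")
print(f"1-in-8 retained:   {kept / 2**30:.0f} GiB")
```

Memory grows in direct proportion to the number of retained tokens, while doubling the sequence length quadruples the score FLOPs, which is why capping the retained set pays off most at long context.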

Selective token caching strategies aim to:

  • Identify non-uniform token-level contributions to model outputs, exploiting empirical sparsity or locality in the attention, hidden, or feature dynamics.
  • Retain or recompute only tokens deemed “critical” by task-specific or model-specific metrics—such as attention weight, contextual saliency, predictive dynamics, or learned importance scores.
  • Support adaptive cache eviction, feature reuse, or surrogate prediction on non-critical (“redundant” or “static”) tokens to cap memory usage and accelerate inference.
  • Control accuracy-speed-memory trade-offs with user-settable hyperparameters or online policies.
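
The goals listed above can be illustrated with a minimal budgeted cache. This is a generic sketch, not the policy of any cited method: the class name `BudgetedTokenCache`, the `capacity` and `recent` hyperparameters, and the rule of always protecting the newest tokens are assumptions made for illustration. Importance scores are supplied by the caller.

```python
class BudgetedTokenCache:
    """Keeps at most `capacity` tokens; never evicts the `recent` newest ones."""

    def __init__(self, capacity, recent):
        if not 0 <= recent < capacity:
            raise ValueError("need 0 <= recent < capacity")
        self.capacity = capacity
        self.recent = recent
        self.entries = []  # (position, score, payload), oldest first

    def add(self, position, score, payload=None):
        self.entries.append((position, score, payload))
        if len(self.entries) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Only entries outside the protected recent window are candidates.
        candidates = self.entries[: len(self.entries) - self.recent]
        victim = min(candidates, key=lambda e: e[1])  # lowest importance
        self.entries.remove(victim)

    def positions(self):
        return [p for p, _, _ in self.entries]


# Hypothetical importance scores for six incoming tokens.
cache = BudgetedTokenCache(capacity=4, recent=2)
for pos, score in enumerate([0.9, 0.1, 0.5, 0.05, 0.7, 0.2]):
    cache.add(pos, score)
print(cache.positions())
```

Here `capacity` caps memory and `recent` trades a little of that budget for locality; both are the kind of user-settable knobs the last bullet refers to.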

This approach is deployed both in long-context LLM inference via selective KV cache management (Lou et al., 2024, Wu et al., 2024, Akhauri et al., 10 Mar 2025, Goel et al., 18 Apr 2025, Guo et al., 26 Jan 2026, Wang et al., 18 Dec 2025), and in vision and diffusion models via feature token selection, prediction, or clustering (Zou et al., 2024, Zhang et al., 2024, Liu et al., 26 May 2025, Qin et al., 26 Dec 2025, Cao et al., 19 Dec 2025, Zheng et al., 12 Sep 2025, Feng et al., 6 Mar 2026, Zuo et al., 28 Jan 2026).

2. Mechanisms and Algorithms for Token Selection

Attention-Derived Importance

  • Score-based selection: Methods such as TokenSelect (Wu et al., 2024) compute per-token importance $s_j$ by aggregating per-head Query-Key dot products, using a head-wise softmax-aggregation to avoid dominance by single heads. The top-$k$ tokens thus identified are retained for attention, and the rest are omitted, yielding up to 23.8× speedup with minimal degradation.
  • Eviction error minimization: CAOTE (Goel et al., 18 Apr 2025) defines the precise $L_2$ deviation in the attention output induced by evicting token $j$:

$$c_j = \frac{\alpha_j}{1-\alpha_j} \, \| X_{\text{attn}} - v_j \|_2$$

with $\alpha_j$ the attention score, $v_j$ the value vector, and $X_{\text{attn}}$ the attention output before eviction. The token with minimal $c_j$ is evicted, which keeps the drift in the attention output under close control.
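
Both bullets can be sketched in a few lines of plain Python. `select_top_k` is only one plausible reading of "head-wise softmax aggregation" (normalise each head's scores, then sum across heads) and should not be taken as TokenSelect's actual kernel. `caote_costs` evaluates the formula above, and `output_after_eviction` is a hypothetical helper that checks it against explicit renormalisation of the remaining attention weights. All inputs are made-up toy values.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_top_k(queries, keys, k):
    """queries[h] is head h's query; keys[h][j] is head h's key for token j.
    Returns the k token indices with the largest summed per-head softmax
    weight, in ascending position order."""
    n_tokens = len(keys[0])
    scores = [0.0] * n_tokens
    for q, head_keys in zip(queries, keys):
        scale = math.sqrt(len(q))
        # Normalising per head keeps one large-magnitude head from dominating.
        weights = softmax([dot(q, key) / scale for key in head_keys])
        for j, w in enumerate(weights):
            scores[j] += w
    top = sorted(range(n_tokens), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def attention_output(alphas, values):
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]


def caote_costs(alphas, values):
    """c_j = alpha_j / (1 - alpha_j) * ||X_attn - v_j||_2 (needs alpha_j < 1)."""
    x_attn = attention_output(alphas, values)
    return [a / (1.0 - a) * math.dist(x_attn, v) for a, v in zip(alphas, values)]


def output_after_eviction(alphas, values, j):
    """Attention output once token j is dropped and weights are renormalised."""
    rest_a = [a / (1.0 - alphas[j]) for i, a in enumerate(alphas) if i != j]
    rest_v = [v for i, v in enumerate(values) if i != j]
    return attention_output(rest_a, rest_v)


# Toy data: two heads, three tokens for selection; four tokens for eviction.
kept = select_top_k(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[[5.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
          [[0.0, 0.0], [0.0, 0.0], [0.0, 4.0]]],
    k=2,
)
alphas = softmax([2.0, 0.5, 1.0, -1.0])
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, -2.0]]
costs = caote_costs(alphas, values)
victim = min(range(len(costs)), key=costs.__getitem__)
print("kept tokens:", kept, "evict token:", victim)
```

The closed form follows from renormalisation: dropping token $j$ gives $X' = (X_{\text{attn}} - \alpha_j v_j)/(1-\alpha_j)$, so $X' - X_{\text{attn}} = \frac{\alpha_j}{1-\alpha_j}(X_{\text{attn}} - v_j)$, which is why each cost can be computed without re-running attention.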

Learned or Predictive Importance

  • Predictor-based selection: TokenButler (Akhauri et al., 10 Mar 2025) and DynTS (Guo et al., 26 Jan 2026) attach auxiliary predictors (small MLPs) to estimate the future influence of each token, either on next-token generation quality or, for reasoning tasks, on final answer accuracy. DynTS, for example, retains only the top-$k$ scored “decision-critical” tokens from long reasoning traces, maintaining full performance at 1.6–1.9× speedup and >3× memory reduction.
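
As a rough illustration of predictor-based selection, the sketch below scores each token's hidden state with a one-hidden-layer MLP and keeps the top-k. The weights are hand-set placeholders standing in for a trained predictor, and the function names are hypothetical; neither TokenButler's nor DynTS's actual architecture or training objective is reproduced here.

```python
import math


def mlp_score(h, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a sigmoid; returns a score in (0, 1)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * a for w, a in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def retain_top_k(hidden_states, k, params):
    """Returns (kept indices in position order, all scores)."""
    scores = [mlp_score(h, *params) for h in hidden_states]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


# Placeholder weights: a real predictor would be trained, not hand-written.
params = (
    [[1.0, -1.0], [0.5, 0.5]],  # w1: 2 hidden units x 2 input features
    [0.0, 0.0],                 # b1
    [2.0, -1.0],                # w2
    0.0,                        # b2
)
hidden_states = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
kept, scores = retain_top_k(hidden_states, k=2, params=params)
print(kept, [round(s, 3) for s in scores])
```

The predictor runs once per token and is far smaller than the model it serves, which is what keeps the selection overhead low relative to the attention it avoids.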

Feature and Dynamics Metrics

  • Temporal token “dynamics”: DaTo (Zhang et al., 2024) and FastCache (Liu et al., 26 May 2025) prune tokens whose feature change $\|\mathbf{h}_t - \mathbf{h}_{t-1}\|_2$ is below a threshold (“static” tokens) and only process high-dynamic tokens. DaTo achieves up to 9× acceleration with improved or maintained sample quality.
  • Curvature and multidimensional dynamics: WorldCache (Feng et al., 6 Mar 2026) computes per-token “curvature” from discrete accelerations. Stable tokens are zero-order cached, linear tokens linearly extrapolated, and chaotic tokens are handled with Hermite-blended velocity predictors. Adaptive skipping is triggered only when dimensionless, normalized drift exceeds a threshold, sustaining up to 3.7× acceleration at >98% quality, even in multi-modal world models.
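
The two dynamics-based criteria can be sketched as follows. `split_static_dynamic` applies the feature-change threshold directly; `classify_by_curvature` uses first and second finite differences as stand-ins for velocity and acceleration. The thresholds `tau`, `tau_v`, and `tau_a` and the three-way rule are illustrative assumptions, and the Hermite-blended predictor for chaotic tokens is not implemented.

```python
import math


def split_static_dynamic(prev, curr, tau):
    """Indices of tokens whose feature change is below / at-or-above tau."""
    static, dynamic = [], []
    for i, (p, c) in enumerate(zip(prev, curr)):
        (static if math.dist(c, p) < tau else dynamic).append(i)
    return static, dynamic


def classify_by_curvature(h0, h1, h2, tau_v, tau_a):
    """Label one token from its features at three consecutive timesteps."""
    velocity = math.dist(h2, h1)
    # Second finite difference h2 - 2*h1 + h0 as a discrete acceleration.
    accel = math.sqrt(sum((c - 2 * b + a) ** 2 for a, b, c in zip(h0, h1, h2)))
    if velocity < tau_v:
        return "stable"   # reuse the cached feature unchanged (zero-order)
    if accel < tau_a:
        return "linear"   # predict the next feature by linear extrapolation
    return "chaotic"      # recompute, or use a higher-order predictor


def extrapolate_linear(h1, h2):
    """Next-step estimate h3 = h2 + (h2 - h1)."""
    return [2 * b - a for a, b in zip(h1, h2)]


# Toy features for three tokens at two consecutive timesteps.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
curr = [[0.0, 0.01], [1.5, 1.0], [2.0, 2.0]]
static, dynamic = split_static_dynamic(prev, curr, tau=0.1)
print("reuse cache for", static, "recompute", dynamic)
print(classify_by_curvature([0.0], [1.0], [2.0], tau_v=0.5, tau_a=0.5))
```

Only the tokens in `dynamic` would be sent through the expensive layers; the rest are served from the previous step's cache or from an extrapolated estimate.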

Clustering and Spatial Grouping

  • Cluster-driven feature caching: ClusCa (Zheng et al., 12 Sep 2025) clusters spatial tokens (e.g. via k-means), recomputes only one token per cluster, and broadcasts updates by propagating representative token features through the cluster. This yields a roughly 90% reduction in per-layer token computation with negligible or positive impacts on image/video generation quality.
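
A minimal sketch of the cluster-then-broadcast idea, assuming plain Lloyd's k-means with a deterministic start and the crudest possible propagation rule (copy the representative's output to every member). ClusCa's actual propagation may blend cached and fresh features; the function names and the stand-in layer `fn` are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def recompute_representatives(tokens, assign, centroids, fn):
    """Run fn on one token per cluster and copy the result to its members."""
    outputs = [None] * len(tokens)
    calls = 0
    for c, centre in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        # Representative: the member closest to the cluster centroid.
        rep = min(members, key=lambda i: sq_dist(tokens[i], centre))
        out = fn(tokens[rep])
        calls += 1
        for i in members:
            outputs[i] = out
    return outputs, calls


# Six toy tokens forming two obvious spatial groups.
tokens = [[0.0, 0.0], [10.0, 10.0], [0.2, 0.1],
          [9.8, 10.1], [0.1, 0.3], [10.2, 9.9]]
assign, centroids = kmeans(tokens, k=2)
outputs, calls = recompute_representatives(
    tokens, assign, centroids, fn=lambda t: [2 * x for x in t]
)
print(f"{calls} layer calls for {len(tokens)} tokens")
```

With one call per cluster, the compute saved is roughly one minus the ratio of clusters to tokens, so the reported savings imply clusters of about ten tokens each.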

3. Temporal, Spatial, and Architectural Axes of Caching

Selective token caching is applied across several orthogonal axes:

4. Practical Implementations and Quantitative Impact

A variety of implementations combine analytic scores, learned predictions, and scheduler logic:

  • Paged/KV chunk caching: For long-context LLMs, paged KV-caches and block-aligned chunk layouts (MEPIC (Wang et al., 18 Dec 2025)) allow cross-request, position-independent cache reuse with single-block recomputation and RoPE-fused position reconstruction, reducing HBM usage by up to 5× at identical accuracy.
  • Cache predictors and superposed forward passes: TokenCache (Lou et al., 2024) and ToCa (Zou et al., 2024) integrate lightweight networks into the inference loop to provide real-time, differentiable token importance or “caching scores.”
  • Statistical and hypothesis tests: FastCache (Liu et al., 26 May 2025) applies per-block chi-squared tests of feature change to limit error accumulation, combining learned linear surrogate updates with explicit statistical guarantees on the approximation error.
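
The statistical-test idea in the last bullet can be sketched with the standard library alone. The noise model (independent zero-mean changes with a known variance `noise_var`), the significance level, and the function names are assumptions for illustration, not FastCache's published procedure. The chi-squared quantile uses the Wilson-Hilferty approximation because the standard library has no exact routine.

```python
from statistics import NormalDist


def chi2_quantile(dof, p):
    """Wilson-Hilferty approximation to the chi-squared quantile."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3


def can_reuse_block(prev, curr, noise_var, alpha=0.05):
    """True if the change between two flattened feature blocks is consistent
    with zero-mean noise of variance noise_var at significance level alpha."""
    # Under the null hypothesis this statistic is chi-squared with len(prev)
    # degrees of freedom.
    stat = sum((c - p) ** 2 for p, c in zip(prev, curr)) / noise_var
    return stat <= chi2_quantile(len(prev), 1.0 - alpha)


# Toy blocks of 10 features; noise_var is an assumed calibration value.
prev = [0.0] * 10
small_change = [0.1] * 10
large_change = [0.2] * 10
print(can_reuse_block(prev, small_change, noise_var=0.01))
print(can_reuse_block(prev, large_change, noise_var=0.01))
```

Framing reuse as a hypothesis test gives the threshold a meaning: `alpha` bounds how often a block that is genuinely unchanged (under the assumed noise model) is recomputed unnecessarily.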

Table: Selective Token Caching—Empirical Results

| Method | Domain | Acceleration | Quality Penalty |
|---|---|---|---|
| TokenSelect (Wu et al., 2024) | LLMs (KV) | up to 23.8× attn | negligible, <1% |
| CAOTE (Goel et al., 18 Apr 2025) | LLMs (KV) | 2× cache | uniform ↑ accuracy |
| DaTo (Zhang et al., 2024) | SD diffusion | 9× SD, 2.32× SDXL | ΔFID −0.33…−2.17 |
| ClusCa (Zheng et al., 12 Sep 2025) | DiT, video | 4.96× FLUX, >4× IM | IR +0.51%, VBench −0.7 |
| ProCache (Cao et al., 19 Dec 2025) | DiT, PixArt | 2.9× DiT | FID +0.53 (vs. vanilla) |
| WorldCache (Feng et al., 6 Mar 2026) | World Models | 3.7× | >98% WorldScore |
| MEPIC (Wang et al., 18 Dec 2025) | LLM (serving) | 2–5× HBM save | ±0, |

These empirical gains are realized without retraining, preserve or even improve generation quality under moderate to high compression, and scale to hundreds of thousands or millions of tokens.

5. Limitations, Failure Modes, and Trade-offs

Selective token caching introduces nontrivial trade-offs:

  • Aggressive pruning or stale caches: Excessive pruning (e.g., more than 70% of tokens) or infrequent cache refreshes may cause feature drift, loss of critical information, or uncorrectable errors in downstream layers (ToCa (Zou et al., 2024), DaTo (Zhang et al., 2024)).
  • Dynamic context, rare tokens, or highly entangled patterns: In highly dynamic queries or tasks requiring dense, global context (retrieval, co-reference, reasoning), simple importance proxies may underselect critical tokens; learned predictors help, but domain adaptation issues arise (DynTS (Guo et al., 26 Jan 2026)).
  • Overhead of dynamic scheduling: Scheduling, clustering, or dynamic masking imposes additional CPU/GPU latency and complexity; however, empirical cost is typically offset by the compute savings at large scale (ClusCa (Zheng et al., 12 Sep 2025), TokenSelect (Wu et al., 2024)).
  • Opaqueness and deployability: Predictive or attention-based selectors raise new interpretability challenges, and fine-tuning and integration with serving stacks (e.g., paged storage, position alignment) require nontrivial engineering (MEPIC (Wang et al., 18 Dec 2025)).
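
One common safeguard against the stale-cache failure mode in the first bullet is to bound how long a token may be served from cache. The sketch below is a generic guard, not taken from any cited paper: `max_reuse`, `drift_budget`, and the per-step drift estimate passed in by the caller are all assumed quantities.

```python
class ReuseGuard:
    """Forces a recompute once a token has been served from cache too many
    times in a row or its accumulated drift estimate exceeds a budget."""

    def __init__(self, max_reuse, drift_budget):
        self.max_reuse = max_reuse
        self.drift_budget = drift_budget
        self.reuse_count = {}
        self.drift = {}

    def should_recompute(self, token_id, step_drift):
        count = self.reuse_count.get(token_id, 0)
        drift = self.drift.get(token_id, 0.0) + step_drift
        if count >= self.max_reuse or drift > self.drift_budget:
            # Recomputing refreshes the cache entry, so both counters reset.
            self.reuse_count[token_id] = 0
            self.drift[token_id] = 0.0
            return True
        self.reuse_count[token_id] = count + 1
        self.drift[token_id] = drift
        return False


# Token 0 drifts slowly and hits the reuse cap; token 1 drifts fast and
# exhausts its drift budget first.
guard = ReuseGuard(max_reuse=3, drift_budget=1.0)
slow = [guard.should_recompute(0, 0.1) for _ in range(5)]
fast = [guard.should_recompute(1, 0.6) for _ in range(2)]
print(slow, fast)
```

Lower values of either knob trade speed for accuracy, which is the same accuracy-speed-memory dial discussed in Section 1, applied to refresh frequency rather than cache size.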

Selective token caching represents a convergence of ideas from structured memory management, attention sparsification, and compressive transformer architectures:

  • It is closely related to attention-pruning, memory compaction, prefix-chunk sharing, efficient prefill/decoding, and adaptive computation graphs.
  • Emerging directions include end-to-end jointly trained token selectors (LAC (Wei et al., 31 Jan 2026)), reinforcement learning over cache scheduling (suggested in DynTS (Guo et al., 26 Jan 2026)), cross-modality context retention rules (WorldCache (Feng et al., 6 Mar 2026)), and chunk or region-level edit propagation in diffusion editing (Qin et al., 26 Dec 2025).
  • Research continues on correlation between attention-derived token importance and model robustness or generalization, as well as extensions to encoder-decoder, transducer, or multi-modal architectures.

Selective token caching forms the backbone of modern, scalable inference for state-of-the-art language, vision, and simulation models, offering principled and empirically validated tools for cost-effective deployment at scale.
