Token-Based Cache Reduction
- Token-based cache reduction is a set of algorithmic techniques that select, prune, and compress key token representations in Transformer models to lower memory and computational costs.
- It employs methods such as attention-score driven pruning, learned cache predictors, redundancy analysis, and dynamic scheduling to identify and retain high-importance tokens.
- These approaches achieve significant cache size reduction—up to 90% in some cases—and speedups with minimal impact on model accuracy across various inference tasks.
Token-based cache reduction refers to a collection of algorithmic techniques designed to reduce the computational and memory overhead of neural network inference—particularly in Transformer-based architectures—by selectively retaining, pruning, quantizing, or compressively representing only the most important token-associated states in intermediate or persistent model "caches." This paradigm has been extensively studied in both generative LLMs and diffusion-based generative models, with approaches targeting either the Key/Value (KV) caches in LLMs or the feature maps and attention caches in diffusion transformers and vision transformers. The techniques span token importance scoring, redundancy and similarity analysis, learned or plug-in predictors, composite token construction, dynamic scheduling schemes, direct hardware cache management, and quantization strategies with token-aware precision.
1. Motivation and Fundamental Principles
The memory and computational cost of caching all tokens' intermediate representations—such as key/value states in attention layers—scales linearly with context length and model depth. In LLM inference, processing context windows of tens or hundreds of thousands of tokens requires storing and retrieving on the order of 2·l·n·d activations (where n is the sequence length, d the hidden size, and l the number of layers; the factor 2 accounts for keys and values), leading to GPU memory saturation, reduced throughput, and excessive data movement between compute units and memory. Similarly, in diffusion transformers, iterative denoising steps involve repeated computation and storage of per-token features across multiple blocks, compounding cost due to the multi-step nature of sampling and the quadratic complexity of attention (Lou et al., 2024).
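As a back-of-envelope illustration of the 2·l·n·d scaling (the model dimensions below are typical of a 7B-class Transformer but are illustrative assumptions, not figures from any cited paper):

```python
def kv_cache_bytes(seq_len, hidden_size, num_layers, bytes_per_elem=2):
    """KV cache footprint: keys and values (factor 2) for every layer and token,
    stored at `bytes_per_elem` bytes per value (2 for fp16)."""
    return 2 * num_layers * seq_len * hidden_size * bytes_per_elem

# e.g. 32 layers, hidden size 4096, a 128K-token context, fp16 storage
gb = kv_cache_bytes(128_000, 4096, 32) / 1e9   # ≈ 67 GB for the cache alone
```

At these settings the cache alone exceeds the memory of most single accelerators, which is precisely the pressure token-based reduction targets.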
Token-based cache reduction aims to minimize these resource demands by
- Determining which tokens contribute meaningfully to the network's output, using various importance metrics,
- Pruning or otherwise compressing less-impactful tokens from the cache, and
- Adapting retention dynamically to task, context, and architectural features to limit quality degradation.
The underlying premise is that the marginal utility of storing all input tokens is limited: a small, well-chosen subset often suffices for downstream predictive accuracy and generative coherence.
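The select-then-prune loop above can be sketched generically (a minimal illustration with a random importance signal, not any specific cited method):

```python
import numpy as np

def compress_cache(keys, values, importance, budget):
    """Keep only the `budget` highest-importance tokens' KV entries."""
    keep = np.argsort(importance)[-budget:]   # indices of the top-`budget` tokens
    keep.sort()                               # preserve original positional order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
k, v = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
imp = rng.random(100)                         # stand-in importance scores
k_small, v_small = compress_cache(k, v, imp, budget=20)   # 5x compression
```

The methods in the next sections differ chiefly in how `importance` is defined and when the selection is performed.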
2. Token Importance Estimation and Pruning Algorithms
Across both domains (LLMs and generative diffusion models), methods for token selection fall into several algorithmic classes:
(a) Attention-Score-Driven Pruning:
Classical pruning is built around the observation that attention weights provide a natural importance signal. Approaches such as H₂O, Scissorhands, and StreamLLM accumulate or window attention scores to identify "heavy-hitter" tokens (Guo et al., 2024). Value-Aware Token Pruning (VATP) refines this by integrating the ℓ₁-norm of each token's value vector into a composite importance score (attention mass weighted by value norm), retaining the tokens with the largest scores (Guo et al., 2024). This outperforms attention-only metrics on diverse tasks.
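A sketch of a VATP-style composite score, combining accumulated attention with the value vector's ℓ₁-norm as described above (normalization details in the cited paper may differ; this is illustrative):

```python
import numpy as np

def vatp_scores(attn_weights, values):
    """attn_weights: (num_queries, num_cached) attention matrix;
    values: (num_cached, d_v) cached value vectors.
    Returns a per-token composite importance score."""
    accumulated = attn_weights.sum(axis=0)    # attention mass per cached token
    value_norm = np.abs(values).sum(axis=1)   # l1-norm of each value vector
    return accumulated * value_norm

rng = np.random.default_rng(1)
a = rng.random((8, 50))
a /= a.sum(axis=1, keepdims=True)             # rows behave like softmax outputs
v = rng.normal(size=(50, 64))
keep = np.argsort(vatp_scores(a, v))[-10:]    # retain the top-10 tokens
```

Tokens that receive attention but carry near-zero value vectors contribute little to the output, which is exactly what the value-norm factor down-weights.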
(b) Learned Cache Predictors:
In diffusion transformers, TokenCache leverages a lightweight MLP ("Cache Predictor") trained to output per-token importance scores, using an MSE objective that interpolates between full inference and cached feature reuse. Grid-based selection then identifies the tokens with the lowest predicted scores for pruning, and adaptive block selection focuses pruning on the multi-block regions deemed least impactful (Lou et al., 2024).
(c) Redundancy and Similarity Analysis:
Methods such as R-KV directly quantify token-level redundancy by computing the cosine similarity of key vectors among candidate tokens, generating a redundancy score (normalized softmax of mean similarities). This is combined with importance scoring for joint selection, achieving lossless compression down to 10% of the cache in reasoning models (Cai et al., 30 May 2025). KVCrush uses an efficient binary ("fingerprint") signature of each token's per-head attention behavior and groups evicted tokens by Hamming distance, balancing diversity retention against low overhead (Jha et al., 24 Feb 2025).
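The R-KV-style redundancy score (softmax over mean pairwise key similarities) can be sketched as follows; the weighted-difference combination with importance at the end is a simple illustrative stand-in, not R-KV's exact joint-selection rule:

```python
import numpy as np

def redundancy_scores(keys):
    """Softmax-normalized mean cosine similarity of each key to all others."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = k @ k.T                              # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                 # ignore self-similarity
    mean_sim = sim.mean(axis=1)
    e = np.exp(mean_sim - mean_sim.max())      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
keys = rng.normal(size=(40, 64))
importance = rng.random(40)                    # stand-in importance signal
joint = importance - 0.5 * redundancy_scores(keys)   # prefer important, novel tokens
keep = np.argsort(joint)[-10:]
```

Down-ranking highly redundant keys frees budget for tokens that carry genuinely new context.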
(d) Structural Compression Schemes:
HashEvict employs Locality Sensitive Hashing (LSH, SimHash) of query and cached key vectors for pre-attention eviction, replacing keys maximally dissimilar (in hash space) from each new query and achieving 30–70% compression (Liu et al., 2024). KVCompose constructs layer-adaptive "composite tokens" via attention-guided aggregation, assigning per-head, per-token importance scores; a global budget allocator adapts allocation across layers to maximize aggregate retained importance (Akulov et al., 5 Sep 2025).
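The SimHash mechanism behind HashEvict-style pre-attention eviction can be sketched as below (signature width and eviction policy are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

def simhash(x, planes):
    """SimHash: sign pattern of projections onto random hyperplanes -> bit vector."""
    return (x @ planes.T > 0).astype(np.uint8)   # shape (n, num_bits)

def evict_most_dissimilar(query_sig, key_sigs):
    """Index of the cached key whose signature is farthest (Hamming) from the query."""
    hamming = (key_sigs != query_sig).sum(axis=1)
    return int(np.argmax(hamming))

rng = np.random.default_rng(3)
planes = rng.normal(size=(16, 64))               # 16-bit signatures
keys = rng.normal(size=(32, 64))                 # cached key vectors
query = rng.normal(size=64)                      # incoming query
victim = evict_most_dissimilar(simhash(query[None], planes)[0],
                               simhash(keys, planes))
```

Because only bitwise comparisons are needed at decode time, the eviction decision costs far less than computing the full attention row it approximates.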
(e) Tree and Smooth Hierarchical Compression:
TreeKV organizes the token cache as a balanced binary tree, using local attention-weighted retention and a rotation-based eviction scope to enforce smooth, coarse-to-fine granularity from distant past to recent context. This approach, inspired by wavelet analysis, maintains context diversity and outperforms position-only or global-importance approaches (He et al., 9 Jan 2025).
3. Quantization and Token-Aware Precision Strategies
A parallel axis of optimization is aggressive quantization of cache states, where token importance affects precision allocation.
- Anchor Token-Aware Quantization: AnTKV computes per-token Anchor Scores (AnS) that quantify the sensitivity of each token's key/value cache to quantization-induced error, preserving a small set of high-AnS tokens in full precision and subjecting the remainder to ultra-low-bit (below 1 bit per value) sub-vector quantization. This enables up to 10–40× reduction in cache size with minimal perplexity loss, and allows single-GPU handling of contexts up to 840K tokens (Li et al., 24 Jun 2025).
- Mixed-Precision with Saliency Heuristics: ZipCache applies channel-separable quantization, normalizing channel outliers before per-token quantization. It uses a normalized accumulated attention score—corrected for lower-triangular bias—for token saliency estimation, maintaining high accuracy at compression ratios near 5× for tasks such as GSM8k and HumanEval (He et al., 2024).
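The triangular-bias correction that ZipCache applies to accumulated attention can be sketched as follows (a toy uniform-causal-attention example; the paper's exact normalization may differ): under causal masking, early tokens appear in more attention rows and so accumulate more raw score, which dividing by each token's attend-count removes.

```python
import numpy as np

def debiased_saliency(attn):
    """attn: (n, n) causal attention matrix (row i attends to tokens <= i).
    Normalize each token's accumulated attention by how many queries could see it."""
    n = attn.shape[0]
    accumulated = attn.sum(axis=0)            # total attention each token received
    times_attended = n - np.arange(n)         # token j is visible to n - j queries
    return accumulated / times_attended

attn = np.tril(np.full((6, 6), 1.0))
attn /= attn.sum(axis=1, keepdims=True)       # uniform causal attention
sal = debiased_saliency(attn)
```

On this toy input the raw accumulated scores favor token 0 by an order of magnitude; the debiased saliency shrinks that gap substantially, so late tokens are no longer systematically undervalued.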
These methods combine the token-pruning and quantization axes, yielding further memory reduction without unacceptable accuracy degradation.
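A minimal sketch of this combined axis in the spirit of AnTKV's anchor tokens: the highest-scoring tokens stay in full precision while the rest are round-tripped through a low-bit quantizer. Simple per-token uniform quantization stands in here for AnTKV's sub-vector scheme.

```python
import numpy as np

def mixed_precision_cache(kv, anchor_scores, num_anchors, bits=2):
    """Keep the `num_anchors` highest-scoring tokens exact; quantize the rest
    to `bits`-bit uniform levels and dequantize (simulating storage loss)."""
    levels = 2 ** bits - 1
    anchors = set(np.argsort(anchor_scores)[-num_anchors:].tolist())
    out = kv.copy()
    for t in range(kv.shape[0]):
        if t in anchors:
            continue                          # anchor tokens stay full precision
        lo, hi = kv[t].min(), kv[t].max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((kv[t] - lo) / scale)    # integer code in [0, levels]
        out[t] = q * scale + lo               # dequantized approximation
    return out

rng = np.random.default_rng(4)
kv = rng.normal(size=(20, 32))
scores = rng.random(20)                       # stand-in anchor scores
deq = mixed_precision_cache(kv, scores, num_anchors=4)
```

Sensitivity-aware precision allocation of this kind is why anchor-style schemes tolerate far lower average bitrates than uniform quantization.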
4. Dynamic Scheduling, Adaptive Layering, and Structural Integration
(a) Decoupled Scheduling:
FastKV finds that token-importance sets stabilize only at later layers. It introduces a decoupled two-stage process: all tokens are processed up to a Token-Selective Propagation (TSP) layer, after which only the top-ranked tokens are propagated, with each subsequent layer independently pruning its cache to a fixed retention fraction. This allows a flexible accuracy/efficiency tradeoff unattainable in fixed-layer or single-stage strategies (Jo et al., 3 Feb 2025).
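A toy sketch of the TSP idea (the per-subsequent-layer pruning stage is omitted for brevity; the layers and scoring function here are illustrative stand-ins, not FastKV's actual components):

```python
import numpy as np

def tsp_forward(x, layers, tsp_layer, keep, score_fn):
    """Run all tokens through the first `tsp_layer` layers, then propagate
    only the top-`keep` tokens (ranked by `score_fn`) through the rest."""
    for i, layer in enumerate(layers):
        if i == tsp_layer:
            idx = np.sort(np.argsort(score_fn(x))[-keep:])
            x = x[idx]                        # token-selective propagation
        x = layer(x)
    return x

rng = np.random.default_rng(5)
# six toy "layers": random linear maps with tanh nonlinearity
layers = [lambda h, W=rng.normal(size=(16, 16)) / 4: np.tanh(h @ W)
          for _ in range(6)]
x = rng.normal(size=(100, 16))                # 100 tokens, width 16
out = tsp_forward(x, layers, tsp_layer=3, keep=25,
                  score_fn=lambda h: np.linalg.norm(h, axis=1))
```

All layers after the TSP point operate on a quarter of the tokens, which is where the compute and cache savings come from.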
(b) Adaptive Layer Selection:
ASL (Adaptive Selection Layer) dynamically finds the optimal layer at which to conduct one-shot token selection by monitoring the variance of token-importance ranks across a look-back window of layers. When the variance drops below a threshold, this indicates that the important tokens have stabilized and pruning can proceed. ASL outperforms static-layer token selection methods and integrates with SnapKV/GemFilter for further gains (Taniguchi et al., 12 Jan 2026).
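The rank-stabilization test can be sketched as below (window size, threshold, and the toy score trajectories are illustrative assumptions, not ASL's published settings):

```python
import numpy as np

def ranks(scores):
    """Rank of each token by importance (0 = least important)."""
    r = np.empty(len(scores), dtype=int)
    r[np.argsort(scores)] = np.arange(len(scores))
    return r

def selection_layer(per_layer_scores, window=3, threshold=5.0):
    """First layer where the mean per-token variance of importance ranks,
    over a look-back window of preceding layers, falls below `threshold`."""
    rank_hist = [ranks(s) for s in per_layer_scores]
    for layer in range(window, len(rank_hist)):
        recent = np.stack(rank_hist[layer - window:layer])
        if recent.var(axis=0).mean() < threshold:
            return layer                      # ranks have stabilized: prune here
    return len(rank_hist) - 1                 # fall back to the last layer

# toy trajectory: noisy ranks in early layers, a stable profile afterwards
rng = np.random.default_rng(6)
base = np.arange(30, dtype=float)
scores = [rng.permutation(30).astype(float) for _ in range(4)] + \
         [base + rng.normal(scale=0.01, size=30) for _ in range(6)]
layer = selection_layer(scores)
```

On this toy trajectory the window first contains only stable layers at index 7, so selection triggers there rather than at a fixed, hand-picked depth.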
(c) Two-Phase Round Robin and Hybrid Streamed Attention:
TokenCache uses a two-phase Round Robin schedule to alternate between periods of cached-pruned computation and full independent steps, tuning cache intervals in early and late diffusion steps for an optimal fidelity/speed balance (Lou et al., 2024). SimLayerKV (LightTransfer) in LLMs performs dynamic identification of "lazy" layers with streaming attention (prefix+window retention), interleaving these with full-attention layers to achieve up to 2.0× cache compression at <2% performance loss (Zhang et al., 2024).
5. Practical Impact: Performance, Quality Trade-Offs, and System-Level Integration
The empirical outcomes span:
- LLMs: VATP, TreeKV, R-KV, KVCompose, SAGE-KV, and KVCrush commonly achieve 50–90% cache reduction, with typical accuracy losses <1–2% across LongBench, RULER, InfiniteBench, and NIAH (Guo et al., 2024, He et al., 9 Jan 2025, Cai et al., 30 May 2025, Akulov et al., 5 Sep 2025, Wang et al., 11 Mar 2025, Jha et al., 24 Feb 2025). TreeKV supports up to 16× cache reduction with best-in-class perplexity at optimal budgets (6% cache) (He et al., 9 Jan 2025). FastKV and ASL reach speedups up to 2.87× and competitive or better performance on hard retrieval and reasoning benchmarks by adaptively tuning pruning depth (Jo et al., 3 Feb 2025, Taniguchi et al., 12 Jan 2026).
- Diffusion Transformers: TokenCache achieves 1.3–1.5× wall-clock speedup on A100 while degrading FID minimally (e.g., full FID 1.86 → TokenCache FID 2.08 at 1.51× speedup) (Lou et al., 2024). DaTo in Stable Diffusion combines feature caching with patch-based, dynamics-aware token pruning, delivering up to 9× acceleration and even better FID due to extended feature dynamics (Zhang et al., 2024).
- Resource Allocation in Serving Environments: System-level platforms such as Tokencake and TokenLake exploit fine-grained (token- or segment-level) cache management for multi-agent scheduling and distributed serving. Tokencake uses a hybrid priority-aware scheduler and predictive offload to achieve up to 47% end-to-end latency reductions and ≈17% higher GPU cache occupancy (Bian et al., 21 Oct 2025); TokenLake's segment-level pooling, heavy-hitter load balancing, and deduplication achieve 2.6× throughput and 2.1× hit-rate improvements over leading cache-routing frameworks (Wu et al., 24 Aug 2025).
- Quantization: AnTKV and ZipCache demonstrate that quantization is most effective when paired with token-aware importance metrics, often outperforming uniform or groupwise schemes at extreme bitrates (Li et al., 24 Jun 2025, He et al., 2024).
These approaches are generally compatible with inference acceleration frameworks (e.g. FlashAttention), are training-free or require limited tuning, and often plug into existing model code without custom kernels or retraining.
6. Limitations, Ablations, and Future Directions
Commonly identified limitations include:
- Potential approximation error and bias in aggressive pruning or hashing-based approaches, especially for tasks requiring long-range, low-instantaneous-attention context (as observed for HashEvict and layer-freezing methods) (Liu et al., 2024, Zhang et al., 2024).
- Diminishing returns or sudden accuracy drop when pruning ratios exceed 50–70%, as seen in TokenCache's FID sweeps and ZipCache's ablations (Lou et al., 2024, He et al., 2024).
- Sensitivity to hyperparameter selection (e.g., prune rates, window sizes, Gumbel temperature schedules, and quantization bit allocations).
- Current methods are less effective or not yet integrated for Grouped-Query Attention or highly multimodal architectures (Guo et al., 2024).
Active areas of research include multi-ary or hierarchical segmentations (TreeKV), hybrid approaches that mix pre-attention and accumulated-attention scoring, dynamic per-layer/per-head cache allocation, ultra-low-precision quantization stabilized by anchor-aware selection, and system integration with paging or offloading schemes.
7. Summary Table of Representative Methods
| Method | Core Strategy | Typical Compression | Performance Impact | Reference |
|---|---|---|---|---|
| VATP | Attention + value-norm | 2× | <1–2% task loss | (Guo et al., 2024) |
| TokenCache | Learned cache predictor | 1.5× speedup | FID degradation <0.2 | (Lou et al., 2024) |
| TreeKV | Tree-structured retention | 16× | ~0.1–0.3 perplexity delta | (He et al., 9 Jan 2025) |
| R-KV | Redundancy-aware selection | 10× | Lossless for reasoning | (Cai et al., 30 May 2025) |
| HashEvict | LSH, pre-attention eviction | 1.4–3.3× | ~1–2% loss at 50% | (Liu et al., 2024) |
| KVCompose | Composite token pooling | up to 10× | AUC +10–20 pts vs baselines | (Akulov et al., 5 Sep 2025) |
| FastKV | Adaptive TSP layer, per-layer | 2–10× | <1% avg loss, faster | (Jo et al., 3 Feb 2025) |
| AnTKV | Anchor-guided quantization | 10–40× | <1–2 perplexity (ultra-lowbit) | (Li et al., 24 Jun 2025) |
| ZipCache | Token-adaptive quantization | 5× | <0.5% loss, fast | (He et al., 2024) |
| KVCrush | Attention-head fingerprinting | 4× | <1% loss | (Jha et al., 24 Feb 2025) |
| SAGE-KV | Attention-guided one-pass | 4× | <0.6 pp avg acc loss | (Wang et al., 11 Mar 2025) |
| CLCA (ViT) | Cross-layer info recovery | up to 10× (ViT) | Matches SoTA at r=10% | (Rios et al., 2024) |
| ASL | Adaptive selection layer | 2–10× | Outperforms static baselines | (Taniguchi et al., 12 Jan 2026) |
For further mathematical details, quantitative metrics, and architecture-specific ablations, see the cited arXiv papers.