Selective Token Retention and Eviction
- Selective Token Retention and Eviction is a strategy that manages memory and computation by preserving critical tokens and discarding non-essential ones.
- It employs score-based, probabilistic, and graph-based methodologies to assess token importance and enforce dynamic retention policies.
- Empirical studies demonstrate improved efficiency and maintained accuracy, with significant memory reduction and throughput gains in various model architectures.
Selective token retention and eviction are core mechanisms for controlling computational cost, memory footprint, and information flow in large-scale sequence models, especially in contexts such as long-context language modeling, reasoning, memory-bounded inference, and model unlearning. At a high level, these techniques aim to (1) identify which tokens in an input sequence, an attention window, or a memory cache are most critical to downstream model behavior, and (2) retain, merge, or evict less relevant ones according to learned, statistical, or hybrid policies. This article surveys the foundational approaches, methodologies, empirical findings, and emerging challenges characterizing this field.
1. Formal Principles and Problem Settings
Selective retention and eviction are formulated to optimize the trade-off between model utility and resource efficiency. In transformer-based LLMs and diffusion LMs, each token is associated with hidden states, key-value (KV) cache slots, or state vector representations, whose unbounded growth with sequence length or decode steps can overwhelm hardware constraints.
The general formalization comprises:
- Importance Scoring: Each token is assigned a scalar score via a learned module, statistical proxy (e.g., attention mass), or an externally trained model. For example, TRIM-KV computes a per-layer, per-head retention score via a retention-gate MLP (Bui et al., 3 Dec 2025).
- Retention Policy: A rule or budget, such as top-k retention, probabilistic gating (Bernoulli or Hard-Concrete relaxation (Rafiuddin et al., 9 Oct 2025)), thresholding, or graph-based propagation (Li et al., 30 Aug 2025), determines which tokens are preserved or evicted.
- Eviction/Retention Action: Tokens below threshold or rank are either permanently evicted (removed from cache or computation graph), preserved at lower precision (Yang et al., 2024), merged (in similarity-guided fashion (Zhan et al., 2024)), or otherwise temporarily bypassed (e.g., interleaved sparsification (Jo et al., 3 Feb 2026)).
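The three-part formalization above (score, policy, action) can be sketched in a few lines. This is a minimal illustration, assuming precomputed scalar scores and a plain top-k budget with hard eviction; the function names `select_top_k` and `evict`, and the example scores, are hypothetical and do not come from any cited method.

```python
from typing import List, Sequence


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Retention policy: indices of the k highest-scoring tokens, in original order."""
    if k <= 0:
        return []
    # Rank by descending score; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


def evict(cache: List[str], keep: Sequence[int]) -> List[str]:
    """Eviction action: hard pruning, keeping only the selected cache entries."""
    return [cache[i] for i in keep]


# Importance scoring is assumed to have happened already (made-up values).
tokens = ["The", "key", "is", "42", ",", "okay", "?"]
scores = [0.10, 0.80, 0.05, 0.95, 0.01, 0.20, 0.30]

keep = select_top_k(scores, k=3)
pruned = evict(tokens, keep)
```

Returning the kept indices in original order matters in practice, because the surviving cache entries still need to line up with their positions in the sequence.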
Common use cases include:
- Limiting context or memory in inference (KV cache, attention window)
- Focusing unlearning loss on critical tokens to avoid collateral utility drop (Wan et al., 1 Jun 2025)
- Optimizing throughput in reasoning traces and chain-of-thought generations (Guo et al., 26 Jan 2026, Kim et al., 13 Apr 2026)
- Preventing catastrophic context loss or hallucinations by mixed-precision retention (Yang et al., 2024)
- Ensuring model knowledge is preserved while supporting dynamic or streaming input (Bui et al., 3 Dec 2025, Delena et al., 5 Feb 2025)
2. Methodologies for Token Selection and Scoring
Token selection methodologies fall into several families:
- Score-Based Selection: Per-token scalar scores are obtained from:
  - Gate networks (MLPs, linear layers): e.g., TRIM-KV's retention gate (Bui et al., 3 Dec 2025), Adaptive Retention's Bernoulli gate (Rafiuddin et al., 9 Oct 2025)
  - Attention mass: e.g., SAGE-KV's per-head softmax accumulation (Wang et al., 11 Mar 2025)
  - Difference in predictive probability relative to assistant models, as in Selective Unlearning (Wan et al., 1 Jun 2025)
  - Decision-critical impact (e.g., sum of answer attention for reasoning tokens in DynTS (Guo et al., 26 Jan 2026))
  - Local CNNs leveraging adjacent context in hybrid models (laLTE (He et al., 23 Oct 2025))
  - Pre-attention proxies (HashEvict's LSH-based cosine dissimilarity (Liu et al., 2024))
- Graph and Similarity Mechanisms:
  - GraphKV builds a graph whose nodes are tokens and whose edges are weighted by the cosine similarity of their keys, then applies decay propagation to penalize redundancy and anchor the decision set (Li et al., 30 Aug 2025).
  - Merge-prune intra-layer reduction for SSMs combines importance and cosine similarity between merged token pairs (Zhan et al., 2024).
- Specialized Heuristics:
  - Transactional Attention (TA) sponsors tokens adjacent to semantic anchors (e.g., "password:") to protect dormant but crucial content (Basu, 13 Apr 2026).
  - Role differentiation and consolidation in reasoning traces (CASK) maintain a core set of critical tokens and merge the rest (Kim et al., 13 Apr 2026).
- Probabilistic and Adaptive Approaches:
  - Structured Token Retention employs a retention head per token, followed by adaptive thresholding and hierarchical partitioning (CMP) governed by the variance and mean of gate outputs (Delena et al., 5 Feb 2025).
  - Layer-adaptive top-k gating with budget constraints, as in Adaptive Retention (Rafiuddin et al., 9 Oct 2025).
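Of the scoring families listed above, attention mass is the simplest to illustrate. The sketch below accumulates, for each key position, the softmax attention it receives across a set of queries. It assumes a single head, raw logits supplied as nested lists, and no causal masking; the helper names and the logit values are hypothetical, and the snippet is not the SAGE-KV procedure itself.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass(logits: List[List[float]]) -> List[float]:
    """Sum over queries of the attention probability each key receives.

    logits[q][k] is the raw attention logit from query q to key k.
    """
    mass = [0.0] * len(logits[0])
    for row in logits:
        for k, p in enumerate(softmax(row)):
            mass[k] += p
    return mass


# Three queries attending over four keys (made-up logits).
logits = [
    [2.0, 0.0, 0.0, 1.0],
    [2.0, 0.5, 0.0, 1.5],
    [1.5, 0.0, 0.2, 2.5],
]
mass = attention_mass(logits)
most_attended = max(range(len(mass)), key=mass.__getitem__)
least_attended = min(range(len(mass)), key=mass.__getitem__)
```

Because each softmax row sums to one, the total mass equals the number of queries, so the scores are directly comparable across keys within one head.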
3. Core Algorithms for Retention and Eviction
Implementation of retention and eviction integrates scoring, policy, and memory actions:
- Selection: Compute all token scores in batch, then select the retained set (e.g., by thresholding, top-k, or groupwise selection).
- Eviction:
  - Hard pruning: Remove evicted tokens from the KV cache or mask them out in the forward pass.
  - Mixed-precision retention: Important tokens are kept at FP16, while less critical entries are quantized to INT2/3/4, with channel balancing for quantization error control (Yang et al., 2024).
  - Merge-prune/folding: For SSMs and reasoning traces, low-importance tokens are merged with high-similarity partners, preserving total mass and averaging vectors, to further condense the representation (Zhan et al., 2024, Kim et al., 13 Apr 2026).
- Temporal Retention: Decay-based eviction (e.g., TRIM-KV's exponential time decay), local windowing, or a recency tail as in hybrid policies (Bui et al., 3 Dec 2025, He et al., 23 Oct 2025).
- No Permanent Eviction Policies: Dynamic per-layer selection without cache pruning (TSA (Jo et al., 3 Feb 2026)), allowing information to reenter selection at later stages.
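The merge-prune/folding action listed above can be sketched as follows. This is one possible reading, assuming that "preserving mass" means tracking how many tokens were folded into each surviving slot and that merged vectors are plain averages; `merge_low_importance` and the toy vectors are hypothetical and do not reproduce the cited implementations.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_low_importance(
    vectors: List[Vector], scores: List[float], k: int
) -> Tuple[List[Vector], List[float]]:
    """Keep the k top-scoring tokens; fold every other token into its most similar kept one.

    Returns the averaged vectors and, per slot, the number of tokens it represents.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    kept = sorted(order[:k])
    sums = [list(vectors[i]) for i in kept]
    mass = [1.0] * len(kept)
    for i in sorted(order[k:]):
        # Similarity is measured against the original (unmerged) kept vectors.
        j = max(range(len(kept)), key=lambda s: cosine(vectors[i], vectors[kept[s]]))
        sums[j] = [a + b for a, b in zip(sums[j], vectors[i])]
        mass[j] += 1.0
    merged = [[x / m for x in s] for s, m in zip(sums, mass)]
    return merged, mass


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.2, 0.8, 0.1]
merged, mass = merge_low_importance(vectors, scores, k=2)
```

Keeping the per-slot mass lets a downstream consumer weight a merged entry by the number of tokens it stands for, rather than treating it as a single token.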
The following table summarizes representative algorithmic motifs:
| Method | Scoring Principle | Eviction/Retention |
|---|---|---|
| TRIM-KV (Bui et al., 3 Dec 2025) | Gate-MLP, time-decay | Evict below threshold |
| SAGE-KV (Wang et al., 11 Mar 2025) | Per-head attention | Head-level top-k |
| SU (Wan et al., 1 Jun 2025) | Difference in LM probabilities | Token-based window |
| HashEvict (Liu et al., 2024) | LSH/Hamming to query | Pre-attention drop |
| GraphKV (Li et al., 30 Aug 2025) | Static + signal propagation | Refinement in static |
| CASK (Kim et al., 13 Apr 2026) | Role/core + clustering | Prefix prune + merge |
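As an illustration of the "gate score with time decay, evict below threshold" motif in the table, the sketch below applies a generic exponential decay. It assumes each token carries a retention rate in (0, 1] and that its strength after `age` steps is the rate raised to that power; this form, the names `retention_strength` and `surviving_positions`, and the numbers are assumptions for illustration, not the formula from the TRIM-KV paper.

```python
from typing import List


def retention_strength(beta: float, age: int) -> float:
    """Exponentially decayed strength of a token written `age` steps ago."""
    return beta ** age


def surviving_positions(betas: List[float], now: int, threshold: float) -> List[int]:
    """Positions whose decayed strength is still at or above the threshold.

    betas[i] is the (hypothetical) retention rate of the token written at step i.
    """
    return [
        i for i, beta in enumerate(betas)
        if retention_strength(beta, now - i) >= threshold
    ]


betas = [0.99, 0.50, 0.90, 0.60, 0.95]
kept_early = surviving_positions(betas, now=5, threshold=0.3)
kept_later = surviving_positions(betas, now=8, threshold=0.3)
```

With these values, the token at position 1 decays quickly and is gone by step 5, while the token at position 3 survives step 5 but falls below the threshold by step 8. A high retention rate therefore acts like a learned "keep this for a long time" signal.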
4. Empirical Performance and Trade-Offs
Quantitative evaluations report that selective retention mechanisms substantially reduce computation and memory with limited impact on answer correctness, utility, or recall:
- TRIM-KV, at a 1024-token budget, achieves 44.8% Pass@1 on AIME24 versus 18.7% for SnapKV, and even surpasses full-cache accuracy at a 4096-token budget (Bui et al., 3 Dec 2025).
- In SAGE-KV, a one-time top-k selection achieves higher memory efficiency than StreamLLM and Quest with equivalent or improved accuracy on LongBench (Wang et al., 11 Mar 2025).
- Sparse-dLLM yields throughput improvements for diffusion LLMs with no more than a 0.5% drop in accuracy, leveraging cross-layer and temporal stability in attention to safely reduce token retention (Song et al., 4 Aug 2025).
- Structured Token Retention with CMP improves token survival rates by up to 30–40 pp over baseline, reducing error propagation and memory by 15–20% (Delena et al., 5 Feb 2025).
- Adaptive Retention retains 95%+ of full-model accuracy using only 30–50% of tokens, cutting peak memory by roughly 35–45% and providing a speedup over dense models (Rafiuddin et al., 9 Oct 2025).
- In PromptDistill, query-based intermediate-layer retention improves efficiency and gives up to a 5% accuracy improvement over fixed-window approaches without retraining (Jin et al., 30 Mar 2025).
- In PromptDistill, query-based intermediate-layer retention improves efficiency and gives up to 5% accuracy improvement over fixed-window approaches without retraining (Jin et al., 30 Mar 2025).
Qualitative analysis shows that retention policies often recover emergent patterns such as attention sinks, sliding windows, and gist compression present in hand-crafted heuristics (Bui et al., 3 Dec 2025). In rare scenarios with extremely diffuse information (e.g., long tables), hard eviction can lose critical details (Yang et al., 2024).
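To make the memory side of these trade-offs concrete, the standard KV-cache size estimate can be worked through directly: two tensors (keys and values) per layer, each holding one `head_dim`-sized vector per KV head per token. The model configuration below is hypothetical and chosen only so the arithmetic is easy to follow; it is not taken from any of the cited evaluations.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_elem: int = 2,  # FP16
) -> int:
    """Approximate KV-cache size for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128.
full = kv_cache_bytes(32, 8, 128, n_tokens=32_768)
budgeted = kv_cache_bytes(32, 8, 128, n_tokens=4_096)

full_gib = full / 2**30
budgeted_mib = budgeted / 2**20
reduction = full / budgeted
```

Under these assumptions, a full 32,768-token cache takes 4 GiB per sequence, while a 4,096-token retention budget takes 512 MiB, an 8x reduction that scales linearly with the budget.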
5. Special Considerations: Reasoning, Unlearning, and Robustness
Selective token retention and eviction play pivotal roles in:
- Reasoning Traces: In multi-step logical or mathematical reasoning, only a subset of tokens (the "decision-critical" or "core" tokens) steer the final answer, while many substeps are redundant. Methods such as DynTS (Guo et al., 26 Jan 2026) and CASK (Kim et al., 13 Apr 2026) identify and protect this core, enable groupwise folding for redundancy, and prevent collapse of inferential chains.
- Unlearning and Privacy: Selective Unlearning (SU (Wan et al., 1 Jun 2025)) demonstrates that only a small subset of tokens in forget requests truly distinguishes unwanted from retained knowledge; restricting the forgetting loss to these minimizes collateral degradation.
- Dormant Tokens and Semantic Signals: Transactional Attention (TA (Basu, 13 Apr 2026)) addresses the "dormant token" failure mode, in which tokens with low attention mass (e.g., credentials) are nonetheless critical to retrieval. It protects them with anchor-sponsorship masks, since such tokens cannot be identified by statistical scoring alone.
- Mixed-Precision Retention and Safety: MiKV (Yang et al., 2024) shows that mixed-precision retention, wherein important tokens remain at high precision and evicted tokens are quantized, can avoid the severe quality and safety degradation observed from pure eviction policies.
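The mixed-precision idea in the last item can be sketched with a simple per-vector uniform quantizer. This is a minimal illustration, assuming min-max scaling per vector and simulated (quantize then dequantize) storage; it omits the channel balancing used by MiKV, and the names `quantize_dequantize` and `mixed_precision_cache` are hypothetical.

```python
from typing import List

Vector = List[float]


def quantize_dequantize(vec: Vector, bits: int) -> Vector:
    """Simulate uniform min-max quantization of one vector to `bits` bits."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)  # constant vector: nothing to quantize
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in vec]


def mixed_precision_cache(
    vectors: List[Vector], scores: List[float], k: int, bits: int = 2
) -> List[Vector]:
    """Keep the k top-scoring vectors untouched; store the rest at low precision."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    important = set(order[:k])
    return [
        list(v) if i in important else quantize_dequantize(v, bits)
        for i, v in enumerate(vectors)
    ]


vectors = [
    [0.12, -0.40, 0.33, 0.91],
    [0.05, 0.10, -0.20, 0.60],
    [-0.70, 0.25, 0.48, 0.02],
]
scores = [0.9, 0.1, 0.4]
cache = mixed_precision_cache(vectors, scores, k=1, bits=2)
```

Unlike hard eviction, every token remains addressable. The low-importance entries carry a bounded rounding error (at most half a quantization step per element) instead of disappearing.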
6. Limitations, Pathologies, and Outlook
Current literature highlights important limitations:
- Over-eviction Risk: Hyperparameter mis-tuning may remove semantically vital tokens, especially if importance metrics overfit to frequency or training domain (Delena et al., 5 Feb 2025).
- Positional Coherence: Non-contiguous eviction strategies, especially when the context window approaches the model's architectural limit, can scramble position encodings (e.g., RoPE), causing degenerate outputs despite high retention ratios (Poudel, 23 Oct 2025). Contiguity or structured windowing (gist blocks) is preferable near the limit.
- Scorer Stagnation: Excessively sophisticated scoring functions rarely outperform simple policies beyond a moderate threshold; structured consolidation and explicit core/scratch decomposition (as in CASK (Kim et al., 13 Apr 2026)) yield larger fidelity gains.
- Compatibility and Extension: Efficient implementation requires per-head, per-layer adaptation and alignment with fused attention kernels. Mixed-precision and grouping introduce new kernel and API requirements (Yang et al., 2024).
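The contrast between contiguous and scattered retention in the positional-coherence item can be made concrete with a small sketch. It assumes a structured policy that keeps a few initial "sink" positions plus a recent window, and it uses the number of breaks between kept positions as a rough, hypothetical proxy for positional disruption; neither the helper names nor this proxy come from the cited work.

```python
from typing import List


def sink_plus_recent(n_tokens: int, n_sink: int, window: int) -> List[int]:
    """Keep the first n_sink positions and the last `window` positions."""
    head = list(range(min(n_sink, n_tokens)))
    tail_start = max(n_tokens - window, len(head))
    return head + list(range(tail_start, n_tokens))


def count_gaps(kept: List[int]) -> int:
    """Number of places where consecutive kept positions are not adjacent."""
    return sum(1 for a, b in zip(kept, kept[1:]) if b != a + 1)


structured = sink_plus_recent(n_tokens=20, n_sink=2, window=6)
scattered = [0, 3, 7, 8, 12, 15, 18, 19]  # same budget, e.g. from a pure top-k policy

structured_gaps = count_gaps(structured)
scattered_gaps = count_gaps(scattered)
```

Both selections keep eight tokens, but the structured one has a single discontinuity while the scattered one has five, which is the kind of difference the contiguity recommendation is aimed at.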
Ongoing research explores dynamic, composable policies; hybrid attention-statistics–semantic sponsorship; joint pretraining of retention scores; and adaptation to multimodal, streaming, or online unlearning scenarios.
7. Comparative Summary of Key Techniques
| Approach | Main Mechanism | Distinctive Features | Principal Limitation | Papers |
|---|---|---|---|---|
| SU | Targeted unlearning loss | Token-level selection, n-gram/LLM assist | Overhead for assistant training | (Wan et al., 1 Jun 2025) |
| TRIM-KV | Retention-gate per head | Exponential time-decay, distillation | Per-head granularity, unlearned joint training | (Bui et al., 3 Dec 2025) |
| GraphKV | Graph + decay propagation | Penalizes redundancy, plug compatible | Graph formation cost | (Li et al., 30 Aug 2025) |
| SAGE-KV | Self-attn head scoring | One-time, per-head top-k | Irreversible after prefill, ignores recency | (Wang et al., 11 Mar 2025) |
| Structured Retention | Sigmoid head + adapt. thresh. | Probabilistic, multi-partition CMP | Overhead at small scale, risk of overfit | (Delena et al., 5 Feb 2025) |
| PromptDistill | Last-token–to–all-query | Retain intermediate cache, multi-stage | Requires empirical choice of selection depths | (Jin et al., 30 Mar 2025) |
| CASK | Core/scratch decomposition | Fold redundant regions, prefix slack | Oracle mass for core selection, tuning group size | (Kim et al., 13 Apr 2026) |
| Transactional Attn | Structural anchor sponsors | Defends dormant/credential tokens | Manual anchor design, not general to all tasks | (Basu, 13 Apr 2026) |
| MiKV | Mixed-precision retention | Retained in INT2/3/4, high-rank in FP16 | May lose rare details if both quantized/evicted | (Yang et al., 2024) |
The field of selective token retention and eviction thus spans a spectrum of formally principled and carefully engineered solutions. Together, they enable long-context and memory-bounded models to scale efficiently, maintain fidelity, and perform targeted unlearning or privacy repair, while preserving robustness and quality across diverse use cases and workloads.