Selective Token Retention and Eviction

Updated 24 April 2026
  • Selective Token Retention and Eviction is a strategy that manages memory and computation by preserving critical tokens and discarding non-essential ones.
  • It employs score-based, probabilistic, and graph-based methodologies to assess token importance and enforce dynamic retention policies.
  • Empirical studies demonstrate improved efficiency and maintained accuracy, with significant memory reduction and throughput gains in various model architectures.

Selective token retention and eviction are core mechanisms for controlling computational cost, memory footprint, and information flow in large-scale sequence models, especially in contexts such as long-context language modeling, reasoning, memory-bounded inference, and model unlearning. At a high level, these techniques aim to (1) identify which tokens in an input sequence, an attention window, or a memory cache are most critical to downstream model behavior, and (2) retain, merge, or evict less relevant ones according to learned, statistical, or hybrid policies. This article surveys the foundational approaches, methodologies, empirical findings, and emerging challenges characterizing this field.

1. Formal Principles and Problem Settings

Selective retention and eviction are formulated to optimize the trade-off between model utility and resource efficiency. In transformer-based LLMs and diffusion LMs, each token is associated with hidden states, key-value (KV) cache slots, or state vector representations, whose unbounded growth with sequence length or decode steps can overwhelm hardware constraints.
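
To make the growth concrete, the following is a minimal back-of-the-envelope sketch of KV cache size as a function of sequence length. The function name `kv_cache_bytes` and the model configuration (32 layers, 8 KV heads, head dimension 128, FP16 values) are hypothetical illustration values, not taken from any of the cited papers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading factor 2 covers one key vector plus one value vector per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical configuration, chosen only to keep the arithmetic readable.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
capped = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(full / 2**30, "GiB without eviction")      # 4.0 GiB
print(capped / 2**30, "GiB with a 4096 budget")  # 0.5 GiB
```

Because the cost is linear in the number of cached positions, capping the cache at a fixed token budget bounds memory regardless of how long decoding runs.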

The general formalization comprises:

  • Importance Scoring: Each token $t_i$ is assigned a scalar score $s_i$ via a learned module, statistical proxy (e.g., attention mass), or an externally trained model. For example, TRIM-KV computes a per-layer, per-head retention score $\beta_{t,\ell,h}$ via a retention-gate MLP (Bui et al., 3 Dec 2025).
  • Retention Policy: A rule or budget $B$, such as top-$k$ retention, probabilistic gating (Bernoulli or Hard-Concrete relaxation (Rafiuddin et al., 9 Oct 2025)), thresholding, or graph-based propagation (Li et al., 30 Aug 2025), determines which tokens are preserved or evicted.
  • Eviction/Retention Action: Tokens below threshold or rank are either permanently evicted (removed from cache or computation graph), preserved at lower precision (Yang et al., 2024), merged (in similarity-guided fashion (Zhan et al., 2024)), or otherwise temporarily bypassed (e.g., interleaved sparsification (Jo et al., 3 Feb 2026)).
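
The three components above can be sketched as a tiny pipeline: given scores, apply a top-$k$ budget, then hard-evict everything else. This is an illustrative sketch only; the helper names `select_top_k` and `evict`, the example scores, and the string placeholders standing in for cache entries are all assumptions, not any cited method's implementation.

```python
def select_top_k(scores, budget):
    """Return the sorted indices of the `budget` highest-scoring tokens."""
    if budget <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the winners by position so the cache keeps its original order.
    return sorted(ranked[:budget])


def evict(cache, keep):
    """Hard eviction: keep only the cache entries at the retained indices."""
    return [cache[i] for i in keep]


# Hypothetical importance scores s_i and placeholder cache entries.
scores = [0.9, 0.1, 0.4, 0.05, 0.7]
cache = ["k0", "k1", "k2", "k3", "k4"]

keep = select_top_k(scores, budget=3)
print(keep)                # [0, 2, 4]
print(evict(cache, keep))  # ['k0', 'k2', 'k4']
```

Swapping `select_top_k` for a thresholding or probabilistic rule changes the policy without touching the scoring or eviction steps, which is why the three components are usually described separately.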

Common use cases include:

2. Methodologies for Token Selection and Scoring

Token selection methodologies fall into several families:

  • Score-Based Selection: Per-token scalar scores are obtained from:
    • Gate networks (MLPs, linear layers): e.g., TRIM-KV's $\beta_{t,\ell,h}$ (Bui et al., 3 Dec 2025), Adaptive Retention’s Bernoulli gate $g_i^{(\ell)}\sim\mathrm{Bernoulli}(p_i^{(\ell)})$ (Rafiuddin et al., 9 Oct 2025)
    • Attention mass: e.g., SAGE-KV’s per-head softmax accumulation (Wang et al., 11 Mar 2025)
    • Difference of predictive probability from assistant models (as in Selective Unlearning (Wan et al., 1 Jun 2025), where $s_i = |p_{\theta^1}(t_i \mid t_{<i}) - p_{\theta^2}(t_i \mid t_{<i})|$)
    • Decision-critical impact (e.g., sum of answer attention for reasoning tokens in DynTS (Guo et al., 26 Jan 2026))
    • Local CNNs leveraging adjacent context in hybrid models (laLTE (He et al., 23 Oct 2025))
    • Pre-attention proxies (HashEvict's LSH-based cosine dissimilarity (Liu et al., 2024))
  • Graph and Similarity Mechanisms:
    • GraphKV builds a graph whose nodes are tokens and whose edges are weighted by similarity (cosine of keys), then applies decay propagation to penalize redundancy and anchor the decision set (Li et al., 30 Aug 2025).
    • Merge-prune intra-layer reduction for SSMs combines importance and cosine similarity between merged token pairs (Zhan et al., 2024).
  • Specialized Heuristics:
    • Transactional Attention (TA) sponsors tokens adjacent to semantic anchors (e.g., "password:") to protect dormant but crucial content (Basu, 13 Apr 2026).
    • Role differentiation and consolidation in reasoning traces (CASK) maintain a core set of critical tokens and merge the rest (Kim et al., 13 Apr 2026).
  • Probabilistic and Adaptive Approaches:
    • Structured Token Retention employs a retention head per token, followed by adaptive thresholding and hierarchical partitioning (CMP) governed by the variance and mean of gate outputs (Delena et al., 5 Feb 2025).
    • Layer-adaptive top-$M$ gating with budget constraints, as in Adaptive Retention (Rafiuddin et al., 9 Oct 2025).
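
As one concrete instance of score-based selection, the probability-difference score $s_i = |p_{\theta^1}(t_i \mid t_{<i}) - p_{\theta^2}(t_i \mid t_{<i})|$ can be sketched as follows. The per-token probabilities are made-up numbers, and the thresholding rule in `select_above` is an assumption used here for illustration; it is not claimed to be the selection rule of Selective Unlearning.

```python
def disagreement_scores(p_first, p_second):
    """s_i = |p1(t_i | t_<i) - p2(t_i | t_<i)| for each token position i."""
    if len(p_first) != len(p_second):
        raise ValueError("both models must score the same token sequence")
    return [abs(a - b) for a, b in zip(p_first, p_second)]


def select_above(scores, threshold):
    """Indices whose score exceeds the threshold (illustrative policy)."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical probabilities that two assistant models assign to each token.
p_model_1 = [0.90, 0.20, 0.85, 0.50]
p_model_2 = [0.88, 0.70, 0.10, 0.45]

scores = disagreement_scores(p_model_1, p_model_2)
print(select_above(scores, threshold=0.3))  # [1, 2]
```

Only the positions where the two models disagree strongly are selected; tokens that both models predict with similar confidence receive scores near zero.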

3. Core Algorithms for Retention and Eviction

Implementation of retention and eviction integrates scoring, policy, and memory actions:

  • Selection: Compute all scores $s_i$ in batch, then select the retained set (e.g., by thresholding, top-$k$, or groupwise selection).
  • Eviction:
    • Hard pruning: Remove evicted tokens from KV cache or mask out in forward pass.
    • Mixed-precision retention: Important tokens kept at FP16, less critical entries quantized to INT2/3/4, with channel balancing for quantization error control (Yang et al., 2024).
    • Merge-prune/folding: For SSMs and reasoning traces, low-importance tokens are merged with high-similarity partners, preserving mass and averaging vectors, to further condense the representation (Zhan et al., 2024, Kim et al., 13 Apr 2026).
  • Temporal Retention: Decay-based eviction (e.g., TRIM-KV's exponential time-decay), local windowing, or a recency tail as in hybrid policies (Bui et al., 3 Dec 2025, He et al., 23 Oct 2025).
  • No Permanent Eviction Policies: Dynamic per-layer selection without cache pruning (TSA (Jo et al., 3 Feb 2026)), allowing information to reenter selection at later stages.
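
The merge-prune idea can be sketched in a few lines: keep the highest-scoring tokens, then fold each remaining token into its most similar kept token by averaging. This is a simplified illustration under stated assumptions: the function `merge_prune`, the unweighted mean, and the 2-dimensional toy vectors are hypothetical and do not reproduce the exact procedures of the cited papers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_prune(vectors, scores, budget):
    """Keep the top-`budget` tokens and fold the rest into similar survivors."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:budget])
    groups = {i: [vectors[i]] for i in keep}
    for j in order[budget:]:
        # Assign each low-importance token to its most similar kept token.
        partner = max(keep, key=lambda i: cosine(vectors[i], vectors[j]))
        groups[partner].append(vectors[j])
    merged = []
    for i in keep:
        members = groups[i]
        # Unweighted mean of the kept vector and everything folded into it.
        merged.append([sum(m[d] for m in members) / len(members)
                       for d in range(len(vectors[i]))])
    return keep, merged


# Toy 2-d "token states" with hypothetical importance scores.
vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.2, 0.8, 0.1]
keep, merged = merge_prune(vectors, scores, budget=2)
print(keep)    # [0, 2]
print(merged)  # approximately [[0.9, 0.1], [0.05, 0.95]]
```

Compared with hard pruning, the evicted tokens still influence the retained representation, at the cost of one similarity search per merged token.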

The following table summarizes representative algorithmic motifs:

| Method | Scoring Principle | Eviction/Retention |
| --- | --- | --- |
| TRIM-KV (Bui et al., 3 Dec 2025) | Gate-MLP, time-decay | Evict below threshold |
| SAGE-KV (Wang et al., 11 Mar 2025) | Per-head attention | Head-level top-$k$ |
| SU (Wan et al., 1 Jun 2025) | Difference in LM probabilities | Token-based window |
| HashEvict (Liu et al., 2024) | LSH/Hamming to query | Pre-attention drop |
| GraphKV (Li et al., 30 Aug 2025) | Static + signal propagation | Refinement in static |
| CASK (Kim et al., 13 Apr 2026) | Role/core + clustering | Prefix prune + merge |

4. Empirical Performance and Trade-Offs

Reported quantitative evaluations indicate that selective retention mechanisms can substantially reduce computation and memory with limited impact on answer correctness, utility, or recall:

  • TRIM-KV, at 1024 token budget, achieves 44.8% Pass@1 for AIME24 vs. SnapKV 18.7% and even surpasses full-cache accuracy at 4096 budget (Bui et al., 3 Dec 2025).
  • In SAGE-KV, a one-time top-$k$ selection achieves higher memory efficiency than both StreamLLM and Quest, with equivalent or improved accuracy on LongBench (Wang et al., 11 Mar 2025).
  • Sparse-dLLM yields throughput improvements for diffusion LLMs with no more than a 0.5% drop in accuracy, leveraging cross-layer and temporal stability in attention to safely reduce token retention (Song et al., 4 Aug 2025).
  • Structured Token Retention with CMP improves token survival rates by up to 30–40 percentage points over baseline, reducing error propagation and memory by 15–20% (Delena et al., 5 Feb 2025).
  • Adaptive Retention retains 95%+ of full-model accuracy using only 30–50% of tokens, cutting peak memory by 35–45% and providing a speedup over dense models (Rafiuddin et al., 9 Oct 2025).
  • In PromptDistill, query-based intermediate-layer retention improves efficiency and gives up to 5% accuracy improvement over fixed-window approaches without retraining (Jin et al., 30 Mar 2025).

Qualitative analysis shows that learned retention policies often rediscover patterns that hand-crafted heuristics encode explicitly, such as attention sinks, sliding windows, and gist compression (Bui et al., 3 Dec 2025). In rare scenarios with extremely diffuse information (e.g., long tables), hard eviction can lose critical details (Yang et al., 2024).

5. Special Considerations: Reasoning, Unlearning, and Robustness

Selective token retention and eviction play pivotal roles in:

  • Reasoning Traces: In multi-step logical or mathematical reasoning, only a subset of tokens (the "decision-critical" or "core" tokens) steers the final answer, while many substeps are redundant. Methods such as DynTS (Guo et al., 26 Jan 2026) and CASK (Kim et al., 13 Apr 2026) identify and protect this core, enable groupwise folding of redundant tokens, and prevent collapse of inferential chains.
  • Unlearning and Privacy: Selective Unlearning (SU (Wan et al., 1 Jun 2025)) demonstrates that only a small subset of tokens in forget requests truly distinguishes unwanted from retained knowledge; restricting the forgetting loss to these minimizes collateral degradation.
  • Dormant Tokens and Semantic Signals: Transactional Attention (TA (Basu, 13 Apr 2026)) addresses the "dormant token" failure mode, in which tokens with low attention mass (e.g., credentials) are nonetheless critical to retrieval. It applies anchor-sponsorship masks to protect content that statistical scoring alone would not capture.
  • Mixed-Precision Retention and Safety: MiKV (Yang et al., 2024) shows that mixed-precision retention, wherein important tokens remain at high precision and evicted tokens are quantized, can avoid the severe quality and safety degradation observed from pure eviction policies.
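
The dormant-token problem can be illustrated with a small sketch in which tokens that follow a semantic anchor are protected before the score-based budget is applied. Everything here is hypothetical: the function names, the `span` parameter, the toy tokens, and the scores are illustration values, and the sketch is a simplified reading of anchor sponsorship rather than the Transactional Attention implementation.

```python
def sponsored_indices(tokens, anchors, span=1):
    """Indices of the `span` tokens that directly follow any anchor token."""
    protected = set()
    for i, tok in enumerate(tokens):
        if tok in anchors:
            protected.update(range(i + 1, min(i + 1 + span, len(tokens))))
    return protected


def retain_with_sponsorship(tokens, scores, budget, anchors, span=1):
    """Always keep sponsored tokens, then fill the remaining budget by score."""
    protected = sponsored_indices(tokens, anchors, span)
    rest = sorted((i for i in range(len(tokens)) if i not in protected),
                  key=lambda i: scores[i], reverse=True)
    # If sponsored tokens alone exceed the budget, they are still all kept.
    free = max(budget - len(protected), 0)
    return sorted(protected | set(rest[:free]))


# Toy sequence: the credential has almost no attention mass (score 0.01).
tokens = ["the", "password:", "hunter2", "is", "stored", "safely"]
scores = [0.30, 0.25, 0.01, 0.05, 0.40, 0.20]

plain = sorted(sorted(range(len(tokens)),
                      key=lambda i: scores[i], reverse=True)[:3])
print(plain)  # [0, 1, 4] -> the credential at index 2 is evicted
print(retain_with_sponsorship(tokens, scores, 3, {"password:"}))  # [0, 2, 4]
```

With plain top-3 selection the low-scoring credential is dropped, while the sponsored variant spends one budget slot to keep it.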

6. Limitations, Pathologies, and Outlook

Current literature highlights important limitations:

  • Over-eviction Risk: Hyperparameter mis-tuning may remove semantically vital tokens, especially if importance metrics overfit to frequency or training domain (Delena et al., 5 Feb 2025).
  • Positional Coherence: Non-contiguous eviction strategies, especially when the context window approaches the model's architectural limit, can scramble position encodings (e.g., RoPE), causing degenerate outputs despite high retention ratios (Poudel, 23 Oct 2025). Contiguity or structured windowing (gist blocks) is preferable near that limit.
  • Scorer Stagnation: Excessively sophisticated scoring functions rarely outperform simple policies beyond a moderate threshold; structured consolidation and explicit core/scratch decomposition (as in CASK (Kim et al., 13 Apr 2026)) yield larger fidelity gains.
  • Compatibility and Extension: Efficient implementation requires per-head, per-layer adaptation and alignment with fused attention kernels. Mixed-precision and grouping introduce new kernel and API requirements (Yang et al., 2024).
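
The positional-coherence concern above can be made tangible by comparing a contiguous policy (a few initial "sink" positions plus a recency window) with a scattered retention set. This is a rough sketch: the helper names, the `num_sink` default, and the use of "number of contiguous runs" as a fragmentation proxy are assumptions made for illustration, not a metric taken from the cited work.

```python
def sink_and_recent(seq_len, budget, num_sink=4):
    """Keep the first `num_sink` positions plus the most recent ones."""
    if budget >= seq_len:
        return list(range(seq_len))
    sink = min(num_sink, budget)
    recent = budget - sink
    return list(range(sink)) + list(range(seq_len - recent, seq_len))


def contiguous_runs(indices):
    """Count maximal runs of consecutive positions in a sorted index list."""
    runs = 0
    prev = None
    for i in indices:
        if prev is None or i != prev + 1:
            runs += 1
        prev = i
    return runs


windowed = sink_and_recent(seq_len=16, budget=8, num_sink=2)
scattered = [0, 3, 5, 8, 9, 12, 14, 15]  # hypothetical score-based selection

print(windowed)                    # [0, 1, 10, 11, 12, 13, 14, 15]
print(contiguous_runs(windowed))   # 2
print(contiguous_runs(scattered))  # 6
```

Both sets retain 8 of 16 positions, but the windowed set has only two position gaps to bridge, whereas the scattered set breaks the sequence into six fragments.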

Ongoing research explores dynamic, composable policies; hybrid attention-statistics–semantic sponsorship; joint pretraining of retention scores; and adaptation to multimodal, streaming, or online unlearning scenarios.

7. Comparative Summary of Key Techniques

| Approach | Main Mechanism | Distinctive Features | Principal Limitation | Papers |
| --- | --- | --- | --- | --- |
| SU | Targeted unlearning loss | Token-level selection, n-gram/LLM assist | Overhead for assistant training | (Wan et al., 1 Jun 2025) |
| TRIM-KV | Retention-gate per head | Exponential time-decay, distillation | Per-head granularity, unlearned joint training | (Bui et al., 3 Dec 2025) |
| GraphKV | Graph + decay propagation | Penalizes redundancy, plug compatible | Graph formation cost | (Li et al., 30 Aug 2025) |
| SAGE-KV | Self-attn head scoring | One-time, per-head top-$k$ | Irreversible after prefill, ignores recency | (Wang et al., 11 Mar 2025) |
| Structured Retention | Sigmoid head + adapt. thresh. | Probabilistic, multi-partition CMP | Overhead in small-scale settings, risk of overfit | (Delena et al., 5 Feb 2025) |
| PromptDistill | Last-token–to–all-query | Retain intermediate cache, multi-stage | Requires empirical choice of selection depths | (Jin et al., 30 Mar 2025) |
| CASK | Core/scratch decomposition | Fold redundant regions, prefix slack | Oracle mass for core selection, tuning group size | (Kim et al., 13 Apr 2026) |
| Transactional Attn | Structural anchor sponsors | Defends dormant/credential tokens | Manual anchor design, not general to all tasks | (Basu, 13 Apr 2026) |
| MiKV | Mixed-precision retention | Retained in INT2/3/4, high-rank in FP16 | May lose rare details if both quantized/evicted | (Yang et al., 2024) |

The field of selective token retention and eviction thus spans a spectrum of formally principled and carefully engineered solutions. These enable long-context and memory-bounded models to scale efficiently, maintain fidelity, and perform targeted unlearning or privacy repair while preserving robustness and quality across diverse workloads.
