Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
Abstract: Under modern test-time compute and agentic paradigms, LLMs process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written to the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of $3\times$ to $10\times$, with longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss or of performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that we can use to guide the design of hybrid local-global attention architectures.
Explain it Like I'm 14
What this paper is about (big picture)
This paper tackles a practical problem in LLMs: when they read or write very long pieces of text, they slow down and use a lot of memory because they keep “notes” about every word they’ve seen. These notes are called the key–value (KV) cache. The authors introduce a smarter way, called Self-Pruned Key-Value Attention (SP-KV), that teaches the model to decide which notes are worth keeping for later and which ones it can safely forget. The goal is to save memory and speed up generation without hurting the model’s quality.
What questions the paper asks
The paper focuses on simple but important questions:
- Do we really need to save a note for every single word the model has seen?
- Can the model learn to predict which past notes it will actually need in the future?
- If we keep only the useful notes, can we keep the same quality while using much less memory and running faster?
How the method works (in everyday terms)
Think of the model as a student working on a long assignment:
- Short-term notes vs. long-term notes:
- The model always keeps a small “recent window” of notes (for example, the last 128 words) in short-term memory. This covers local details, like finishing a sentence.
- For older parts, it only saves notes into a long-term “binder” if it thinks they’ll be useful later.
- A tiny helper that predicts usefulness:
- A small extra component (a lightweight predictor) looks at each note as it’s created and gives it a score between 0 and 1: “How useful will this be later?”
- If the score is high enough (above a threshold), the note goes into long-term storage. If not, it’s skipped. Recent notes are always kept, no matter the score.
- Training in two stages, like a dimmer switch turning into an on/off switch:
- Practice mode (soft gating): Instead of hard keep/throw-away decisions, the model uses a “dimmer” that slightly strengthens or weakens how much a note can be used. This keeps training smooth.
- Test mode (hard gating): Near the end of training, the dimmer turns into a clear on/off decision using a threshold. This matches how the system will run for real and lets the model finish learning under the exact rules it will use later.
- No special tricks needed:
- The main model and the usefulness predictor are trained together using the standard “predict the next word” objective. There’s no extra complicated loss to teach sparsity. The model naturally learns what to keep.
- Under the hood, but simply:
- The team built the method into fast attention code so that skipping unnecessary notes actually saves time and memory in practice, especially for long texts and batched decoding.
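The keep-or-skip rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `visible_positions`, the example scores, and the threshold value 0.5 are assumptions made here, while the always-visible local window (128 by default) and the "score above threshold" test come from the description above.

```python
def visible_positions(t, utilities, threshold=0.5, window=128):
    """Return the key positions s <= t that the query at position t may use.

    utilities[s] is the predicted usefulness (between 0 and 1) of the note
    written at position s. The threshold value is a hypothetical default.
    """
    visible = []
    for s in range(t + 1):  # causal: only current and past positions
        in_local_window = (t - s) < window            # recent notes are always kept
        written_to_cache = utilities[s] > threshold   # older notes need a high score
        if in_local_window or written_to_cache:
            visible.append(s)
    return visible


# Toy example with a tiny window so the effect is easy to see.
utilities = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.4]
print(visible_positions(7, utilities, threshold=0.5, window=2))  # prints [0, 3, 6, 7]
```

In the toy run, positions 6 and 7 are visible because they fall inside the window, while positions 0 and 3 survive only because their scores exceed the threshold.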
What they found and why it matters
Here are the main takeaways, explained simply:
- Big memory savings and speedups:
- The method often keeps only about 10% to 50% of the long-term notes, depending on the task—commonly around 30%.
- This shrinks the long-term KV cache by about 3x to 10x.
- In long-text, batched generation, decoding can be roughly 2.1x to 4.6x faster.
- Little to no quality loss:
- Across many standard benchmarks, performance stayed essentially the same (about -0.2% on average compared with the full model).
- On long-context tests, results stayed near baseline up to 16k tokens; at 32k tokens there was a small drop, likely because 32k was the longest length seen during training.
- On “find the needle in a haystack” style tasks, the model needed to keep only about 5%–7% of old notes and still found the right information.
- Adjustable trade-off:
- You can turn the threshold knob to keep more notes (higher quality) or fewer notes (more speed/memory savings). This makes it easy to adapt to different needs.
- Scales well:
- As model size and training compute grow, SP-KV follows the same performance trend as regular attention. In short, it keeps up without falling behind as things scale.
- Better than “after-the-fact” pruning:
- Methods that prune notes only at the end (without retraining the model to expect pruning) often hurt quality more. Because SP-KV learns during training, it avoids this mismatch and keeps quality high at similar memory savings.
- Helps design smarter architectures:
- By observing which parts of the model keep lots of notes, the authors can tell which layers or heads really need long-range attention. Using this signal, they design hybrid models (mixing local and global attention) that work better under the same memory budget.
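To make the memory numbers above concrete, here is a rough back-of-the-envelope sketch. Keeping a fraction p of the long-term notes shrinks that part of the cache by about 1/p, so 10% kept corresponds to 10x and roughly 33% kept to 3x. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical configuration chosen for illustration, not one reported in the paper, and a single average density is assumed across layers and heads.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a dense KV cache: one key and one value vector per token, layer, and KV head."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


def retained_tokens(seq_len, density, window=128):
    """Tokens kept per head: the full local window plus a fraction of older tokens."""
    local = min(seq_len, window)
    older = seq_len - local
    return local + density * older


def compression_factor(density):
    """Reduction of the long-term (non-local) cache when a fraction `density` is kept."""
    return 1.0 / density


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
dense = kv_cache_bytes(32768, **shape)
sparse = kv_cache_bytes(retained_tokens(32768, density=0.3), **shape)
print(f"dense: {dense / 2**30:.2f} GiB, sparse: {sparse / 2**30:.2f} GiB")
print(f"long-term cache reduction at 10% density: {compression_factor(0.1):.1f}x")
```

The total saving is slightly smaller than 1/p because the local window is always stored in full; the gap narrows as sequences get longer.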
Why this research matters (implications and impact)
- Practical efficiency for long work:
- LLMs used in long conversations, research assistants, coding agents, or tools that retrieve lots of context can run faster and fit longer inputs on the same hardware.
- Flexible and future-proof:
- Because SP-KV is learned together with the model, it can carry over into later training stages (like instruction tuning or reinforcement learning) and keep saving time and memory during long rollouts.
- Better model design:
- The method doubles as a “map” of where long-range memory really counts, guiding the creation of stronger, more efficient attention layouts.
- Limitations and next steps:
- Experiments used mostly English text and focused on pretraining-style evaluations. Testing in multilingual and specialized domains, and further optimizing the low-level code, are important next steps.
In short: SP-KV teaches LLMs to keep only the notes that matter. This makes them lighter and faster on long texts, while keeping their smarts intact—and it even shows us how to build better models in the future.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to be actionable for future research.
- Transferability of the learned sparse-write policy to multilingual corpora, code-heavy datasets, and domains with different long-context statistics (e.g., legal, biomedical, conversational logs).
- Behavior after post-training stages (supervised fine-tuning, instruction tuning, reinforcement learning): policy drift of gates, need for re-training or threshold recalibration, and impact on task-specific performance.
- Systems and hardware generality: end-to-end wall-clock gains on non-Hopper GPUs (A100/V100/consumer GPUs), multi-GPU setups, distributed inference/training, and different frameworks; requirements for dedicated sparse-KV kernels, scheduling, and memory layouts to consistently convert sparsity into throughput.
- Scaling beyond 8.1B to 30B–70B and larger models: stability, sparsity patterns, and performance trade-offs at frontier scales; interaction with increased head counts and deeper networks.
- Maximum context lengths beyond 32k: sparsity-performance frontier, failure modes, and tuning when pushing toward 64k–1M tokens and streaming/infinite-context regimes.
- Thresholding strategy: whether global per-model T is optimal versus per-layer, per-head, or per-sequence thresholds; methods for automatic threshold selection at inference to meet memory or latency budgets without degrading quality.
- Local window size (w=128) choice: sensitivity analyses, task-specific tuning, and learning w per layer/head; impact on tasks requiring medium-range dependencies that exceed the fixed window.
- Utility predictor architecture: exploration of alternatives (e.g., query-conditioned predictors, multi-layer/contextual predictors, convolutional/attention-based predictors), their cost-benefit, and interpretability of learned utility signals.
- Joint read-time and write-time sparsity: systematic evaluation of SP-KV combined with query-aware sparse-read methods (e.g., QUEST, RetrievalAttention) and whether synergies or conflicts emerge.
- Synergy with other KV reduction techniques (quantization like KIVI, token merging/Minicache, cross-layer compression): composability, cumulative gains, and optimal stacking order.
- Robustness under distribution shift and rare long-range dependencies: comprehensive failure analysis on tasks requiring retrieval of infrequent facts, long proofs, multi-hop reasoning, and complex code generation beyond benchmarks reported.
- Fair apples-to-apples comparisons with baselines: using the same pretrained base model and data mixture for SP-KV and post-hoc methods to isolate the sparsification mechanism’s effect across downstream tasks, not just LongPPL NLL.
- End-to-end latency and memory gains: measurements across batch sizes (1–128), sequence lengths (short to very long), densities (10–100%), and real serving pipelines including CPU I/O and network, with multi-tenant load patterns.
- Alignment, safety, and factuality impacts: whether gating disproportionately drops safety-relevant instructions, disclaimers, or factual context; audits and mitigation strategies.
- Kernel granularity: effects of 64-token block skipping versus finer-grain token-level skipping on throughput, cache locality, and implementation complexity.
- Training schedule sensitivity: dependence on cosine decay, gate bias initialization, and threshold-aware hard gating annealing; general recipes that avoid instability while achieving target sparsity.
- Compute-optimality shifts: whether SP-KV changes the optimal tokens-per-parameter ratio, scaling exponents, or learning curves relative to dense attention under matched budgets.
- Applicability to encoder–decoder and multimodal architectures: extending SP-KV to cross-attention, audio/vision modalities, and retrieval-augmented generation (RAG) where pruning might discard retrieved context.
- RAG-specific policies: mechanisms to preserve retrieved passages or tool outputs (e.g., “do-not-prune” annotations, sink tokens) and evaluate effects on retrieval faithfulness.
- Numerical stability in mixed precision: behavior of log(u) gating and atomic gradient accumulation under BF16/FP8, potential underflow/overflow, and reproducibility across libraries.
- Streaming and sessioned generation: gate behavior when contexts evolve over long sessions, cache resets, or partial cache reuse; policies to avoid thrashing (frequent open/close of gates).
- Stability of SP-KV gate statistics used for hybrid architecture search: whether head-wise density rankings remain stable across datasets, training phases, and domains; criteria for when to re-profile and reallocate global heads.
- Overhead accounting: precise measurement of the utility predictor’s parameter/memory footprint and kernel overhead; at what densities and sequence lengths does SP-KV overhead negate benefits.
- Per-layer/head budget control: evaluating dynamic budget allocation (learned or heuristic) instead of a single global threshold to improve sparsity–quality trade-offs.
- Rare token preservation: integrating mechanisms akin to sink tokens or key-token heuristics into SP-KV to guard against pruning of crucial but infrequent information.
- Evaluation beyond RULER: inclusion of ultra-long-context reasoning datasets, multi-document QA, long codebases, and longitudinal conversation benchmarks to test real-world long-memory demands.
- Extension to MoE/GQA/MQA/MHA variants: systematic study of SP-KV behavior across attention parametrizations (e.g., MoE experts, shared keys), with clarity on where sparsification is most effective.
- Language typology effects: performance on languages with rich morphology and long-distance syntactic dependencies (e.g., Turkish, Finnish), to verify whether learned gating preserves necessary long-range signals.
- Training-time efficiency in RL or long-context SFT: measured reductions in GPU memory traffic and wall-clock during long rollouts; whether SP-KV enables larger batch sizes or longer sequences in practice.
- Auxiliary losses or supervision for utility: whether weak supervision (e.g., retrieval markers, task labels) or contrastive objectives can improve gate calibration without hurting generalization.
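The kernel-granularity question in the list above contrasts 64-token block skipping with token-level skipping. The sketch below illustrates the block-level idea only: given per-token keep decisions for one head, a block can be skipped when none of its tokens is retained. The function names are hypothetical, the local window is ignored for simplicity, and the real kernels operate on GPU tiles rather than Python lists.

```python
def block_skip_mask(keep, block_size=64):
    """Return one flag per block: True if the block holds at least one retained key.

    keep is a list of booleans, one per key position, for a single head.
    Blocks flagged False could be skipped entirely (no load, no compute).
    """
    return [any(keep[i:i + block_size]) for i in range(0, len(keep), block_size)]


def skipped_fraction(keep, block_size=64):
    """Fraction of blocks that contain no retained key."""
    mask = block_skip_mask(keep, block_size)
    return mask.count(False) / len(mask)


# Two full blocks and one partial block; only the second holds a retained key.
keep = [False] * 64 + [False] * 63 + [True] + [False] * 10
print(block_skip_mask(keep))  # prints [False, True, False]
```

One retained token is enough to force a whole block to be loaded, which is why block-level savings can lag behind token-level density when retained keys are scattered.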
Practical Applications
Summary
The paper introduces Self-Pruned Key-Value Attention (SP-KV), a learned “sparse-write” mechanism for transformer LLMs that predicts which key–value (KV) pairs merit retention in the long-term cache, while keeping a fixed local window (e.g., 128 tokens). Trained end-to-end with next-token prediction and implemented with minimal kernel overhead (FlashAttention-3 on Hopper), SP-KV reduces non-local KV cache by ~3–10x and speeds up decoding (2.1–4.6x in memory-bound regimes) with negligible average degradation on standard benchmarks. It also exposes layer/head-specific sparsity patterns that can guide design of hybrid local–global attention architectures.
Below are concrete, real-world applications grouped by immediacy, with relevant sectors, tools/workflows, and key dependencies or assumptions.
Immediate Applications
- Efficient LLM serving and cost reduction
- Sectors: software, cloud/infra, finance, legal, customer support
- What: Deploy SP-KV–trained checkpoints to cut KV memory and bandwidth, increasing batch sizes and/or context lengths and lowering per-token latency on existing GPUs.
- Tools/workflows: Integrate SP-KV kernels into inference stacks (vLLM, TGI, TensorRT-LLM), add a runtime “sparsity knob” (threshold T) to meet SLOs.
- Dependencies/assumptions: Requires continued pretraining or access to SP-KV–adapted weights; best performance on GPUs with FA3-like kernels (Hopper+); minor task-dependent regressions must be monitored.
- Longer-context RAG and enterprise assistants
- Sectors: legal (contracts), finance (filings), healthcare (EMRs), education (courseware), research (literature)
- What: Support longer documents and larger retrieved contexts under fixed memory by pruning non-useful long-term KVs while preserving recent tokens; maintain quality up to ~66% sparsification.
- Tools/workflows: RAG pipelines with “always-keep” tags for critical citations/IDs; prompt templates that place anchors within the local window; dynamic T based on retrieval confidence.
- Dependencies/assumptions: Retrieval chunks should align with utility predictor; configure sinks/whitelists for must-keep tokens; validate on domain data.
- Higher-throughput multi-tenant serving
- Sectors: cloud platforms, model API providers
- What: Increase concurrent sessions and reduce tail latency by lowering KV cache and memory traffic.
- Tools/workflows: Autoscalers tune T per-tenant; admission control uses predicted density p to pack batches.
- Dependencies/assumptions: Telemetry to track density–quality trade-offs; safeguards for latency/quality SLOs.
- On-device and on-prem LLMs with larger contexts
- Sectors: mobile, edge, regulated industries
- What: Fit longer-context models on consumer GPUs or edge devices by shrinking KV footprints, enabling private/offline assistants with richer context.
- Tools/workflows: Combine SP-KV with quantization (e.g., K/V quant), CPU–GPU paging policies aware of gated KVs.
- Dependencies/assumptions: Kernel and memory manager support for sparse KV layouts; evaluate battery/thermals; some tasks (e.g., code) show small regressions.
- Training-time efficiency for post-training and RL agents
- Sectors: LLM providers, robotics, tools/agent frameworks
- What: Apply SP-KV during instruction-tuning or RL with long rollouts to reduce memory bandwidth and accelerate experiments without major loss.
- Tools/workflows: Integrate soft gating in training loops; use Threshold-Aware Hard Gating (TAHG) near the end of schedules.
- Dependencies/assumptions: Continued training with SP-KV recommended; verify distribution shift (agentic traces vs. pretraining text).
- Product-level quality–efficiency controls
- Sectors: SaaS productivity, IDEs, collaboration tools
- What: Expose a user-visible “efficiency slider” (adjusting T) to trade latency/battery for fidelity in long sessions (e.g., summarizing large docs, codebases).
- Tools/workflows: Client SDK toggles between high-sparsity and high-fidelity modes; telemetry-driven default profiles.
- Dependencies/assumptions: Clear UX and guardrails for quality-sensitive tasks; offline A/B to set safe ranges.
- Interpretability and model diagnostics
- Sectors: academia, safety teams, MLOps
- What: Use gate activations to identify which tokens/layers/heads carry long-range utility; debug retrieval failures and prompt placement.
- Tools/workflows: Dashboards showing per-head densities and token-level utility over time; regression alerts when density patterns drift.
- Dependencies/assumptions: Logging of gate values; privacy controls for token-level traces.
- Architecture guidance for hybrid local–global attention
- Sectors: LLM R&D (industry and academia)
- What: Leverage SP-KV density statistics to choose which heads become global in hybrid designs, improving performance at a fixed KV budget.
- Tools/workflows: Automated head selection pipelines; NAS scripts optimizing “density coverage.”
- Dependencies/assumptions: Requires a reference SP-KV model to collect statistics; transferability across data/domains should be validated.
- Security and compliance posture improvement
- Sectors: healthcare, finance, public sector
- What: Smaller and shorter-lived KV caches reduce data residency and attack surface during inference.
- Tools/workflows: Policies to purge sparse caches; logs proving reduced memory footprints.
- Dependencies/assumptions: Must not prune compliance-critical tokens; integrate “must-keep” sinks; audit trails for gate decisions.
- Library and framework extensions
- Sectors: open-source ecosystems, tooling vendors
- What: Add SP-KV modules to PyTorch/Hugging Face/OnnxRuntime backends and scheduler APIs for T and window size w.
- Tools/workflows: FA3-based kernels with block skipping; config templates for popular backbones (Llama, Gemma).
- Dependencies/assumptions: Community-maintained kernels; compatibility with quantization/pruning stacks.
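Several items above rely on a runtime "sparsity knob" T. One simple way such a knob could be calibrated, offered purely as a hypothetical sketch and not as a method from the paper, is to pick T from a sample of utility scores so that a target fraction of them exceeds it. The function names and the calibration sample are illustrative assumptions; scores are assumed to lie strictly between 0 and 1.

```python
def threshold_for_density(utilities, target_density):
    """Pick T so that about target_density of the sample scores satisfy u > T.

    Ties at the cut-off can leave slightly fewer scores above T than requested.
    """
    if not 0.0 < target_density <= 1.0:
        raise ValueError("target_density must be in (0, 1]")
    ranked = sorted(utilities, reverse=True)
    k = max(1, round(target_density * len(ranked)))
    if k >= len(ranked):
        return 0.0  # keep everything (scores assumed strictly positive)
    return ranked[k]  # the k largest scores lie strictly above this value if distinct


def observed_density(utilities, threshold):
    """Fraction of scores that would be written to the long-term cache."""
    return sum(u > threshold for u in utilities) / len(utilities)


# Hypothetical calibration sample of utility scores.
sample = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.4]
T = threshold_for_density(sample, target_density=0.25)
print(T, observed_density(sample, T))  # prints 0.7 0.25
```

A threshold chosen this way only matches the target density on inputs that resemble the calibration sample, so the density-quality telemetry mentioned above would still be needed in practice.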
Long-Term Applications
- Hardware and runtime co-design for sparse KV
- Sectors: semiconductors, systems software
- What: Architect memory layouts, schedulers, and tensor cores for block-skipping and sparse KV access; expose KV-density-aware prefetching in compilers.
- Tools/workflows: Triton/CUDA kernels with per-block masks; hardware counters for KV bandwidth.
- Dependencies/assumptions: Vendor support; standards for sparse-KV formats across frameworks.
- Extreme long-context models (100k–1M tokens)
- Sectors: research tools, legal e-discovery, scientific curation
- What: Combine SP-KV with local/global hybrids and retrieval to scale contexts far beyond 32k while keeping costs practical.
- Tools/workflows: Curriculum with long-context data; prompt engineering to keep anchors in the local window.
- Dependencies/assumptions: Additional long-context training data; robust kernel scaling and memory management.
- Cross-domain and multilingual adaptation
- Sectors: global enterprises, localization
- What: Re-train or adapt utility predictors for non-English and domain-specific distributions (code, biomedical, legal).
- Tools/workflows: Domain-aware SP-KV midtraining; per-domain thresholds.
- Dependencies/assumptions: Access to representative corpora; potential re-tuning when distribution shifts.
- Multimodal and speech LLMs with sparse-write caches
- Sectors: media, meetings intelligence, assistive tech
- What: Extend SP-KV gating to audio/vision tokens to trim long-context caches in transcriptions, video understanding, or VLMs.
- Tools/workflows: Modality-aware utility predictors; synchronization with frame/segment boundaries.
- Dependencies/assumptions: New gating architectures per modality; data and training recipes.
- Automated architecture synthesis and compilers
- Sectors: model tooling, AutoML
- What: Use SP-KV statistics to compile models into static hybrids (head/layer allocation) or generate per-workload blueprints.
- Tools/workflows: Ahead-of-time compilers that emit optimized hybrids; CI pipelines that regenerate layouts when training data changes.
- Dependencies/assumptions: Stability of density patterns across tasks; rules for safe fallback when patterns drift.
- Joint sparsification and compression stacks
- Sectors: deployment at scale, edge
- What: Combine SP-KV (sparse writes) with KV quantization/merging and read-time retrieval to push cost lower without quality loss.
- Tools/workflows: Budget orchestrators allocating between density, bits, and merge ratios; test suites for interaction effects.
- Dependencies/assumptions: Careful evaluation of compounding errors; per-task tuning.
- Privacy-preserving, energy-aware inference policies
- Sectors: policy, sustainability, regulated IT
- What: Define procurement and reporting guidelines that include KV-density metrics and energy-per-token for long-context workloads.
- Tools/workflows: Model cards including KV sparsity and speedup factors; sustainability dashboards.
- Dependencies/assumptions: Community consensus on metrics; independent audits.
- Safety-critical guardrails for dynamic sparsification
- Sectors: healthcare, aviation, autonomous systems
- What: Enforce dense or minimally sparse modes for high-risk prompts; certify SP-KV behavior under worst-case inputs.
- Tools/workflows: Policy engines that lower T (so more KVs are kept) or disable pruning on safety predicates; continuous verification tests.
- Dependencies/assumptions: Reliable prompt classification; cost of fallback modes accepted.
- Learning-to-control sparsity with external signals
- Sectors: agents, robotics, operations
- What: Condition T or gate logits on retrieval scores, tool outputs, or uncertainty to allocate memory where it matters most.
- Tools/workflows: Controllers that modulate gates online; reward-shaping in RL for sparsity-vs-quality.
- Dependencies/assumptions: Additional training for stability; safeguards against adversarial inputs.
- Developer-facing observability and AIOps
- Sectors: MLOps, platform engineering
- What: Build monitors tying KV density to latency, cost, and task success rates; auto-rollback when quality dips.
- Tools/workflows: SLO-driven autoscaling (adjust T); canary pipelines; density heatmaps per layer/head.
- Dependencies/assumptions: Low-overhead metrics; robust correlation analyses to avoid false positives.
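The architecture-synthesis item above depends on turning per-head density statistics into a head allocation. A minimal sketch of that step is given below, assuming every head sees the same number of candidate keys so that a head's density is proportional to its count of retained global keys. The greedy top-k selection and the function names are illustrative assumptions, not the paper's search procedure.

```python
def pick_global_heads(head_density, budget):
    """Choose the `budget` heads with the highest measured density as global heads."""
    ranked = sorted(head_density, key=head_density.get, reverse=True)
    return ranked[:budget]


def density_coverage(head_density, global_heads):
    """Share of the reference model's retained global keys that stay global."""
    total = sum(head_density.values())
    covered = sum(head_density[h] for h in global_heads)
    return covered / total


# Hypothetical densities keyed by (layer, head).
head_density = {(0, 0): 0.6, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.1}
chosen = pick_global_heads(head_density, budget=2)
print(chosen, round(density_coverage(head_density, chosen), 2))  # prints [(0, 0), (1, 0)] 0.8
```

Here two of four heads remain global yet cover about 80% of the keys the reference model chose to keep, which is the kind of trade-off the coverage metric is meant to expose.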
Notes on feasibility and assumptions across applications:
- Benefits are largest in memory-bound, long-context, or batched decoding regimes; speedups shrink with short sequences or near-dense settings.
- Reported results are primarily on English-centric pretraining; transfer to multilingual and specialized domains requires validation and potentially additional SP-KV midtraining.
- Some tasks (e.g., code generation) show small degradations; organizations should adopt per-task thresholds and “must-keep” token policies.
- Systems gains depend on high-quality kernel implementations and scheduler awareness of sparse KV patterns; further engineering can unlock larger wall-clock improvements.
Glossary
- Additive bias: An extra term added to attention scores to enforce constraints or masks. "We combine this causal mask and the gating into a single additive bias, as"
- Annealing: A gradual transition technique to stabilize training when changing regimes. "we smooth the transition through annealing for models trained with 32k context windows."
- Autoregressive generation: Sequential token generation where each token conditions on previously generated tokens. "During autoregressive generation, their key-value (KV) cache grows linearly with sequence length"
- Binarized gates: Utility gates reduced to binary on/off decisions based on a threshold. "binarized gates enable a block-skipping optimization"
- Block-skipping optimization: An inference-speed technique that skips computation for entirely pruned KV blocks. "binarized gates enable a block-skipping optimization: before the kernel launch, we precompute a per-head sparsity mask"
- Causal local sliding window: A fixed-size recent-context region always allowed for attention under causality. "we always allow attention within a causal local sliding window of size w (by default 128)."
- Causal mask: A mask that prevents attending to future positions to preserve autoregressive causality. "Let $M_{\text{causal}}(t, s) \in \{0, -\infty\}$ be the standard causal mask bias ($0$ if $s \le t$, $-\infty$ otherwise)."
- Continual pretraining: Further pretraining of a model from an existing checkpoint to adapt new mechanisms or data. "typically through continual pretraining from a pretrained full attention checkpoint."
- Cosine-decay schedule: A learning-rate schedule that decays following a cosine curve. "for the first 75% of the cosine-decay schedule"
- Decoding rollouts: Long sequences generated during inference or RL training episodes. "extended decoding rollouts (Zhu et al., 2025; Wang et al., 2025)."
- Density coverage: The fraction of useful global keys preserved by a hybrid architecture relative to a reference. "The density coverage (ratio of global keys from the reference model that would remain global under the new architecture) largely distinguishes the four architectures."
- Differentiable gating: A continuous gating mechanism used during training to maintain gradient flow. "During training, token selection is replaced by differentiable gating to preserve gradient flow."
- End-to-end: Jointly training all components directly on the task objective without separate supervision. "trained jointly end-to-end exclusively through next-token prediction loss"
- Eviction policies: Strategies to remove or retain entries in the KV cache under memory constraints. "Early eviction policies preserve recent tokens and attention sinks only (StreamingLLM, Xiao et al. (2024b))"
- FlashAttention-3: A fast attention kernel optimized for modern GPUs. "using FlashAttention-3 kernels."
- FLOPs: Floating-point operations, a measure of computational cost. "Training compute (FLOPs)"
- Gated attention: Attention modified by learned gates that restrict which keys/values are available. "The resulting gated attention for query position t is then given by"
- Gated DeltaNet: A fixed-memory sequence mechanism used as an alternative to standard attention. "Gated DeltaNet (Yang et al., 2025)."
- GPU memory traffic: Data movement between GPU memory and compute units that can bottleneck performance. "turns GPU memory traffic into a central performance bottleneck."
- Grouped Query Attention (GQA): An attention variant that shares keys/values across groups of query heads to reduce KV size. "All models rely on standard Grouped Query Attention (GQA) which already reduces the KV-cache size by 4-6x compared to MHA."
- Hopper GPUs: NVIDIA GPU architecture targeted by the optimized kernels in this work. "for Hopper GPUs."
- Hybrid local-global attention: Architectures mixing local sliding-window layers with global attention layers to save memory. "hybrid transformers reduce their reliance on global attention by interleaving the usual global attention with local sliding-window attention (Beltagy et al., 2020; Gemma Team, 2024)"
- Instruction tuning: A post-pretraining phase where models are optimized to follow instructions. "it can be naturally carried into later training stages, such as instruction tuning or reinforcement learning,"
- kNN memories: External memories that retrieve nearest neighbor vectors to augment attention. "kNN memories, block memories, or fast key retrieving subnetworks"
- KV cache: The stored keys and values from past tokens used to compute attention during decoding. "the Key-Value cache memory footprint and bandwidth."
- KV utility predictor: A lightweight module that estimates the future usefulness of each KV pair. "The learned KV utility predictor conditions key-value utilization in the attention operation."
- Landmark-based access patterns: Sparse-read schemes that use landmark tokens for efficient long-context access. "landmark-based access patterns (Mohtashami and Jaggi, 2023),"
- LongPPL: A long-context validation benchmark used to assess perplexity. "evaluated on the LongPPL validation set (Fang et al., 2025)."
- Memory-bound regimes: Settings where performance is limited primarily by memory bandwidth/latency rather than compute. "The results show clear gains in memory-bound regimes, especially for batched long-context decoding."
- MHA (Multi-Head Attention): The standard attention mechanism with multiple independent heads. "which already reduces the KV-cache size by 4-6x compared to MHA."
- MLP (Multi-Layer Perceptron): A small feedforward network used here as the utility predictor. "a 2 layer perceptron (MLP)."
- Negative log-likelihood (NLL): A standard language modeling loss metric equivalent to log-perplexity up to scaling. "negative log-likelihoods (NLLs)"
- Needle in a Haystack (NIAH): A long-context retrieval diagnostic where a small key fact must be found in long text. "Needle in a Haystack (NIAH) requires only about 5-7% retained KV entries"
- Next-token prediction loss: The training objective to predict the next token in a sequence. "exclusively through next-token prediction loss"
- Online softmax: A numerically stable streaming softmax computation used in FlashAttention. "FA3's online softmax"
- Pareto-dominates: Achieving better trade-offs on two objectives simultaneously (e.g., sparsity vs. NLL). "Full attention + SP-KV Pareto-dominates the Hybrid 3:1 configuration (Codegen Team et al., 2025),"
- Persistent KV cache: The long-term portion of the KV cache retained beyond the local window. "only KV pairs above a given utility threshold T are kept in the persistent KV cache,"
- Power law: A functional form modeling how metrics scale with compute. "we fit a one-dimensional power law to the empirical negative log-likelihoods (NLLs)"
- Prefill: The initial pass to populate the KV cache before decoding or pruning. "have attempted to prune the cache after prefill using past token statistics"
- Reinforcement learning (RL): Learning paradigm optimizing behavior via rewards; here used for long-context training. "long-context reinforcement learning and tool-integrated agent training rely on extended decoding rollouts"
- RULER: A benchmark assessing effective long-context capabilities. "RULER long-context benchmark (13 subtask types)"
- Sliding-window attention: Attention restricted to a fixed local window around each token. "interleaving the usual global attention with local sliding-window attention"
- Sparsification: Making attention or caches sparse by removing low-utility elements. "SP-KV performs dynamic sparsification"
- Straight-through estimator: A gradient estimator that treats discrete operations as identity in backprop. "relies on stochastic sampling and a straight-through estimator (Bengio et al., 2013)."
- Threshold-Aware Hard Gating (TAHG): A training phase using binarized utility gates aligned with inference thresholds. "Thresholding-Aware Hard Gating (TAHG)."
- TMA/MMA compute: GPU data-transfer (TMA) and matrix-multiply-accumulate (MMA) operations referenced by kernels. "The kernel skips both TMA loads and MMA compute"
- Token-to-parameter ratio (TPP): A scaling metric for training that sets tokens per non-embedding parameter. "140 training tokens per non-embedding parameter (TPP)"
- Utility predictor: The model component that scores each KV pair’s future usefulness. "A lightweight utility predictor assigns a utility score to each KV pair;"
- Warmup-stable learning rate schedule: A schedule with an initial warmup and then a stable phase before decay. "using a warmup-stable learning rate schedule."
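Several glossary entries (additive bias, causal mask, causal local sliding window, differentiable gating) describe pieces of one computation. The sketch below shows how they might fit together during soft-gated training, under an assumed form of the bias: $-\infty$ for future positions, $0$ inside the local window, and $\log u_s$ for older keys, where $u_s$ is the predicted utility. The exact formula used in the paper is not reproduced in this summary, so this form and the function name are assumptions.

```python
import math


def gated_attention_weights(scores, utilities, t, window=128):
    """Softmax over raw attention scores plus an additive gating bias, for query t.

    Assumed bias: -inf for future keys, 0 inside the local window,
    log(u_s) for older keys, so an older key's weight is scaled by u_s.
    """
    logits = []
    for s, (score, u) in enumerate(zip(scores, utilities)):
        if s > t:
            bias = -math.inf                            # causal mask
        elif t - s < window:
            bias = 0.0                                  # local window: never gated
        else:
            bias = math.log(u) if u > 0 else -math.inf  # soft gate on older keys
        logits.append(score + bias)
    m = max(logits)  # finite, because position t itself always has bias 0
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Equal raw scores; only the oldest key (utility 0.5) is down-weighted.
print(gated_attention_weights([0.0] * 4, [0.5, 1.0, 1.0, 1.0], t=2, window=1))
```

With equal raw scores, the key gated at utility 0.5 receives half the weight of an ungated key, and the future position receives none. Replacing $\log u_s$ with $0$ or $-\infty$ according to a threshold gives the binarized gates used at inference.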