Still: Amortized KV Cache Compaction in a Single Forward Pass

Published 5 Jun 2026 in cs.LG | (2606.07878v1)

Abstract: The KV cache is the memory bottleneck of long-horizon LLM deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and reusable across a trajectory. Existing compaction methods satisfy only part of this requirement: selection methods are lightweight but subset-bound, while synthesis methods are expressive but rely on per-context optimization. Here we introduce Still, a small per-layer Perceiver trained once against a frozen base model that produces compact keys and values in a single forward pass. On Qwen and Gemma models, Still occupies the favorable side of the speed--quality frontier across compression ratios from $8\times$ to $200\times$ and context lengths from $8$k to $128$k. On the long-context RULER grid, Still exceeds the strongest baseline by 8--22 points. The same compact cache also supports free-form summarization, preserving most of the full-context gain on HELMET and winning a pairwise LongBench summarization comparison against KV-Distill. Because compaction is a forward pass, Still can be applied iteratively, entering a long-horizon regime unavailable to per-context methods. We show that amortization makes long-context cache compaction tractable, and synthesis makes its compact state useful at extreme compression.

Authors (5)

Summary

The paper introduces an amortized, synthesis-based compactor using a Perceiver module to achieve efficient one-pass KV cache compaction in large language models.
The paper details model transferability across varying architectures, maintaining high cache utilization and robust performance even at high compression ratios.
The paper demonstrates that iterative compaction with targeted training regimes enables scalable LLM deployment, achieving competitive performance on long-context summarization tasks.

Still: Amortized KV Cache Compaction in a Single Forward Pass

Problem Setting: KV Cache Bottleneck and Compaction Paradigms

The key-value (KV) cache represents the dominant memory bottleneck for long-horizon deployment of autoregressive LLMs. Scaling LLMs to multi-turn, multi-day, or repository-scale tasks imposes a severe linear memory burden as the KV cache grows with context length, creating a fundamental systems constraint for context extension and deployment. While alternatives such as retrieval-augmented generation or document summarization exist, these approaches do not preserve the model's internal representations. Thus, efficient, practical, and loss-minimizing cache compaction mechanisms are required.
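The linear growth described above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the layer count, KV-head count, head dimension, and 2-byte element size are hypothetical values, not the configuration of any model evaluated in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to hold keys plus values for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    full = kv_cache_bytes(seq_len=seq_len, **cfg)
    compact = full / 50  # same cache at an assumed 50x compression ratio
    print(f"{seq_len:>7} tokens: {full / 2**30:6.2f} GiB full, "
          f"{compact / 2**30:6.3f} GiB at 50x")
```

Doubling the context doubles the cache, which is why a fixed compression ratio lowers the slope of memory growth but does not make it constant.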

Prior compaction strategies can be grouped along two orthogonal axes: selection vs. synthesis, and per-context vs. amortized computation. Selection techniques, whether heuristic or learned, retain a subset of the original token KV entries, so their representational power is bounded by that subset. Synthesis approaches generate new, content-dependent KV entries, which removes the subset bound but often requires per-input optimization that is infeasible at inference time and in streaming regimes. Amortized variants such as KV-Distill learn a selection policy once and reuse it across contexts, but they remain selection methods.

Method: The Still Compactor

Still introduces a Perceiver-based module deployed per transformer layer, designed for amortized, synthesis-based compaction of the KV cache in frozen LLMs. Each compactor maintains a bank of layerwise learned latent queries that cross-attend the full, concatenated keys and values (with position encoding undone and reapplied as needed), then project the result to a configurable number of compact K, V entries in a single forward pass. The design is plug-in, non-intrusive to base model weights, and supports both global and hybrid (e.g., sliding-window/global) attention stacks.

The compactor architecture is fixed for experiments (default: $d_e=256$, $B=2$ blocks, $t=128$), exposing compression ratio as the principal scaling parameter. Initialization is identity-like for stability when $t=T$, and the compactor supports chunked contexts and iterative application.
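The core read operation described above, learned latents cross-attending over a layer's concatenated keys and values and projecting to `t` compact entries, can be sketched in NumPy. This is a minimal single-head, single-block sketch under stated assumptions, not the paper's implementation: all parameter names (`latents`, `Wq`, `Wk`, `Wv`, `Wout_k`, `Wout_v`) are hypothetical, weights are random rather than trained, and the self-attention, MLP, identity-like initialization, and RoPE handling are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d, d_e, t):
    """Random (untrained) parameters; names are hypothetical."""
    def w(*shape):
        return rng.normal(0.0, 0.02, shape)
    return {"latents": w(t, d_e), "Wq": w(d_e, d_e),
            "Wk": w(2 * d, d_e), "Wv": w(2 * d, d_e),
            "Wout_k": w(d_e, d), "Wout_v": w(d_e, d)}


def compact_layer(K, V, p):
    """Map (T, d) keys and values to (t, d) compact keys and values."""
    X = np.concatenate([K, V], axis=-1)            # (T, 2d) concatenated K and V
    Q = p["latents"] @ p["Wq"]                     # (t, d_e) latent queries
    Kx, Vx = X @ p["Wk"], X @ p["Wv"]              # (T, d_e) each
    A = softmax(Q @ Kx.T / np.sqrt(Q.shape[-1]))   # (t, T) cross-attention weights
    H = p["latents"] + A @ Vx                      # (t, d_e) residual update
    return H @ p["Wout_k"], H @ p["Wout_v"]        # compact K, compact V


rng = np.random.default_rng(0)
T, d, d_e, t = 1024, 64, 256, 128                  # 8x compression in this toy setup
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
K_c, V_c = compact_layer(K, V, init_params(rng, d, d_e, t))
print(K_c.shape, V_c.shape)                        # (128, 64) (128, 64)
```

Because every compact row is a learned mixture over all `T` positions, the output is not restricted to rows of the original cache, which is the distinction between synthesis and selection.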

Key features include:

Position-free compaction: Keys are de-rotated to remove position-dependence, with RoPE phases handled internally, avoiding instabilities from compaction mixtures across positions.
One-pass and iterative compaction: For static prefixes, compaction is a one-time map; for streaming or very long contexts, an iterative chunked schedule with interleaved lookahead retains constant local compression, allowing for scalable context ingestion.
Amortized training objective: Each compactor is trained using forward KL from the full-context teacher distribution, with only the compactor parameters updated (base LM frozen).
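The position-free step listed above can be illustrated with a small RoPE round trip. This is a sketch under assumptions: it uses the interleaved-pair RoPE convention with base 10000 (real checkpoints may use a half-split layout and a different base), and the positions assigned to the compact keys (`new_pos`) are hypothetical, since this summary does not state how the paper assigns them.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0, inverse=False):
    """Apply (or undo) rotary position embedding on x of shape (T, d), d even."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    angles = np.outer(positions, inv_freq)         # (T, d/2) rotation angles
    if inverse:
        angles = -angles                           # rotating by -theta undoes RoPE
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


rng = np.random.default_rng(0)
content = rng.normal(size=(6, 8))                  # position-free key content
pos = np.arange(100, 106)                          # original token positions
cached = rope_rotate(content, pos)                 # what a RoPE cache stores
derot = rope_rotate(cached, pos, inverse=True)     # de-rotated keys
print(np.allclose(derot, content))                 # True

new_pos = np.arange(0, 6)                          # hypothetical compact-slot positions
rerot = rope_rotate(derot, new_pos)                # re-rotated for use by the base model
```

Mixing keys before de-rotation would blend vectors rotated by different angles, which is the instability the position-free design is meant to avoid.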

Empirical Results: Speed-Quality, Long-Context, and Summarization

Speed-Quality Frontier

Still consistently occupies a superior position on the speed-quality Pareto frontier across 8x–200x compression and 8k–128k context windows, compared to representative selection, synthesis, and amortized baselines (H2O, SnapKV, StreamingLLM, Attention Matching, KV-Distill). Notably, selection-based methods degrade rapidly under tight memory budgets, and on long contexts, only Still and per-context synthesis methods maintain accuracy, with Still offering orders-of-magnitude efficiency gains due to amortization.

Model Transferability

The per-layer compactor readily transfers across model scales (Qwen3-4B up to Qwen3-32B and MoE architectures) and to models with mixed attention mechanisms (e.g., Gemma-3). Quantitative results across these settings corroborate the utility of the approach, matching or exceeding fixed-cache baselines while maintaining strong cache utilization (e.g., MCQ accuracy utilization consistently above 60–70% at compression ratios up to 50x).

Long-Context and Iterative Compaction

On long-context benchmarks such as RULER, Still notably exceeds KV-Distill by 8–22 accuracy points across 16/18 matched configurations, confirming the practical expressivity benefit of amortized synthesis over amortized selection. Iterative compaction experiments further show that compactors trained on longer horizons (16k/32k) preserve utility over 128k contexts, whereas those trained only at short horizons collapse, demonstrating a strong dependence on training regime for scaling iterative compaction.

Free-form Summarization

Still's compact cache supports not only answer selection but also free-form summarization. On HELMET multi_lexsum and LongBench v1 pairwise summarization audits, Still preserves most of the full-context gain (recovering 74–95% across 8k–64k, and even 59% at 128k), and outperforms both Attention Matching and KV-Distill, with a gap widening with context length and compression.
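One common way to express "fraction of the full-context gain recovered" is to normalize by the gap between a full-context score and a no-context score. The helper below assumes that definition; the exact metric used in the paper is not given in this summary, and the scores in the example are made up for illustration.

```python
def gain_recovered(compact, full, no_context):
    """Share of the full-context improvement that the compact cache keeps."""
    gain = full - no_context
    if gain <= 0:
        raise ValueError("full-context score must exceed the no-context score")
    return (compact - no_context) / gain


# Made-up scores, for illustration only.
print(f"{gain_recovered(compact=30.0, full=32.0, no_context=20.0):.1%}")  # 83.3%
```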

Architectural and Training Insights

Ablation studies reveal that the primary benefits result from joint synthesis of keys and values, rather than simply writing better values into selected positions; both components are required for optimal performance. The architecture is robust to modifications (self-attention removal, MLP substitution, cross-attention head count), and increasing the number of latents yields diminishing returns beyond 1024 for 8k context, consistent across domains.

Critically, the iterative compaction regime surfaces an inherent limitation: the training horizon dictates safe extrapolation length. Compactors trained at limited sequence lengths degrade beyond their envelope, highlighting a need for curriculum or direct-iteration-aware training regimes for deployment at extreme context windows.

Implications and Future Directions

This work introduces a practical, amortized, synthesis-capable compactor that enables long-context LLM deployment with tractable hardware requirements. Its single-pass, non-intrusive design and its empirical utility across multiple architectures and tasks are a significant step toward scalable, efficient memory management in LLMs.

Open directions include:

Extending amortized synthesis to support constant-bounded recurrent memory (rather than fixed compression ratio with linear cache growth).
Exploring curriculum/mixture training to broaden and stabilize the compactor's operating envelope for million-token or longer deployments.
Enhancing retrieval fidelity for exact/needle tasks, which remain an open challenge for current compaction and summarization techniques.
Integrating adaptive or dynamically-configured bias heads in deployment, potentially informed by calibration diagnostics.

Conclusion

Still is an amortized, synthesis-based KV cache compaction approach that is tractable at inference, outperforms selection methods when information density and compression are high, and generalizes across model scales and architectures. Its iterative application and its transfer to both MCQ and generative summarization tasks make it a candidate primitive for long-context LLM systems. The results also highlight the importance of matched-horizon training and motivate future work on adaptive, recurrent memory management strategies (2606.07878).

Explain it Like I'm 14

1) What this paper is about (big picture)

LLMs remember what they’ve read using a short-term memory called the KV cache. This memory grows with the length of the text, so it gets huge for long documents or long chats. The paper introduces “Still,” a small add-on that shrinks this memory without losing too much important information. The goal is to let LLMs handle much longer contexts, faster and with less memory, while keeping accuracy high.

2) The key questions the paper asks

Can we compress an LLM’s memory during use (inference) quickly, in one step, instead of slowly customizing it for each new document?
Can this compression keep the most useful information, not just pick a few original pieces, so the model stays accurate even at high compression?
Will this work across different model sizes and types, and for both answering questions and writing summaries?
Can we keep compressing as new text arrives over time (like a long-running conversation) without the method falling apart?

3) How the method works (in simple terms)

Think of the KV cache as a big notebook the model keeps while reading: “keys” are like labels that help find things, and “values” are the notes themselves. The problem: the notebook gets too thick.

Still adds a tiny “note-taking assistant” to every layer of the model. Here’s the idea:

It reads the whole big notebook for that layer.
It asks smart questions (cross-attention) to figure out which parts matter most.
It writes a much shorter, organized summary—new compact keys and values—that the model can use later as if it were the original notebook.

Key points explained:

One forward pass: Still does this in a single quick run, like summarizing a chapter in one go.
Amortized training: You train the note-taking assistant once, and then reuse it on any new text. This is like teaching someone how to summarize well so they can summarize any book quickly later.
Synthesis vs. selection: Instead of keeping a small subset of original notes (selection), Still creates new compact notes that blend the important bits together (synthesis). That makes the short notebook more expressive.
Position handling (RoPE “un-rotate/re-rotate”): The model’s keys depend on where words appear (their positions). Still temporarily removes that position effect to summarize content cleanly, then adds positions back so the model can use the short notes correctly.
Iterative compaction: For long streams (like a multi-day agent), Still can compress in chunks—summarize each new section and keep moving—so memory grows slowly, not as fast as the raw text.
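The "memory grows slowly" behavior of iterative compaction can be simulated with simple bookkeeping. The schedule below is an assumption made for illustration: the newest chunk is kept uncompacted as a one-chunk lookahead, and each earlier chunk is compacted to `chunk // ratio` slots when the next chunk arrives. The function name and the chunk/ratio values are hypothetical.

```python
def iterative_cache_sizes(total_tokens, chunk, ratio):
    """Cache length (in KV slots) after each chunk arrives.

    Assumed schedule: the newest chunk stays raw as lookahead; the chunk
    before it is compacted to chunk // ratio slots.
    """
    compact_slots, sizes = 0, []
    for i in range(total_tokens // chunk):
        if i > 0:
            compact_slots += chunk // ratio   # previous chunk is now compacted
        sizes.append(compact_slots + chunk)   # compact history + raw newest chunk
    return sizes


sizes = iterative_cache_sizes(total_tokens=131_072, chunk=4096, ratio=8)
print(sizes[0], sizes[-1])  # 4096 19968
```

Under these assumptions the cache grows by 512 slots per 4096-token chunk: still linear in the stream length, but at one eighth of the raw rate.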

Training setup in everyday terms:

The base LLM is frozen (unchanged).
Still learns by trying to make the compact-memory model produce similar answers to the full-memory model. It compares their predictions on answer tokens and nudges its summaries to match (this is done with a “KL divergence” loss, which you can think of as “make your guesses look like the teacher’s guesses”).
After training, Still is just a fast plug-in you call during inference.
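The "match the teacher's guesses" objective can be sketched as a forward KL divergence between the full-cache model (teacher) and the compact-cache model (student). This page notes elsewhere that the KL is computed on the top-200 teacher logits; the sketch follows that, but renormalizing both distributions over the kept tokens is an assumption, and the logits are random stand-ins for real model outputs.

```python
import numpy as np


def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def topk_forward_kl(teacher_logits, student_logits, k=200):
    """Mean KL(teacher || student) over positions, on each position's top-k teacher tokens."""
    # Indices of the k largest teacher logits at every position: shape (N, k).
    idx = np.argpartition(teacher_logits, -k, axis=-1)[..., -k:]
    t = np.take_along_axis(teacher_logits, idx, axis=-1)
    s = np.take_along_axis(student_logits, idx, axis=-1)
    log_p, log_q = log_softmax(t), log_softmax(s)   # renormalized over kept tokens (assumption)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 1000))                # 4 answer positions, hypothetical vocab of 1000
student = teacher + 0.5 * rng.normal(size=teacher.shape)
print(topk_forward_kl(teacher, teacher))            # 0.0
print(topk_forward_kl(teacher, student) > 0)        # True
```

In training, only the compactor's parameters would receive gradients from this loss; the base model that produces both sets of logits stays frozen.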

4) What they found and why it matters

Highlights:

Strong speed–quality trade-off: Across many tests, Still stayed both fast and accurate, even when compressing the cache 8x up to 200x and for long contexts (8k to 128k tokens). Other methods were either fast but lost too much accuracy, or accurate but slow.
Works across models: It transferred well from smaller to larger Qwen models (4B to 32B, including a mixture-of-experts) and to Gemma-3’s attention setup.
Beats a leading learned baseline: On long-context tasks (RULER), Still beat KV-Distill by 8–22 points in most matched settings, especially when contexts are long and memory is tight.
Helps summarization, not just Q&A: On HELMET (multi_lexsum), Still kept 74–95% of the full-context benefit at 8k–64k, and still kept 59% at 128k, outperforming comparison methods at every length. On LongBench (free-form summaries), Still won 60% of head-to-head judged comparisons versus KV-Distill.
Can run repeatedly as text grows: Because compaction is a simple forward pass, Still can keep summarizing new chunks over time. Training it on longer “horizons” made it hold up better at very long contexts (e.g., 128k tokens).

Why this matters: It shows you don’t have to choose between “keep everything” (too big/slow) and “throw most of it away” (lose accuracy). You can “repack” the memory into a smaller, still-useful form quickly and reuse the same trained compressor across many inputs.

5) What this could change (impact and limits)

Impact:

Longer, cheaper, faster: Apps like coding agents, research assistants, or long customer chats can run much longer without running out of memory or slowing down too much.
Plug-in for open models: Since the base model isn’t changed, Still can be trained once per model checkpoint and then dropped in as a lightweight add-on.
Better than picking a few tokens: Synthesizing compact notes keeps important info even when you must compress a lot, which selection methods struggle with.

Limits to keep in mind:

Not lossless: Full context still performs best, especially at the hardest, longest settings.
Training horizon matters: To stay strong at very long streams (like 1 million tokens), you need to train Still on similarly long, chunked scenarios.
Memory still grows (slowly): In the iterative mode, memory grows at a reduced rate, not a fixed constant size yet.
One compactor per base model: Each base model needs its own trained Still module.

In short, Still is like teaching a fast, reliable note-taker to compress an LLM’s memory on the fly. It keeps much of the model’s power while using far less memory and time, making very long contexts practical in everyday use.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper and could guide future research.

Constant-budget recurrence: The iterative schedule maintains a fixed compression ratio but not a constant absolute cache size (O(1) memory). How to design and train in-place slot-reuse/merge policies that truly bound the cache independently of trajectory length?
Training under recurrence: The compactor is trained with single-pass KL (except where noted) and then deployed recurrently. What training objectives and schedules (e.g., unrolled recurrence, truncated BPTT, policy gradients) best stabilize multi-pass, long-horizon compaction?
Curriculum over horizon: Performance collapses when the deployed number of compaction steps exceeds the training horizon. What curricula (progressively increasing chunk count/length) or synthetic long-horizon datasets most efficiently teach stability to 100–1000+ compaction steps?
Lookahead design: Only a one-chunk KV lookahead buffer is studied. What is the optimal lookahead length, can it be adaptive, and how does it trade off quality vs. memory/latency across tasks and base models?
Constant vs. adaptive compression: The method uses a fixed t per layer and chunk. Can dynamic, content-aware allocation of compact slots across layers, heads, and chunks improve utilization under a fixed total budget?
Inter-layer coordination: Compactors are trained per layer with no explicit inter-layer communication. Would cross-layer channels, hierarchical/stacked compactors, or joint training across layers improve fidelity at high compression?
Bias channel β: The architecture supports per-token bias terms but disables them in all reported results. When, if ever, do learned biases help (single-pass vs. recurrent settings), and what kernel/engineering paths make them practical?
RoPE dependence and position handling: The approach relies on un-rotating and re-rotating RoPE keys. How does it generalize to other positional schemes (ALiBi, learned absolute/relative), mixed schemes, or models without RoPE?
Mixed attention stacks: For sliding-window/global hybrids (e.g., Gemma-3), only global layers are compacted. Can sliding-window layers also be compacted (e.g., via block-level resampling) without harming locality and efficiency?
MoE dynamics: While Qwen MoE transferred, the interaction between compaction and expert routing is not analyzed. Does compaction skew routing distributions or degrade sparse expert utilization in larger/denser MoE settings?
Exact retrieval and needle tasks: The paper notes weaker performance on exact-retrieval tasks; systematic evaluation and targeted improvements (e.g., hybrid selection+synthesis, explicit copy heads) remain open.
Open-ended generation breadth: Free-form evaluation is limited (HELMET multi_lexsum; two LongBench tasks). How does the compact cache affect broad generative workloads (dialogue, tool use, code generation/execution, reasoning chains) and long multi-turn agent traces?
Task/domain/language robustness: Training data is a four-domain, English QA corpus. How well does Still generalize to diverse domains (scientific, medical, multilingual), formats (tables, code diffs), and languages without retraining?
Teacher choice and loss design: KL is computed on top-200 teacher logits from the same base model. What are the effects of (i) full-vocabulary or temperature-tuned distillation, (ii) different teachers or ensembles, (iii) auxiliary reconstruction losses (e.g., attention/output-state matching) on stability and quality?
Generation-aligned training: A small generation-aligned checkpoint is used only for LongBench v1. What systematic recipes best align compaction with open-ended generation quality across tasks, beyond MCQ-oriented KL?
Theoretical error control: There are no formal guarantees on attention-output deviation or error accumulation under recurrence. Can we derive layer-wise or trajectory-wise bounds that guide budget allocation and training?
Interpretability and information routing: What do compact keys/values represent across layers/heads, and how do they route information over time? Analyses analogous to Cartridges’ key/value roles for Still would inform design choices.
Hardware and kernelization: Compaction time is favorable, but practical deployment needs fused kernels for un-rotate/re-rotate, lookahead, and optional bias. What kernel/graph-level optimizations yield end-to-end speedups in real serving stacks (vLLM, PagedAttention)?
Interaction with quantization/pruning: How do weight/activation/KV quantization, sparsity, or KV sharding affect compaction accuracy and speed, and can compactor parameters themselves be quantized without loss?
Online/streaming without full prefill: Single-pass Still requires a full-prefix prefill before compaction. Can we train a streaming variant that compacts on-the-fly during prefill, eliminating the initial full-cache pass?
Budget accounting across heads/layers: t is shared per layer across heads. Would learning per-head budgets, head dropping, or head clustering improve high-compression regimes?
Adaptive chunking policies: Fixed chunk sizes (e.g., 1024/2048/4096) leave potential performance on the table. Can chunk sizes be learned or adapted to input statistics and task demands at inference time?
Cross-model transfer: Each base checkpoint needs its own compactor. Can we train family-shared compactors (e.g., via adapters or meta-learning) that generalize across scales/architectures with minimal tuning?
Safety and robustness: How does compaction affect refusal behavior, hallucination, and adversarial robustness (prompt injection, jailbreaks) in long-context settings?
Persistence and cross-session reuse: Can compact caches be stored, merged, and reused across sessions or documents (e.g., repository-scale memory) without drift or catastrophic interference?
Evaluation breadth and statistics: Some key results (iterative, HELMET) are single-seed. Larger-scale, multi-seed, cross-hardware evaluations and stronger human studies across more tasks would solidify claims about robustness and utility.
Benchmark parity and SOTA baselines: A full-matrix comparison against the latest query-agnostic synthesis methods (e.g., KVzip beyond partial checks) and recent memory architectures would clarify the frontier at extreme lengths/ratios.

Practical Applications

Overview

The paper introduces Still, a lightweight, per-layer Perceiver-based module that compresses an LLM’s key–value (KV) cache into a compact representation in a single forward pass. Trained once against a frozen base model (amortized synthesis), Still retains much of the utility of the full cache at 8x–200x compression, supports both single-pass and iterative (streaming) compaction, and transfers across model scales and attention architectures (e.g., Qwen3 family, Gemma-3 global-attention layers). This enables practical memory and latency reductions for long-context use while preserving accuracy better than selection-based methods in many regimes.

Below are actionable applications derived from these findings, grouped by deployment readiness.

Immediate Applications

These can be implemented now with open-weight models and existing serving stacks that expose KV-cache access/injection.

Memory-efficient long-document question answering and summarization
- Sectors: software, media/journalism, legal, finance, education
- Tools/workflows:
  - Single-pass compaction after “prefill” of a long document; reuse the compact cache to answer multiple questions or generate summaries at reduced memory/computation cost.
  - Document summarizer endpoints that compact the prefix before generation (supported by HELMET and LongBench results).
- Assumptions/dependencies:
  - Access to open-weight checkpoints (e.g., Qwen3, Gemma-3) and KV-cache injection in serving frameworks (e.g., vLLM/HF Transformers).
  - Per-checkpoint compactor training (paper uses ~1B long-context tokens for 8k; longer horizons require more).
  - Quality depends on matching compression and context length to trained regimes; extreme compression at very long lengths degrades utility.
Repository-scale code assistants under constrained memory
- Sectors: software/DevTools
- Tools/workflows:
  - Prefill codebase or large subsets, compact the cache, then answer developer queries or perform code review across large contexts with reduced GPU VRAM.
- Assumptions/dependencies:
  - Code LLM with accessible KV caches; repository ingestion pipeline to prefill in chunks and compact iteratively for very large repos.
  - Training compactor on code-heavy long-context data improves domain performance.
Multi-tenant LLM serving with lower memory footprint
- Sectors: cloud/ML platforms, enterprise IT
- Tools/workflows:
  - Integrate Still into inference servers to replace large prefix caches with compact caches, increasing concurrency and reducing cost.
  - Offer “memory-optimized” routes for long-context jobs (e.g., 32k–128k).
- Assumptions/dependencies:
  - Engineering to support RoPE un-rotation/re-rotation and KV replacement in the attention stack.
  - Monitoring to choose compression ratios that meet SLA targets without unacceptable quality loss.
Streaming/ongoing conversation and log analysis with iterative compaction
- Sectors: customer support, IT operations, security, data ops
- Tools/workflows:
  - Iterative chunked compaction with lookahead for long-running chats or logs: compact earlier chunks as new chunks arrive, maintaining a fixed compression ratio so that the cache grows at a reduced rate (not a constant size).
- Assumptions/dependencies:
  - Training horizon should match expected deployment lengths (paper shows 16k/32k-trained compactors generalize better to 128k than 8k-trained).
  - Pipelines must sequence chunk prefill and compaction as described (lookahead buffer).
Meeting and call transcript summarization at scale
- Sectors: enterprise productivity, sales/CS tools
- Tools/workflows:
  - Prefill transcript, compact, then generate action items or summaries; reuse compact cache for follow-up questions across multiple participants.
- Assumptions/dependencies:
  - Quality verified on summarization benchmarks; production needs fine-tuning or compactor training on meeting data for best performance.
Long-form reading assistants and study tools
- Sectors: education, consumer apps
- Tools/workflows:
  - On-device or cloud-assisted reading apps that prefill long materials (chapters, reports), compact, and then support rich Q&A and summarization within practical memory budgets.
- Assumptions/dependencies:
  - Open-weight models with acceptable on-device performance or hybrid local/cloud design.
  - Compactor training on educational content improves outcomes.
Legal and regulatory document triage
- Sectors: legal, gov/policy, compliance
- Tools/workflows:
  - Compact KV caches for filings, contracts, or cases; support triage Q&A and auto-generated briefs while retaining evidence better than token selection methods at tight budgets.
- Assumptions/dependencies:
  - Domain calibration and validation to ensure critical passages are preserved; audit trails for lossy compaction.
Finance and earnings-call analysis
- Sectors: finance
- Tools/workflows:
  - Prefill filings/transcripts, compact, and run downstream analytics, summarization, or compliance checks across large contexts cost-effectively.
- Assumptions/dependencies:
  - Risk controls for lossy compression; human-in-the-loop review for material disclosures.
Integrations with existing serving stacks and kernels
- Sectors: ML tooling/software
- Tools/workflows:
  - Drop-in “KV compaction plugin” per layer that runs after prefill (single-pass) or per chunk (iterative).
  - Optional per-token biases are supported by the design but disabled in reported results; production use requires attention-kernel support.
- Assumptions/dependencies:
  - Engineering effort to expose cache read/write, RoPE manipulation, and position offset accounting.
  - Benchmarks to select compression ratios and training seeds; observability for speed–quality trade-offs.
Research enablement for long-context experiments on modest hardware
- Sectors: academia, R&D
- Tools/workflows:
  - Use Still to run long-context RULER/QUALITY-style experiments with reduced hardware demands.
  - Compare amortized synthesis vs. selection methods under matched training/eval for new tasks.
- Assumptions/dependencies:
  - Availability of suitable long-context training data; reproducible evaluation harnesses.

Long-Term Applications

These require additional research, scaling, kernel support, or training at longer horizons before broad deployment.

Million-token contexts and beyond for long-horizon agents
- Sectors: software agents, robotics, operations automation
- Tools/workflows:
  - Extend iterative compaction training horizons (curriculum over length) to support multi-day or million-token trajectories without collapse.
- Assumptions/dependencies:
  - Significant training data/compute for extended horizons; improved recurrence designs (e.g., constant-budget, in-place slot reuse).
Constant-budget recurrent memory
- Sectors: agent frameworks, embedded/edge systems
- Tools/workflows:
  - Develop in-place slot reuse/merge policies to keep total cache O(1) rather than growing linearly with trajectory length.
- Assumptions/dependencies:
  - New algorithms and training objectives to avoid error accumulation under continual overwrites.
General-purpose compactors for diverse generation tasks
- Sectors: creative tools, code generation, open-ended assistants
- Tools/workflows:
  - Train compactors on broader generation objectives and datasets to improve beyond extractive MCQ/summarization.
- Assumptions/dependencies:
  - Task coverage in training data; evaluation on diverse open-ended benchmarks to verify utility.
Cross-model or family-level compactors
- Sectors: ML platforms, enterprises managing multiple models
- Tools/workflows:
  - Explore compactor sharing across closely related checkpoints (e.g., a family of RoPE-based models) to reduce per-model training overhead.
- Assumptions/dependencies:
  - Architectural compatibility; methods to adapt to differences in head geometry and attention stacks.
Native kernel and hardware acceleration for compaction
- Sectors: silicon vendors, systems
- Tools/workflows:
  - Fused ops for RoPE un-rotation/re-rotation, cross-attention with compact latents, and efficient cache injection; support for optional bias channels.
- Assumptions/dependencies:
  - Collaboration with framework and hardware communities; performance validation at scale.
Secure and compliant compaction for regulated domains
- Sectors: healthcare, finance, public sector
- Tools/workflows:
  - Auditable compaction pipelines (logging of compression steps, versioned compactors), risk assessments for evidence loss, and calibration tools.
- Assumptions/dependencies:
  - Domain-specific validation sets; policy guidance on acceptable lossy compression in regulated workflows.
Memory-aware RAG hybrids
- Sectors: enterprise search, knowledge management
- Tools/workflows:
  - Use compact caches as a mid-layer “working memory” alongside retrieval augmentations, enabling larger effective context at lower cost.
- Assumptions/dependencies:
  - Interfaces to orchestrate retrieval, compaction, and generation; training to align compaction with retrieval distributions.
On-device long-context assistants
- Sectors: consumer devices, automotive, IoT
- Tools/workflows:
  - Compress session history and documents for on-device LLMs with tight memory budgets, supporting private long-horizon interactions.
- Assumptions/dependencies:
  - Efficient kernels on mobile/edge hardware; compact base models with accessible caches.
Energy- and cost-aware scheduling in data centers
- Sectors: cloud/ops, sustainability policy
- Tools/workflows:
  - Job schedulers that choose compaction strategies dynamically based on energy/cost budgets and acceptable utility loss.
- Assumptions/dependencies:
  - Telemetry linking compression ratios to QoS; organizational policies for quality–cost trade-offs.
Bias and safety governance for lossy compaction
- Sectors: policy, ethics, risk management
- Tools/workflows:
  - Audits to detect whether compaction disproportionately removes minority-relevant or safety-critical details; guidelines for acceptable compression in safety-critical applications.
- Assumptions/dependencies:
  - Benchmark suites covering fairness/safety aspects under compaction; intervention strategies (e.g., sentinel tokens, bias heads) if needed.
Task-specific compactors (domain-adapted compaction)
- Sectors: legal, biomedical, scientific research
- Tools/workflows:
  - Train compactors on domain corpora (e.g., case law, EHR notes, scientific articles) to preserve domain-specific evidence under tight budgets.
- Assumptions/dependencies:
  - Access to high-quality, compliant domain datasets; rigorous evaluation and human oversight.

Cross-Cutting Dependencies and Assumptions

Base-model access: You need models with accessible KV caches and the ability to inject compact keys/values (open weights or cooperative APIs).
Positional encoding: The method assumes RoPE; adapting to other encodings may require modifications.
Per-model training: Each base checkpoint typically needs its own compactor trained on representative long-context data; performance degrades if deployment contexts exceed trained horizons.
Operational limits: Single-pass mode still requires a full prefill of the original prefix; iterative mode avoids this but is sensitive to training horizon and recurrence design.
Quality boundaries: Extreme compression at very long contexts, exact-retrieval/needle-in-a-haystack tasks, and open-ended generation remain challenging; careful validation is required.
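To make the memory stakes behind these dependencies concrete, the following back-of-the-envelope sketch estimates KV cache size from the number of layers, KV heads per layer, and head dimension. The configuration values are hypothetical and do not describe any specific Qwen or Gemma checkpoint; the sketch also assumes every layer is compacted, which would overstate savings for models where only global-attention layers are compacted.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes for keys plus values across all layers (the factor 2 is K and V)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def compact_tokens(tokens: int, ratio: int) -> int:
    """Number of compact entries at a given compression ratio, rounded up."""
    return -(-tokens // ratio)  # ceiling division


# Hypothetical configuration, chosen only to make the arithmetic concrete.
cfg = dict(layers=32, kv_heads=8, head_dim=128)
full = kv_cache_bytes(131072, **cfg)                       # 128k-token prefix
small = kv_cache_bytes(compact_tokens(131072, 8), **cfg)   # same prefix at 8x
print(full / 2**30, "GiB ->", small / 2**30, "GiB")
```

Under these assumed values with 16-bit elements, the full cache is 16 GiB and the 8x compact cache is 2 GiB.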


Glossary

AdamW: An Adam variant that applies weight decay directly to the weights, decoupled from the adaptive gradient update. "We train with AdamW [Loshchilov and Hutter, 2019] at learning rate $4 \times 10^{-5}$"
Amortized selection: Learning a reusable selection policy offline so that selection at inference is a single forward computation rather than per-context optimization. "Amortized selection appeared recently in KV-Distill [Chari et al., 2025],"
Amortized synthesis: Learning a reusable module that synthesizes compact representations (not just selecting subsets) in a single forward pass at inference. "We study the underexplored combination of amortized synthesis of layer-wise KV caches for frozen pretrained models."
Attention Matching: A per-context synthesis method that reconstructs compact caches by matching attention behavior via closed-form steps. "Attention Matching [Zweiger et al., 2026] uses repeat-prefill query extraction, top-k key selection by RMS score, least-squares value reconstruction, and nonnegative least-squares fitting."
Autoregressive transformer: A transformer that generates tokens sequentially, conditioning on previously generated tokens. "Let $f_\theta$ be a frozen autoregressive transformer with L layers, H KV-heads per layer, head dimension d, and rotary position embeddings [Su et al., 2023]."
Bias channel: An optional per-token additive bias in log-space, i.e. a term added to the attention logits of compact entries. "The implementation also supports the optional log-space bias channel used by Attention Matching [Zweiger et al., 2026],"
Compression ratio: The factor by which the KV cache is reduced (e.g., 8x means the cache is 1/8th its original size). "with compression ratios annotated along each method curve."
Cross-attention: An attention mechanism where a set of latent queries attends to another sequence (e.g., the KV cache) to extract or aggregate information. "a block applies cross-attention from the latents into X"
Feed-forward network (FFN): The position-wise multilayer perceptron sublayer used within transformer blocks. "refines through self-attention and feed-forward sub-layers,"
Forward KL divergence: The Kullback–Leibler divergence taken from the teacher's output distribution to the student's, KL(teacher ‖ student), minimized so that the student matches the teacher. "The training signal is forward KL divergence from a full-context teacher to the compact-cache student,"
Frozen base model: A pretrained model whose weights are kept fixed while auxiliary modules are trained around it. "trained once against a frozen base model"
Global attention: Attention layers with an unbounded receptive field over the entire prefix (as opposed to windowed attention). "for architectures with mixed sliding-window and global attention (e.g. Gemma-3), Still applies compaction only to global-attention layers."
H2O: A per-context selection baseline that retains tokens based on attention-derived importance. "Per-context selection: H2O [Zhang et al., 2023] retains tokens by cumulative attention scores from extracted reference queries;"
HELMET: A benchmark for evaluating long-context model performance, including summarization tasks. "On HELMET [Yen et al., 2025] multi_lexsum [Shen et al., 2022], Still preserves most of the full-context summarization gain"
Iterative chunked compaction: A recurrent scheme that incrementally compacts incoming context chunks at a fixed local compression ratio during long-horizon inference. "Figure 1 evaluates iterative chunked compaction (§2.3) on Long-MCQ,"
KV cache: The stored keys and values from transformer attention layers, kept so that generation does not recompute attention inputs for the prefix at every step. "The KV cache is the memory bottleneck of long-horizon LLM deployment."
KV-Distill: An amortized selection method that learns a token-scoring policy and uses LoRA to route retained tokens through trainable projections. "KV-Distill [Chari et al., 2025] uses a learned token scorer with top-k retention, rank-128 LoRA adaptors on Q, K, V, O, and forced sink tokens with an uncompressed question/answer path."
Latent queries: Learnable vectors that query the full KV cache via cross-attention to produce compact representations. "each maintains a bank of learned latent queries that cross-attends the full KV cache"
LoRA: Low-rank adaptation method that injects trainable low-rank matrices into pretrained layers to enable efficient fine-tuning. "LoRA [Hu et al., 2022] adaptors"
Lookahead: A buffering strategy that keeps a raw chunk of KV cache uncompressed ahead of the current compaction point to reduce error accumulation. "We refer to this buffering mechanism as lookahead."
LongBench v1: A benchmark for open-ended long-context tasks such as summarization. "LongBench v1 [Bai et al., 2024a] GovReport and QMSum, Still also wins 60% of pairwise judge comparisons"
LongBench v2: A benchmark variant focused on multiple-choice long-context understanding. "We include LongBench v2 [Bai et al., 2024b] in Table 9"
Mixture-of-Experts (MoE): An architecture where multiple expert subnetworks are selectively activated per input token. "the Qwen3-30B-A3B MoE"
Nonnegative least-squares: A constrained regression technique where coefficients are restricted to be nonnegative, used here for bias fitting in attention-matching. "and nonnegative least-squares fitting of per-token biases."
Perceiver: An architecture that uses learnable latent arrays to attend to high-dimensional inputs, serving as the compactor’s backbone. "We introduce Still, a small Perceiver-based [Jaegle et al., 2021, Alayrac et al., 2022] compactor"
Per-context selection: Methods that decide at inference time which original tokens to retain in the compact cache. "Per-context selection [Zhang et al., 2023, Li et al., 2024, Cai et al., 2025, Kim et al., 2025] scores tokens on the fly and evicts the rest,"
Per-context synthesis: Methods that construct new compact entries (not just select tokens) via per-context optimization during inference. "Per-context synthesis [Eyuboglu et al., 2025, Zweiger et al., 2026] lifts this ceiling but requires optimization for each context."
Position-free frame: A representation in which RoPE rotation is removed from keys to separate content from positional phase during compaction. "Still therefore operates in a position-free frame:"
Prefill: The initial forward pass over the prompt/prefix to populate the KV cache before continuation. "which produces a subset of original tokens in a single prefill via a learned scorer plus LoRA [Hu et al., 2022] adaptors."
Prefix-tuning: A method that optimizes virtual tokens prepended to inputs to steer model behavior, used to train compact caches like Cartridges. "trains it via prefix tuning [Li and Liang, 2021]"
QK-norm: L2 normalization applied to queries and keys prior to dot-product attention to stabilize training. "Queries and keys are L2-normalized after the linear projection but before the dot product [QK-norm; Dehghani et al., 2023]."
RMSNorm: A normalization method that rescales activations by their root mean square, omitting the mean-centering step of LayerNorm. "norms are RMSNorm [Zhang and Sennrich, 2019]"
RoPE (Rotary Position Embeddings): A positional encoding scheme that rotates key/query vectors to encode relative positions. "rotary position embeddings [Su et al., 2023]"
RULER: A benchmark grid for evaluating long-context capabilities across contexts and cache sizes. "On the long-context RULER grid, Still exceeds the strongest baseline by 8-22 points."
Sink tokens: Initial tokens kept in the cache because they absorb a disproportionate share of attention mass; retaining them stabilizes streaming or long-context decoding. "StreamingLLM [Xiao et al., 2024] keeps 4 attention-sink tokens plus the most recent K - 4 tokens."
Sliding-window attention: Attention restricted to a fixed-size window over recent tokens rather than the entire prefix. "mixed sliding-window/global attention"
StreamingLLM: A streaming inference baseline that maintains a small set of sink and recent tokens in the cache. "StreamingLLM [Xiao et al., 2024] keeps 4 attention-sink tokens plus the most recent K - 4 tokens."
Teacher–student distillation: Training a student (compact-cache) model to match a teacher (full-context) model’s output distribution. "teacher and student are the same frozen base model differing only in whether the prefix cache is the full uncompressed cache or Still's compact cache."
Top-k: Selecting the k highest-scoring items (e.g., tokens or keys) according to a criterion during selection or fitting. "top-k key selection by RMS score,"
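
The Forward KL divergence and Teacher–student distillation entries above can be illustrated with a small worked example. This is a minimal sketch for a single next-token distribution, assuming plain Python lists of logits; the function names are illustrative, and the paper's actual training code is not reproduced here.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def forward_kl(teacher_logits: Sequence[float],
               student_logits: Sequence[float]) -> float:
    """KL(teacher || student): expectation under the teacher of log p - log q."""
    log_p = log_softmax(teacher_logits)   # full-context teacher
    log_q = log_softmax(student_logits)   # compact-cache student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Identical logits give zero divergence; mismatched logits give a positive value.
same = forward_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = forward_kl([2.0, 0.0, 0.0], [0.0, 0.0, 2.0])
print(same, diff)
```

Because the expectation is taken under the teacher, the student is penalized most where the teacher places probability mass that the student fails to cover.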
