
Query-Agnostic Compression and Reuse Algorithms

Updated 23 February 2026
  • Query-agnostic compression algorithms are techniques that reduce memory by compressing contextual data without relying on future query information.
  • They employ methods such as context reconstruction, leverage score estimation, and autoencoders to achieve significant memory reduction while maintaining performance.
  • Empirical studies demonstrate substantial speedups and memory savings, with approaches like KVzip achieving up to 70% cache eviction and Compactor reducing memory usage by 63%.

Query-agnostic compression and reuse algorithms are a class of memory reduction techniques that enable efficient storage and utilization of contextual data, such as prompts or cached hidden states, without relying on knowledge of downstream queries. In contrast to query-aware methods that require access to the user’s target question or task at compression time, query-agnostic approaches determine what to compress or evict solely from the structure and content of the context itself. This paradigm is fundamental for high-throughput or multi-turn settings where compressed representations must be reused across diverse queries, batch inference, retrieval, or streaming modalities. Recent advances span transformer KV-cache compression, prompt/token selection, and sample-compression schemes, with rigorous empirical and theoretical characterizations across domains including language, code, and vision.

1. Core Principles and Motivations

Modern large language and multimodal models, especially in transformer architectures, cache key–value (KV) pairs or extended prompts to facilitate autoregressive inference over long contexts. As context lengths increase to hundreds of thousands of tokens, the naive KV cache often dwarfs model parameters in memory footprint, impeding throughput, batch size, or even feasibility on commodity accelerators (Kim et al., 29 May 2025, Roy et al., 7 Dec 2025, Chari et al., 10 Jul 2025, Yang et al., 21 Aug 2025). Traditional query-aware compression methods (e.g., SnapKV, PyramidKV, H₂O) score cached context using attention statistics derived from a known query; they perform well only when the query is fixed but fail dramatically when the compressed cache is reused across queries (Kim et al., 29 May 2025). The necessity for amortized compression—where a context or prompt is compressed once and safely reused for arbitrary (possibly unseen) queries—defines the query-agnostic regime.

Key objectives of query-agnostic methods include:

  • Query-independence: Compression decisions do not leverage future queries; outputs are stable regardless of which task is asked post-compression.
  • Reusability: Compressed representations (e.g., caches/prefixes/prompts) can be shared across many queries, critical for multi-turn, RAG, batched serving, or streaming setups (Chari et al., 10 Jul 2025, Yang et al., 21 Aug 2025, Liskavets et al., 19 Feb 2025, Pan et al., 2024).
  • Minimal performance loss: Compression should maintain accuracy, faithfulness, and generalization within tight tolerances relative to the uncompressed baseline.

2. Algorithms for Query-Agnostic Compression

2.1 Context Reconstruction–Based Approaches

KVzip (Kim et al., 29 May 2025) employs a context reconstruction simulation: a repeat prompt is appended to the cached context and a forward pass is issued to recover the original tokens. The maximal attention received by each KV slot during context regeneration is used as a proxy for importance; the top-ranked slots are retained across all heads/layers, and the rest are evicted. This “one-shot” process creates a compressed cache supporting all future queries with negligible (<1%) loss across QA, reasoning, and code tasks. The method requires no query awareness and supports efficient reuse, with empirical results showing up to 70% KV eviction (3–4× compression), 2× speedup in FlashAttention latency, and robustness to context lengths up to 170K tokens.
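The eviction step can be sketched as follows, assuming the attention weights recorded while the model replays the context are already available; the array shapes and the `keep_ratio` parameter are illustrative, not KVzip's actual interface:

```python
import numpy as np

def kvzip_eviction_mask(attn, keep_ratio=0.3):
    """Score each cached KV slot by the maximum attention it receives while
    the model regenerates ("repeats") the original context, then keep only
    the top-scoring slots. A hedged sketch of the KVzip idea, not the
    authors' implementation. `attn` has shape
    (num_heads, num_repeat_steps, num_cached_slots)."""
    # Importance proxy: the max attention a slot ever receives during reconstruction.
    scores = attn.max(axis=(0, 1))             # (num_cached_slots,)
    k = max(1, int(keep_ratio * scores.size))  # number of slots to retain
    keep = np.zeros(scores.size, dtype=bool)
    keep[np.argsort(scores)[-k:]] = True       # retain the top-k slots
    return keep

rng = np.random.default_rng(0)
attn = rng.random((8, 16, 100))  # 8 heads, 16 repeat steps, 100 cached slots
mask = kvzip_eviction_mask(attn, keep_ratio=0.3)
print(int(mask.sum()))  # 30 slots retained -> 70% eviction
```

Because the scores come from reconstructing the context itself rather than from any user query, the resulting mask can be computed once and reused for all downstream questions.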

2.2 Leverage Score– and Geometry–Driven Methods

Compactor (Chari et al., 10 Jul 2025) introduces a parameter-free, query-agnostic compression framework using approximate leverage scores—a geometric measure of matrix row (token) importance. By computing a right random projection (sketch) of the key matrix, Compactor efficiently estimates token leverage, blending it with non-causal (queryless) self-attention scores to form an overall importance metric. Tokens with the highest combined score are preserved. Compactor further integrates a context-calibrated model to predict allowable retention ratios per context via NLL fitting, yielding adaptive, performance-guaranteed compression. Compactor consistently outperforms query-aware and random baselines, achieving up to 63% memory reduction with no performance degradation at typical retention rates.
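The leverage-score estimate can be illustrated with a minimal numpy sketch: apply a right Gaussian random projection to the key matrix, then read approximate leverage off an orthonormal basis of the sketched matrix. The sketch choice and function names here are assumptions, not Compactor's implementation:

```python
import numpy as np

def approx_leverage_scores(K, sketch_dim=32, seed=0):
    """Approximate row (token) leverage scores of the key matrix K
    (n_tokens x d_head). Leverage measures how geometrically distinctive
    each row is; rows of an orthonormal basis Q of the sketched matrix
    give it directly as squared row norms."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((K.shape[1], sketch_dim)) / np.sqrt(sketch_dim)
    Q, _ = np.linalg.qr(K @ G)      # orthonormal basis of the sketched keys
    return (Q ** 2).sum(axis=1)     # approximate leverage per token

K = np.random.default_rng(1).standard_normal((200, 64))   # 200 cached tokens
lev = approx_leverage_scores(K)
keep = np.argsort(lev)[-int(0.37 * len(lev)):]            # retain ~37% of tokens
print(lev.shape, len(keep))
```

In Compactor this geometric score is blended with non-causal self-attention before ranking; the retention ratio itself is chosen per context by the calibrated NLL model rather than fixed as above.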

2.3 Autoencoder and Reuse Mechanisms

KV-CAR (Roy et al., 7 Dec 2025) achieves compression by learning lightweight, layerwise autoencoders to reduce the dimensionality of KV tensors before cache storage, and reconstruct them on retrieval. Query independence is preserved since the autoencoder is trained on context data alone. A complementary reuse strategy examines headwise similarity (using L₁ or cosine distance) between adjacent layers; highly similar heads are deduplicated, referencing previously stored entries. This combination delivers up to 47.85% memory reduction with minimal perplexity or accuracy loss across GPT-2 and TinyLLaMA, again independent of the query.
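As a stand-in for KV-CAR's learned layerwise autoencoders, the optimal *linear* autoencoder (equivalent to PCA) can be fit in closed form; this is only a hedged illustration of the compress-on-store, reconstruct-on-retrieval idea, with hypothetical names:

```python
import numpy as np

def fit_linear_autoencoder(KV, latent_dim):
    """Fit a linear autoencoder mapping KV rows (n_entries x d) down to
    latent_dim and back. Training is query-agnostic: it sees only cached
    context tensors, never a query."""
    mu = KV.mean(axis=0)
    _, _, Vt = np.linalg.svd(KV - mu, full_matrices=False)
    W = Vt[:latent_dim].T                   # d x latent_dim projection
    encode = lambda X: (X - mu) @ W         # store compressed latents
    decode = lambda Z: Z @ W.T + mu         # reconstruct on cache retrieval
    return encode, decode

rng = np.random.default_rng(0)
KV = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 64))  # low-rank KV
enc, dec = fit_linear_autoencoder(KV, latent_dim=16)
err = np.linalg.norm(KV - dec(enc(KV))) / np.linalg.norm(KV)
print(err < 1e-6)  # True: 64-dim entries stored as 16-dim latents, losslessly here
```

Real KV tensors are only approximately low-rank, so KV-CAR trains nonlinear autoencoders and accepts a small reconstruction error in exchange for the memory reduction.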

2.4 Attention-Weighted and Proxy-Token Methods

StreamMem (Yang et al., 21 Aug 2025), designed for streaming multimodal video, maintains a fixed-size visual KV cache by calculating saliency from generic template queries (proxy tokens) to incoming frame/token representations. Top-scoring tokens (by cross-attention) are preserved, with periodic merging of similar frames and prototyping to enforce temporal representation. All compression proceeds in a query-agnostic, streaming fashion; at question time, arbitrary textual queries are executed directly against the compressed visual cache, achieving 8–10× memory savings and state-of-the-art performance in several streaming QA benchmarks.
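The proxy-token selection step can be sketched as cross-attention from a fixed set of generic template queries to incoming frame tokens; this is a hypothetical illustration of the StreamMem-style mechanism, and the frame-merging step is omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_tokens(frame_keys, proxy_queries, budget):
    """Score visual tokens by cross-attention from generic ("proxy")
    queries and keep the top `budget` tokens, so the cache stays a fixed
    size regardless of stream length."""
    # (n_proxy, d) @ (d, n_tokens) -> attention over tokens per proxy query
    attn = softmax(proxy_queries @ frame_keys.T / np.sqrt(frame_keys.shape[1]))
    saliency = attn.max(axis=0)                     # best score from any proxy
    return np.sort(np.argsort(saliency)[-budget:])  # indices of retained tokens

rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))   # tokens from one incoming frame
proxies = rng.standard_normal((4, 64))  # generic, query-agnostic templates
kept = select_tokens(keys, proxies, budget=32)
print(kept.shape)  # (32,)
```

At question time, the real textual query attends directly to the tokens retained this way; no recompression is needed per question.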

2.5 Task-Agnostic Prompt Compression

LLMLingua-2 (Pan et al., 2024) and Task-agnostic Prompt Compression (TPC) (Liskavets et al., 19 Feb 2025) address textual prompt reduction through extractive token classification or learned sentence-relevance metrics, respectively. LLMLingua-2 uses a distillation approach: a supervised labeler is trained from LLM-generated compressed targets, casting compression as a token-preservation binary classification. TPC incorporates an autoregressive task descriptor and context-aware sentence embeddings, synthesizing a synthetic task from the prompt and matching sentences via embedding similarity. Both frameworks achieve 2×–15× reductions with robust downstream task generalization, and are entirely query-agnostic, supporting delayed or batch querying.
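Token-preservation as binary classification can be sketched in a few lines: each token gets a keep-probability and the prompt is compressed by retaining tokens above a rate-controlled threshold. The scoring function below is a placeholder (word length as "informativeness"); in LLMLingua-2 a trained encoder classifier supplies the probabilities:

```python
def compress_prompt(tokens, keep_probs, rate=0.5):
    """Extractive compression: keep the top `rate` fraction of tokens by
    keep-probability, preserving their original order."""
    k = max(1, int(rate * len(tokens)))
    threshold = sorted(keep_probs, reverse=True)[k - 1]
    return [t for t, p in zip(tokens, keep_probs) if p >= threshold][:k]

tokens = "the quarterly revenue grew by twelve percent year over year".split()
probs = [len(t) / 10 for t in tokens]  # placeholder scores, not a real model
print(compress_prompt(tokens, probs, rate=0.5))
# ['quarterly', 'revenue', 'grew', 'twelve', 'percent']
```

Because scoring never looks at a downstream question, the compressed prompt can be cached and reused across arbitrary later queries, which is what makes the scheme query-agnostic.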

3. Formulations and Theoretical Foundations

Query-agnostic compression can be formalized for a data class $\mathcal{F}$ with loss $L$ as determining a compression map $\kappa(S)$ of a sample $S$ and a reconstruction $\rho(\kappa(S))$ such that

  • $\kappa$ depends only on the input (not the query or downstream task);
  • $L(\rho(\kappa(S)), S) \leq \inf_{f \in \mathcal{F}} L(f, S) + \alpha$ (approximate optimality).

In the regression case, Attias et al. (Attias et al., 2018) provide boost-and-sparsify schemes for real-valued classes under $\ell_p$ loss, yielding $\alpha$-approximate agnostic compression schemes whose size is determined by the fat-shattering dimension, and exact schemes of size $O(d)$ for linear regression under $\ell_1$ and $\ell_\infty$. No bounded-size exact scheme is possible for $1 < p < \infty$.

For transformer-based models, metric computation (e.g., attention or leverage scores) must be feasible without knowledge of $Q$ (the query matrix), using only cached $K, V$ pairs or proxy queries. Context calibration, as in Compactor (Chari et al., 10 Jul 2025), fits a function to model expected NLL increases under compression, automatically sizing the compressed cache per context.
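One way to score tokens without $Q$ is to let the keys act as their own queries (non-causal key-to-key attention). This is a hedged illustration of queryless scoring in the spirit of Compactor's non-causal self-attention term; the scaling and normalization choices are assumptions:

```python
import numpy as np

def queryless_attention_scores(K):
    """Importance from keys alone: K attends to itself, so no future query
    matrix Q is needed. Returns how much attention each token attracts,
    averaged over all (key-as-query) rows; scores sum to 1."""
    logits = K @ K.T / np.sqrt(K.shape[1])
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)  # rows are softmax distributions
    return attn.mean(axis=0)

K = np.random.default_rng(0).standard_normal((50, 32))  # 50 cached keys
s = queryless_attention_scores(K)
print(s.shape)  # (50,)
```

Scores of this kind can then be thresholded or blended with geometric measures such as leverage before deciding which cache entries to evict.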

4. Empirical Performance and Comparative Analysis

Recent benchmarks across text and multimodal domains establish the advantage of query-agnostic methods in scenarios with multi-query and large-context reuse. Notably:

| Method | Domain | Max compression | Perf. drop | Speedup | Query-aware? | Reuse support |
|---|---|---|---|---|---|---|
| KVzip | LLM KV | 3–4× | <1% | 2× | No | Yes |
| Compactor | LLM KV | 63% mem. reduction | None (full or 93% retained) | n/a | No | Yes |
| KV-CAR | LLM KV | 47.85% | <0.02 perplexity | Up to 4× larger batch | No | Yes |
| StreamMem | Vision LLM | 8–10× | None or slight | n/a | No | Yes |
| LLMLingua-2 | Text prompt | Up to 15× | Minimal or none | 2–3× | No | Yes |
| TPC | Text prompt | 5–6× | SOTA | n/a | No | Yes |

Qualitatively, query-agnostic approaches such as KVzip and Compactor substantially outperform query-aware baselines (SnapKV, PyramidKV, H₂O) when compressed caches are reused on unseen or multi-turn queries (Kim et al., 29 May 2025, Chari et al., 10 Jul 2025). Baselines tuned for one query often collapse (e.g., SQuAD accuracy drops from 93% to 35% for SnapKV when the cache is reused (Kim et al., 29 May 2025)). Context reconstruction and leverage-based algorithms maintain >95% of original performance even under aggressive compression and repeated reuse.

5. Applications, Integrations, and Limitations

Applications

  • Prompt caching in multi-tenant LLM deployment: High-throughput inference where compressed prefixes must be shared across diverse batched queries, as in vLLM, SGLang, or large RAG systems (Chari et al., 10 Jul 2025).
  • Streaming video/vision: Real-time, memory-constrained understanding of long input streams, with proxy attention-based cache management (Yang et al., 21 Aug 2025).
  • Batch retrieval, multi-turn conversational QA: Cases where a prompt or document is compressed once for anticipated or unknown future queries, maximizing amortized computational savings.
  • Downstream prompt reuse in textual domains: LLMLingua-2 and TPC enable extractive, LLM-distilled prompt compression for RAG/ICL pipelines, eliminating the need for handcrafted templates or query-aware pre-filtering.

Limitations

  • Overhead for short contexts: The computational advantage of SVD or attention replay is amortized only for long sequences (e.g., >16K tokens) (Chari et al., 10 Jul 2025).
  • Proxy limitations: Attention or saliency computed via template or synthetic queries may miss ultra-fine details required by adversarial or highly specialized tasks (Yang et al., 21 Aug 2025).
  • Empirical, not formal, guarantees: For most transformer-based approaches (KVzip, Compactor, KV-CAR), no tight theoretical bounds exist on maximal information retention; theory is limited to special cases, e.g., regression sample compression (Attias et al., 2018).
  • Domain shift and template tuning: Effectiveness may degrade under domain shift if the proxy queries or synthetic task descriptors are out-of-distribution (Liskavets et al., 19 Feb 2025, Yang et al., 21 Aug 2025).

6. Future Directions and Open Questions

  • Parameter-free and hardware-optimized implementations: Development of optimized SVD/sketch kernels to enable efficient plug-in to high-throughput serving.
  • Learned calibration and dynamic policy: Neural or online adaptation for context-specific or user-specified quality-mem tradeoffs (Chari et al., 10 Jul 2025).
  • Extending to multimodal and RAG contexts: Unified cache compression across vision, audio, and text via generalized queryless scoring (Yang et al., 21 Aug 2025).
  • Theoretical characterization and lower bounds: Closing the gap between empirical and formal guarantees, especially relating to information-theoretic minimality and combinatorial dimensions (Attias et al., 2018).
  • Hierarchical and adaptive reuse schemes: Implementation of event-level or segment-based caches employing adaptive budgeting, possibly using learned proxies (Yang et al., 21 Aug 2025).
  • Fine-grained compression for downstream specialization: Integration of coarse prompt/document compression with subsequent adaptive per-query tuning in settings with extreme context or heterogeneous queries (Pan et al., 2024).

References:

  • KVzip: Kim et al., 29 May 2025
  • Compactor: Chari et al., 10 Jul 2025
  • KV-CAR: Roy et al., 7 Dec 2025
  • LLMLingua-2: Pan et al., 2024
  • Task-agnostic Prompt Compression (TPC): Liskavets et al., 19 Feb 2025
  • StreamMem: Yang et al., 21 Aug 2025
  • Agnostic compression for regression: Attias et al., 2018
