
Q-Filters: Cache Compression and Quantum Applications

Updated 26 November 2025
  • Q-Filters are projection-based methods that efficiently compress key-value caches in autoregressive transformers, reducing memory usage by up to 32x.
  • They operate by extracting a dominant singular direction per attention head via SVD on calibration queries, then ranking KV pairs by their geometric projection onto it, requiring minimal compute and no retraining.
  • Beyond language models, Q-Filters are used in quantum error mitigation and control theory, highlighting their versatile role in filtering and state selection.

"Q-filter" is a term that appears, either as an explicit construction or as shorthand, in several advanced domains across quantum information, classical control, signal processing, and modern machine learning. In the most recent technical literature, "Q-filter" most notably refers to a training-free, projection-based key-value (KV) cache compression method for LLMs, designed to efficiently reduce memory usage during inference with negligible loss in generation or retrieval performance (Godey et al., 4 Mar 2025). In other contexts, Q-filters serve as quantum error-mitigation and correction devices (Das et al., 29 Jul 2024), as state-preparation or eigenstate-filtering primitives in quantum algorithms (Karacan et al., 26 Mar 2025; Lee et al., 5 Oct 2025; Sakuma et al., 2 Jul 2025), or as special constructions in safe-control or filter-theoretic frameworks. This article focuses primarily on the Q-Filters deployed in attention-based LLMs, while referencing technically parallel roles in other domains.

1. Motivation: KV Cache Compression in LLMs

Modern autoregressive transformers utilize a growing KV cache to avoid recomputation of hidden-state projections during token generation. For context length $L$ and head dimension $d_H$, the per-head cache per sequence grows as $O(L\,d_H)$ for each of $H$ heads, and practical deployments see hundreds of thousands of tokens or more. Such unbounded cache growth creates hardware bottlenecks:

  • GPU high-bandwidth memory (HBM) is rapidly exhausted,
  • CPU↔GPU shuttling adds latency/overhead,
  • Commodity platforms cannot scale to typical long-context windows.
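To make the scale concrete, here is a back-of-the-envelope cache-size calculation; the model shape is an illustrative assumption (a Llama-3.1-8B-like configuration with grouped-query attention), not a figure from the paper:

```python
# Rough KV-cache size for one sequence:
# 2 tensors (K and V) x layers x KV heads x d_H x context length x bytes/element.
# Model shape is an illustrative assumption (Llama-3.1-8B-like, GQA, 8 KV heads).
n_layers, n_kv_heads, d_head = 32, 8, 128
bytes_per_elem = 2  # fp16/bf16

for L in (8_192, 131_072):
    cache_bytes = 2 * n_layers * n_kv_heads * d_head * L * bytes_per_elem
    print(f"L={L:>7,}: {cache_bytes / 2**30:5.1f} GiB per sequence")
```

At 131,072 tokens this already costs 16 GiB per sequence in bf16, before batching, which is why the bottlenecks above bite in practice.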

KV compression methods seek to selectively drop or merge past KV pairs while minimizing output degradation. Traditional heuristics, such as dropping keys by magnitude or using attention-weight proxies, can either be computationally expensive (requiring access to attention matrices) or sub-optimal at high compression.

Q-Filters provide an efficient, projection-based alternative that achieves strong empirical performance and remains compatible with FlashAttention and other kernels that do not expose attention weights (Godey et al., 4 Mar 2025).

2. Geometric Principle: Query-Key Drift and Dominant Projection

The crucial empirical observation underlying Q-Filters is that, within each attention head $h$, the distributions of queries ($Q^h_i$) and keys ($K^h_j$) drift along a single shared dominant direction $u^h \in \mathbb{S}^{d_H-1}$. More precisely:

Observation 1: There exists $u^h$ (up to sign) such that

$$\mathbb{E}_i \langle Q^h_i, u^h \rangle > 0, \qquad \mathbb{E}_j \langle K^h_j, \epsilon u^h \rangle > 0$$

for some $\epsilon \in \{\pm 1\}$.

Observation 2: In an orthonormal basis $\{u^h, u^h_2, \dots, u^h_{d_H}\}$, only the component along $u^h$ has nonzero mean over the distributions of queries and keys.

This leads to the approximation $\mathbb{E}_i \langle Q^h_i, K^h_j \rangle \approx \kappa^h \langle K^h_j, u^h \rangle$ for some positive $\kappa^h$. In practice $\epsilon = -1$ is common, so keys with large negative projections on $u^h$ are unlikely to be relevant for future queries.
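This approximation is easy to check numerically. The sketch below is synthetic (the dimension, drift magnitudes, and noise model are assumptions for illustration): queries and keys are drawn as isotropic noise plus opposite drifts along a shared direction, and the mean logit $\mathbb{E}_i \langle Q_i, K_j \rangle$ is compared against the key projection $\langle K_j, u \rangle$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
u = rng.normal(size=d)
u /= np.linalg.norm(u)  # shared dominant direction u^h

# Queries drift positively along u, keys negatively (epsilon = -1), plus noise.
Q = 3.0 * u + rng.normal(size=(1000, d))
K = -2.0 * u + rng.normal(size=(500, d))

mean_logits = K @ Q.mean(axis=0)  # E_i <Q_i, K_j> for each key j
key_proj = K @ u                  # <K_j, u^h>

# The two quantities are almost perfectly correlated with positive slope
# (kappa > 0), so the most negative projections flag the least relevant keys.
print(np.corrcoef(mean_logits, key_proj)[0, 1])  # ~0.99
```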

3. Algorithm: SVD-based Context-Agnostic Projection and Score Filtering

The Q-Filter construction is as follows (Godey et al., 4 Mar 2025); a minimal code sketch follows the steps below:

  1. Calibration Step: For each attention head $h$, extract a modest batch (e.g., $n \sim 10^3$) of queries $\mathcal{Q}^h = [Q^h_1; \dots; Q^h_n] \in \mathbb{R}^{n \times d_H}$. Compute the right singular vectors via SVD:

$$\mathcal{Q}^h = U \Sigma V^\top, \quad V = [v^h_1, v^h_2, \dots, v^h_{d_H}]$$

Set the Q-filter vector $v^h \equiv \pm v^h_1$, with the sign chosen so that $\langle v^h, \mathbb{E}_i[Q^h_i] \rangle > 0$, i.e., so that $v^h$ estimates $u^h$ and keys scoring low under $v^h$ are those with low expected attention.

  2. At Run-Time: For each new key $K^h_t$ in head $h$, compute the scalar score

$$s^h_t = \langle K^h_t, v^h \rangle$$

and insert the tuple $(K^h_t, V^h_t)$ into the cache with this priority. When the cache exceeds a fixed budget, evict the lowest-priority pairs.

Crucially, this approach requires only one dot product per head per token, does not materialize attention maps, and introduces negligible compute overhead.
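A minimal NumPy sketch of both steps for a single head follows; it assumes calibration queries have already been collected and is an illustrative reconstruction, not the authors' reference implementation:

```python
import numpy as np

def calibrate_q_filter(queries: np.ndarray) -> np.ndarray:
    """Calibration: compute the Q-filter v^h from queries of shape (n, d_H)."""
    # Rows of Vt are the right singular vectors v_1, ..., v_{d_H}.
    _, _, Vt = np.linalg.svd(queries, full_matrices=False)
    v = Vt[0]
    # Resolve the sign ambiguity so the mean query projects positively on v.
    if v @ queries.mean(axis=0) < 0:
        v = -v
    return v

def key_score(key: np.ndarray, v: np.ndarray) -> float:
    """Run-time priority s^h_t = <K^h_t, v^h>: one dot product per new key."""
    return float(key @ v)
```

Calibration runs once offline per head; at serving time only `key_score` sits on the hot path.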

4. Architecture Compatibility, Computational Scaling, and Storage Overhead

Q-Filters interact with the KV cache only through key projections and never need the attention weights $A^h = Q^h (K^h)^\top$, making them compatible with FlashAttention and other kernels that hide or fuse the QK-attention computation. The computational cost per token is $O(H\,d_H)$, identical to the $K$-norm heuristic, and the additional storage for all $v^h$ vectors is $H\,d_H$ floats (tens of thousands at most), negligible compared to model size.

This scheme enables cache reductions by factors of up to $32\times$ (i.e., storing $L/32$ keys per head), even at high model and context scales.
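Continuing the illustrative arithmetic from Section 1 (same assumed model shape), a $32\times$ budget shrinks the per-sequence cache accordingly:

```python
# Same illustrative Llama-3.1-8B-like shape as the Section 1 sketch.
n_layers, n_kv_heads, d_head, bytes_per_elem = 32, 8, 128, 2
L, compression = 131_072, 32

full_bytes = 2 * n_layers * n_kv_heads * d_head * L * bytes_per_elem
kept_bytes = full_bytes // compression  # budget of L/32 = 4,096 entries per head
print(f"{full_bytes / 2**30:.1f} GiB -> {kept_bytes / 2**30:.2f} GiB")  # 16.0 -> 0.50
```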

5. Comparative Empirical Performance

Empirical results for Q-Filters demonstrate robust retention of model performance in both information-retrieval and language modeling tasks at high compression levels:

| Task/Benchmark | Compression | Q-Filters | $K$-Norm | SnapKV | Streaming-LLM |
|---|---|---|---|---|---|
| Needle-in-a-Haystack retrieval | $\times 32$ | 99% accuracy | ~63% | ~99% | n/a |
| Streaming generation (Llama-3.1 8B, 70B) | 512-entry cache | −65% $\Delta$PPL vs. Streaming-LLM | baseline | −55% | baseline |
| RULER long-context benchmark | $\times 32$ | <2% loss vs. uncompressed | baseline | similar | baseline |

Notably, Q-Filters consistently outperform efficient schemes like Streaming-LLM in generation settings and match leading alternatives like SnapKV for retrieval, while retaining compatibility with high-speed attention implementations (Godey et al., 4 Mar 2025).

6. Implementation and Usage Workflow

For each attention head $h$ in a deployed model:

  1. Offline: Precompute $v^h$ using a single SVD calibration run.
  2. Serving Loop:
    • For each token $t$:
      • Compute $(K^h_t, V^h_t)$,
      • Calculate $s^h_t = \langle K^h_t, v^h \rangle$,
      • Insert into the cache with priority $s^h_t$,
      • If over budget, evict the lowest-scoring KV pairs.

The method admits flexible compression: one may keep a fixed quota of top-$k$ entries or drop a fixed fraction per step. The only hyperparameter is the cache size or the fraction of keys retained.
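One concrete way to realize the fixed-budget variant is a per-head min-heap keyed on $s^h_t$; the class below is a hypothetical helper written for clarity, not the paper's implementation (real serving code would score and evict in batches on the accelerator):

```python
import heapq

class QFilterCache:
    """Fixed-budget KV cache for one head: evicts the lowest-scoring pairs."""

    def __init__(self, budget: int):
        self.budget = budget
        self.heap = []  # min-heap of (score, step, key, value)
        self.step = 0   # unique tie-breaker so keys/values are never compared

    def insert(self, score: float, key, value) -> None:
        item = (score, self.step, key, value)
        self.step += 1
        if len(self.heap) < self.budget:
            heapq.heappush(self.heap, item)
        elif score > self.heap[0][0]:
            # Over budget: replace the current lowest-priority pair.
            heapq.heapreplace(self.heap, item)
        # Otherwise the new pair scores below everything retained and is dropped.

    def contents(self):
        return [(k, v) for _, _, k, v in self.heap]
```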

Q-Filters are fully training-free, require no model modification, and can be retrofitted to any standard transformer. They introduce only minor per-step computational and code overhead, independent of sequence length.

7. Q-Filters in Other Domains

The Q-filter terminology also arises in:

  • Quantum Error Mitigation: “Quantum filters” can denote commutation-derived superchannels that probabilistically suppress or convert noise types in quantum circuits, enabling error purification and deterministic correction without syndrome-based QEC (Das et al., 29 Jul 2024).
  • Quantum Eigenstate Filtering: Filters in quantum simulation (e.g., QETU, polynomial/Krylov filters) amplify overlap with a target eigenspace, either pre-conditioning phase estimation or post-selecting outputs for high-fidelity ground state preparation (Karacan et al., 26 Mar 2025, Lee et al., 5 Oct 2025, Sakuma et al., 2 Jul 2025).
  • Control Theory and Safety Filters: In reinforcement learning, Q-functions whose zero sublevel sets mark safe controls (themselves called "Q-filters") can be trained and formally verified for safe control via reachability analysis and multiplicative Q-networks (Li et al., 27 May 2025).
  • Quantum Invariant Filters: In quantum control, invariant-based driving protocols translate arbitrary finite-impulse response (FIR) filter profiles into laboratory-frame control fields, offering high spectral selectivity and coherence (Cangemi et al., 18 Jun 2025).

Despite diversity in mechanism and domain, all Q-filters share the paradigm of selective information preservation or projection—operationalized as geometric, spectral, or algebraic filtering—applied at the model, state preparation, or control layer for performance and efficiency advantages.

