CompAct Compression Scheme
- The CompAct compression scheme is a data reduction technique that optimizes Transformer inference by retaining essential tokens via approximate leverage scores and attention metrics.
- It employs a randomized sketching protocol with Gaussian matrices to efficiently estimate token importance, preserving key spectral properties with minimal overhead.
- Empirical benchmarks show that CompAct achieves significant memory savings with up to 4.5× compression and negligible performance loss, making it ideal for long-context models.
The CompAct compression scheme designates a family of modern data reduction techniques, each targeting domain-specific memory or storage bottlenecks in machine learning and data processing workloads. Recent research identifies three distinct instantiations in the literature: CompaCT for lossless DICOM image compression with fractal block heuristics (Khan, 2023), CompAct for memory-efficient LLM activation compression via random projection (Shamshoum et al., 2024), and CompAct for query-agnostic Key/Value (KV) cache pruning in Transformer inference via approximate leverage scores (Chari et al., 10 Jul 2025). This article focuses on the CompAct technique for query-agnostic, parameter-free KV cache compression, with necessary contrasts to related paradigms.
1. Formal Problem Definition and Motivation
Let a Transformer's context memory at inference be composed of key and value matrices, $K \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{n \times d}$, where $n$ is the context length and $d$ the hidden width per attention head. The KV cache's $O(nd)$ memory cost becomes prohibitive for large context $n$, particularly in models supporting long-context generation. The challenge is to evict up to a $1-r$ fraction of tokens (i.e., retain $\lceil rn \rceil$ tokens), forming compressed matrices $\tilde{K}, \tilde{V} \in \mathbb{R}^{m \times d}$ with $m = \lceil rn \rceil$, such that the decoder distribution $\tilde{p}$ induced by the compressed cache $(\tilde{K}, \tilde{V})$ remains close to that of the full model for all possible queries $q$. The formal objective is to minimize the increase in downstream negative log-likelihood (NLL) or KL-divergence,
$$\min_{\tilde{K}, \tilde{V}} \; \mathrm{KL}\big(p(\cdot \mid q, K, V) \,\|\, \tilde{p}(\cdot \mid q, \tilde{K}, \tilde{V})\big),$$
while maximizing the compression ratio $1/r$ and ensuring minimal extra computation per generation step (Chari et al., 10 Jul 2025).
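To make the memory pressure concrete, the following minimal sketch computes the KV cache size for one sequence before and after eviction. The model shape (32 layers, 8 KV heads, head width 128, 16-bit storage) and the function names are illustrative assumptions, not values taken from the cited work.

```python
import math

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence."""
    # The leading factor 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * n_tokens * head_dim * bytes_per_elem

def compressed_kv_cache_bytes(n_tokens, retention, **shape):
    """Cache size after keeping ceil(retention * n_tokens) tokens per head."""
    kept = math.ceil(retention * n_tokens)
    return kv_cache_bytes(kept, **shape)

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
full = kv_cache_bytes(131072, **shape)
small = compressed_kv_cache_bytes(131072, 0.25, **shape)
print(full / 2**30, "GiB full;", small / 2**30, "GiB at 25% retention")
```

Under these assumptions the cache shrinks in direct proportion to the retention fraction $r$, which is why the eviction budget is the main lever on inference memory.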
2. Leverage Score Theory and Randomized Approximations
Statistical leverage scores are central to the CompAct KV compression algorithm. For a full-rank $K \in \mathbb{R}^{n \times d}$, the leverage score for token $i$ is
$$\ell_i = \lVert U_{i,:} \rVert_2^2,$$
with $K = U \Sigma W^\top$ its singular value decomposition. Tokens with high $\ell_i$ span high-variance directions of the row space and are essential to the attention mechanism.
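As an illustration of the definition, exact leverage scores can be read off the thin SVD as the squared row norms of the left singular vectors. The sketch below uses synthetic Gaussian keys with one artificially scaled row; the data and the function name are assumptions made for the example.

```python
import numpy as np

def exact_leverage_scores(K):
    """Row leverage scores of K (n x d): squared row norms of U in the thin SVD."""
    U, _, _ = np.linalg.svd(K, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
K = rng.standard_normal((512, 64))   # synthetic keys: n = 512 tokens, d = 64
K[7] *= 25.0                         # turn token 7 into an outlier
scores = exact_leverage_scores(K)

print("sum of scores:", scores.sum())            # equals rank(K) for full-rank K
print("highest-leverage token:", int(np.argmax(scores)))
```

Each score lies in $[0, 1]$ and the scores sum to the rank of $K$, so a token with a score near 1 occupies a direction that almost no other token covers.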
Exact computation is infeasible at large $n$; thus, CompAct adopts a randomized sketching protocol. Given a Gaussian sketch matrix $\Phi \in \mathbb{R}^{d \times k}$ (with $k \ll d$), one computes $K\Phi \in \mathbb{R}^{n \times k}$ and performs SVD on $K\Phi$ to obtain approximate left singular vectors $\tilde{U}$ and their squared row norms $\tilde{\ell}_i = \lVert \tilde{U}_{i,:} \rVert_2^2$. Theoretical results (Theorems 2.1 and 2.2) guarantee that, for a sufficiently large sketch dimension $k$, these approximate leverage scores are multiplicatively close to the exact values up to a matrix condition number factor, thus preserving the spectral properties required for attention fidelity (Chari et al., 10 Jul 2025).
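The sketching step can be illustrated as follows. This is a minimal sketch under stated assumptions: the key matrix is right-multiplied by a Gaussian matrix of shape $d \times k$, and the scores are the squared row norms of the left singular vectors of the product. The sketch dimension `k=16`, the synthetic data, and the function name are illustrative choices, not settings from the cited paper.

```python
import numpy as np

def sketched_leverage_scores(K, k, rng):
    """Approximate leverage scores from the SVD of the sketched matrix K @ Phi."""
    n, d = K.shape
    Phi = rng.standard_normal((d, k)) / np.sqrt(k)   # Gaussian sketch, d x k
    U_tilde, _, _ = np.linalg.svd(K @ Phi, full_matrices=False)
    return np.sum(U_tilde**2, axis=1)                # one score per token

rng = np.random.default_rng(0)
K = rng.standard_normal((512, 64))   # synthetic keys
K[7] *= 25.0                         # outlier token
approx = sketched_leverage_scores(K, k=16, rng=rng)

print("sum of approximate scores:", approx.sum())
print("highest-scoring token:", int(np.argmax(approx)))
```

The SVD now runs on an $n \times k$ matrix instead of an $n \times d$ one, and in this toy setting the outlier token still receives the largest score.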
3. CompAct Algorithm: Outlier, Attention, Blending, and Context Calibration
CompAct executes a multi-stage scoring and token eviction chain:
- Outlier scoring: For each token, compute its approximate leverage score using the randomized sketch, as detailed above.
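- Attention scoring: Employ non-causal self-attention blocks. Context is split into blocks of size $w$; each block computes full attention within itself, and the resulting per-token scores are concatenated into a single vector $s_{\mathrm{attn}}$. This ensures retention of tokens frequently attended in any possible window, not just the tail of the sequence.
- Score blending: Normalize both outlier ($s_{\mathrm{out}}$) and attention ($s_{\mathrm{attn}}$) vectors to zero mean and unit variance, giving $\hat{s}_{\mathrm{out}}$ and $\hat{s}_{\mathrm{attn}}$, then form a blended score,
$$s = \lambda\,\hat{s}_{\mathrm{out}} + (1-\lambda)\,\hat{s}_{\mathrm{attn}},$$
with heuristically tuned $\lambda$.
- Eviction: Retain the top $\lceil rn \rceil$ tokens by blended score $s$, where $r$ is the retention fraction and $n$ the context length. This operation is performed independently per head and per layer.
- Context-calibrated compression: Empirically fit a function $g(r; c)$ modeling the NLL gain at a given compression rate (retention fraction $r$) for context $c$, by token-level evaluation of
$$\Delta_{\mathrm{NLL}}(r; c) = \mathrm{NLL}_{\mathrm{compressed}}(r; c) - \mathrm{NLL}_{\mathrm{full}}(c).$$
For a user-specified gain threshold $\tau$, solve $g(r^*; c) = \tau$ for $r^*$ and set $r = r^*$ in the eviction stage (Chari et al., 10 Jul 2025).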
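The scoring and eviction stages above can be sketched for a single head as follows. Several details are assumptions made for illustration rather than the authors' specification: a key's attention score is taken as the mean attention it receives within its block, the blend is a convex combination with weight `lam`, and the inputs are synthetic.

```python
import numpy as np

def blocked_attention_scores(Q, K, block):
    """Non-causal attention inside each block; score = mean attention a key receives."""
    n, d = K.shape
    scores = np.empty(n)
    for start in range(0, n, block):
        q, k = Q[start:start + block], K[start:start + block]
        logits = q @ k.T / np.sqrt(d)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)     # softmax over keys in the block
        scores[start:start + block] = probs.mean(axis=0)
    return scores

def zscore(x):
    """Normalize to zero mean and unit variance."""
    return (x - x.mean()) / (x.std() + 1e-12)

def select_tokens(outlier, attn, retention, lam=0.5):
    """Sorted indices of retained tokens; requires 0 < retention <= 1."""
    blended = lam * zscore(outlier) + (1.0 - lam) * zscore(attn)
    keep = int(np.ceil(retention * blended.size))
    top = np.argpartition(blended, -keep)[-keep:]     # top-`keep` without a full sort
    return np.sort(top)

rng = np.random.default_rng(0)
n, d = 256, 32
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
outlier = np.full(n, 0.1)
outlier[7] = 1.0                                      # stand-in for a high-leverage token
attn = blocked_attention_scores(Q, K, block=64)
kept = select_tokens(outlier, attn, retention=0.25)
print(kept.size, "tokens kept; token 7 kept:", 7 in kept)
```

In a real cache this selection would be repeated independently for every head and layer, and the returned indices would be used to gather rows of both the key and value matrices.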
4. Theoretical Properties and Computational Complexity
Leverage-score sampling is proven to preserve the covariance matrix $K^\top K$ up to $(1 \pm \varepsilon)$ factors when $O(d \log d / \varepsilon^2)$ rows are sampled proportionally to their leverage scores $\ell_i$. For randomized sketching, the approximations to $\ell_i$ are sufficiently tight at the small sketch dimensions used in practice, ensuring that the selection procedure yields a submatrix whose spectral properties are preserved up to condition-number-dependent scaling.
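The covariance-preservation property can be checked numerically with a small experiment. Note that this illustrates the classical sampling result with rescaled rows, not CompAct's deterministic top-score eviction; the matrix sizes, sample count, and function names are arbitrary choices for the example.

```python
import numpy as np

def leverage_sample(K, m, rng):
    """Sample m rows with probability proportional to leverage, rescaled to be unbiased."""
    U, _, _ = np.linalg.svd(K, full_matrices=False)
    lev = np.sum(U**2, axis=1)
    p = lev / lev.sum()
    idx = rng.choice(K.shape[0], size=m, replace=True, p=p)
    # Dividing by sqrt(m * p_i) makes E[S^T S] equal to K^T K.
    return K[idx] / np.sqrt(m * p[idx])[:, None]

def relative_cov_error(K, S):
    """Spectral-norm error of S^T S against K^T K, relative to ||K^T K||."""
    G = K.T @ K
    return np.linalg.norm(S.T @ S - G, 2) / np.linalg.norm(G, 2)

rng = np.random.default_rng(0)
K = rng.standard_normal((4000, 8))
S = leverage_sample(K, 800, rng)
err = relative_cov_error(K, S)
print("relative spectral error with 800 of 4000 rows:", err)
```

Increasing the sample count `m` shrinks the error roughly like $1/\sqrt{m}$, in line with the $\varepsilon^{-2}$ dependence in the row bound.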
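Computationally, with context length $n$, head width $d$, sketch dimension $k$, and attention block size $w$, CompAct's dominant costs per cache compression event are:
- $O(ndk)$ for computing the sketched key matrix,
- $O(nk^2)$ for the SVD of the $n \times k$ sketch,
- $O(nwd)$ for attention block scoring,
- $O(n \log n)$ for top-$\lceil rn \rceil$ selection (or $O(n)$ via linear-time selection).
For $k \ll d$, the overhead is negligible compared to a full attention forward pass and sublinear compared to the cost of windowed baselines at large $n$ (Chari et al., 10 Jul 2025).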
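A back-of-the-envelope cost model helps compare the compression overhead with one full attention pass. The sketch below assumes leading-order operation counts of $n d k$ (sketching), $n k^2$ (SVD of an $n \times k$ matrix), $n w d$ (blockwise attention with block size $w$), $n$ (linear-time selection), and $n^2 d$ for full attention. Constants are dropped and the example sizes are hypothetical.

```python
def compaction_cost(n, d, k, w):
    """Leading-order operation counts per head for one compression event."""
    return {
        "sketch": n * d * k,           # multiply n x d keys by a d x k sketch
        "svd": n * k * k,              # thin SVD of the n x k sketched matrix
        "block_attention": n * w * d,  # n / w blocks, each costing w * w * d
        "selection": n,                # linear-time top-m selection
    }

def full_attention_cost(n, d):
    """Leading-order cost of one full n x n attention score computation."""
    return n * n * d

# Hypothetical sizes chosen for illustration.
n, d, k, w = 65536, 128, 16, 512
parts = compaction_cost(n, d, k, w)
ratio = sum(parts.values()) / full_attention_cost(n, d)
print(parts)
print("overhead relative to full attention:", ratio)
```

Under these assumptions the blockwise attention term dominates, and the total stays below one percent of a full attention pass because every term is linear in $n$.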
5. Empirical Performance and Benchmarks
CompAct demonstrates substantial KV cache savings with minimal loss in generative performance across synthetic (RULER) and real-world (LongBench) benchmarks:
- On RULER with Llama3.1, results are reported at 75%, 50%, and 10% retention, each as a fraction of full-KV performance.
- In LongBench, at 50% retention, CompAct matches or improves upon full-KV scores across task families; at 25% retention, SnapKV and PyramidKV degrade further relative to the baseline than CompAct does.
- Context-calibrated compression is evaluated on Llama3.1 in both zero-shot and finetuned settings at a fixed NLL-gain threshold, with negligible NLL increase.
- Ablations confirm that removing attention scores leads to a significant drop in quality, while switching between approximate and exact leverage computation changes results only slightly. Removal of either scoring mechanism (outlier or attention) significantly degrades quality.
- Memory savings are linear in the KV compression ratio, and latency overhead is competitive with or better than prior query-agnostic methods at long context lengths (Chari et al., 10 Jul 2025).
6. Connections, Distinctions, and Domain Scope
The CompAct label is also used for GPU activation compression via random projections (Shamshoum et al., 2024). That method, in contrast, replaces stored activations $X$ with low-dimensional sketches $XP$ (where $P$ is a random projection matrix onto $k \ll d$ dimensions), substantially reducing training memory footprint (25–30% in pretraining, 50% in finetuning), while incurring only minimal approximation error, as supported by the Johnson–Lindenstrauss lemma and subspace embedding theorems. The connection is that both paradigms utilize randomized sketching for rank reduction and subspace preservation, but the LLM-activation variant compresses compute graphs during training, not inference-time KV caches.
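The random-projection idea behind the activation variant can be illustrated with a Johnson–Lindenstrauss style check. This is only a sketch of the projection step under assumed shapes (32 activation rows of width 1024 projected to 256 dimensions); it does not reproduce how the cited method uses the sketches in the backward pass.

```python
import numpy as np

def sketch_activations(X, k, rng):
    """Project activations X (b x d) to k dimensions with a scaled Gaussian matrix."""
    d = X.shape[1]
    P = rng.standard_normal((d, k)) / np.sqrt(k)   # scaling keeps squared norms unbiased
    return X @ P

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 1024))    # synthetic activations
Z = sketch_activations(X, 256, rng)

memory_ratio = Z.size / X.size
norm_err = np.abs(np.sum(Z**2, axis=1) / np.sum(X**2, axis=1) - 1.0)
print("stored fraction:", memory_ratio)
print("worst relative error in squared row norm:", norm_err.max())
```

Storing `Z` instead of `X` cuts this tensor to a quarter of its size, while squared row norms are preserved to within a few tens of percent at worst in this toy setting; a larger `k` tightens the error at the cost of less memory saved.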
The original usage under the title "CompaCT" refers to lossless medical image compression, leveraging fractal Hilbert-curve ordering, 4×4 block segmentation plus selective meshing, delta-plus-entropy coding, and DEFLATE. In that setting, CompaCT achieves an average compression ratio 37% better than DICOM JPEG2000-lossless, with provably zero RMSE and full invertibility (Khan, 2023).
7. Summary Table: CompAct Family Overview
| CompAct Variant | Domain | Core Mechanism |
|---|---|---|
| CompAct (KV cache) | LLM inference, context window | Approx. leverage & global attn. |
| CompAct (activations) | LLM training | Random projection sketching |
| CompaCT (medical) | DICOM lossless imaging | Hilbert fractal, delta, block meshing |
The two LLM-oriented CompAct schemes rely on randomized sketching with subspace-preservation guarantees, whereas CompaCT is a deterministic lossless image codec; all three report substantial compression with zero, negligible, or quantifiable loss in downstream fidelity. CompAct KV compression is distinguished by its query-agnostic robustness, theoretical spectral guarantees, empirical superiority at high compression ratios, and minimal runtime overhead, making it suitable for long-context LLM deployment at scale (Chari et al., 10 Jul 2025).