CompAct Compression Scheme

Updated 21 April 2026
  • CompAct compression scheme is a data reduction technique that optimizes Transformer inference by retaining essential tokens via approximate leverage scores and attention metrics.
  • It employs a randomized sketching protocol with Gaussian matrices to efficiently estimate token importance, preserving key spectral properties with minimal overhead.
  • Empirical benchmarks show that CompAct achieves significant memory savings with up to 4.5× compression and negligible performance loss, making it ideal for long-context models.

The CompAct compression scheme designates a family of modern data reduction techniques, each targeting domain-specific memory or storage bottlenecks in machine learning and data processing workloads. Recent research identifies three distinct instantiations in the literature: CompaCT for lossless DICOM image compression with fractal block heuristics (Khan, 2023), CompAct for memory-efficient LLM activation compression via random projection (Shamshoum et al., 2024), and CompAct for query-agnostic Key/Value (KV) cache pruning in Transformer inference via approximate leverage scores (Chari et al., 10 Jul 2025). This article focuses on the CompAct technique for query-agnostic, parameter-free KV cache compression, with necessary contrasts to related paradigms.

1. Formal Problem Definition and Motivation

Let a Transformer’s context memory at inference consist of key and value matrices $K\in\mathbb{R}^{T\times d}$ and $V\in\mathbb{R}^{T\times d}$, where $T$ is the context length and $d$ is the hidden width per attention head. The KV cache’s $O(Td)$ memory cost becomes prohibitive for large $T$, particularly in models supporting long-context generation. The challenge is to evict up to a $1-r$ fraction of tokens (i.e., retain $k=rT<T$ tokens), forming compressed matrices $K_S, V_S$ with $|S|=k$, such that the decoder distribution $p_S(\cdot\mid q)$ induced by the compressed cache $(K_S, V_S)$ remains close to that of the full model, $p(\cdot\mid q)$, for all possible queries $q$. The formal objective is to minimize the increase in downstream negative log-likelihood (NLL) or KL-divergence,

$$\min_{S:\,|S|=k}\ \mathrm{KL}\big(p(\cdot\mid q)\,\Vert\,p_S(\cdot\mid q)\big),$$

while maximizing the compression ratio $T/k$ and ensuring minimal extra computation per generation step (Chari et al., 10 Jul 2025).
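To make the $O(Td)$ memory cost concrete, the following sketch estimates the size of a KV cache and the saving from retaining a fraction $r$ of tokens. The layer count, head count, and 2-byte element size are illustrative assumptions, not values from the source.

```python
def kv_cache_bytes(T, d, n_layers, n_heads, bytes_per_elem=2):
    """Bytes for keys plus values: two (T, d) matrices per head per layer."""
    return 2 * T * d * n_layers * n_heads * bytes_per_elem


def retained_tokens(T, r):
    """Number of tokens k = floor(r * T) kept at retention rate r."""
    return int(r * T)


# Hypothetical model configuration (assumed for illustration only).
T, d, n_layers, n_heads = 131_072, 128, 32, 8

full = kv_cache_bytes(T, d, n_layers, n_heads)
k = retained_tokens(T, 0.25)
compressed = kv_cache_bytes(k, d, n_layers, n_heads)

print(f"full: {full / 2**30:.1f} GiB, compressed: {compressed / 2**30:.1f} GiB")
```

Because the cost is linear in $T$, retaining a quarter of the tokens cuts the cache to a quarter of its size under these assumptions.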

2. Leverage Score Theory and Randomized Approximations

Statistical leverage scores are central to the CompAct KV compression algorithm. For a full-rank $K\in\mathbb{R}^{T\times d}$, the leverage score for token $i$ is

$$\ell_i = \lVert U_{i,:}\rVert_2^2 = k_i^\top (K^\top K)^{-1} k_i,$$

with $K = U\Sigma W^\top$ its thin singular value decomposition and $k_i^\top$ the $i$-th row of $K$. Tokens with high $\ell_i$ span high-variance directions of the row space and are essential to the attention mechanism.
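As a minimal sketch of the exact definition, assuming NumPy and a synthetic key matrix (the sizes and the injected outlier are illustrative, not from the source), leverage scores can be computed as squared row norms of the left singular vectors and cross-checked against the equivalent quadratic form $k_i^\top (K^\top K)^{-1} k_i$:

```python
import numpy as np


def exact_leverage_scores(K):
    """Squared row norms of the left singular vectors of K (shape (T, d))."""
    U, _, _ = np.linalg.svd(K, full_matrices=False)  # U has shape (T, d)
    return np.sum(U**2, axis=1)


def leverage_via_gram(K):
    """Equivalent form k_i^T (K^T K)^{-1} k_i, valid when K has full column rank."""
    G_inv = np.linalg.inv(K.T @ K)
    return np.einsum("ij,jk,ik->i", K, G_inv, K)


rng = np.random.default_rng(0)
K = rng.standard_normal((512, 16))   # synthetic keys: T = 512, d = 16
K[7] *= 50.0                         # make token 7 an outlier direction

scores = exact_leverage_scores(K)
```

The scores lie in $[0, 1]$ and sum to $d$ for a full-rank matrix, so a score close to 1 marks a token that almost single-handedly spans a direction of the key space.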

Exact computation is infeasible at large $T$; thus, CompAct adopts a randomized sketching protocol. Given a Gaussian sketch matrix $\Phi\in\mathbb{R}^{d\times m}$ (with $m<d$), one computes $K\Phi$ and performs SVD on $K\Phi$ to obtain approximate left singular vectors $\tilde U$ and their squared row norms $\tilde\ell_i=\lVert\tilde U_{i,:}\rVert_2^2$. Theoretical results (Theorems 2.1 and 2.2) guarantee that, for a sufficiently large sketch dimension $m$, these approximate leverage scores are multiplicatively close to the exact values up to a matrix condition number factor, thus preserving the spectral properties required for attention fidelity (Chari et al., 10 Jul 2025).
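The sketching step can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' implementation: the sketch is applied on the right as $K\Phi$ with $\Phi\in\mathbb{R}^{d\times m}$, the $1/\sqrt{m}$ scaling is a conventional choice, and the sizes and function name are hypothetical.

```python
import numpy as np


def approx_leverage_scores(K, m, rng):
    """Approximate leverage scores from a right Gaussian sketch K @ Phi."""
    T, d = K.shape
    # Gaussian sketch matrix; the 1/sqrt(m) scaling is an assumed convention.
    Phi = rng.standard_normal((d, m)) / np.sqrt(m)
    # Thin SVD of the (T, m) sketch is cheaper than the SVD of the (T, d) matrix.
    U_tilde, _, _ = np.linalg.svd(K @ Phi, full_matrices=False)
    return np.sum(U_tilde**2, axis=1)


rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 64))  # synthetic keys: T = 1024, d = 64
K[3] *= 50.0                         # token 3 is a strong outlier

approx = approx_leverage_scores(K, m=16, rng=rng)
```

The approximate scores sum to $m$ rather than $d$, so they are useful for ranking tokens rather than as calibrated estimates of the exact values.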

3. CompAct Algorithm: Outlier, Attention, Blending, and Context Calibration

CompAct executes a multi-stage scoring and token eviction chain:

  • Outlier scoring: For each token, compute its approximate leverage score using the randomized sketch, as detailed above.
  • Attention scoring: Employ non-causal self-attention blocks. The context is split into blocks of a fixed size $b$; each block computes full attention among its own tokens, and the resulting per-token attention scores are concatenated across blocks. This ensures retention of tokens frequently attended in any possible window, not just the tail of the sequence.
  • Score blending: Normalize both the outlier (leverage) score vector and the attention score vector to zero mean and unit variance, then form a blended score $s$ as a weighted combination of the two, with a heuristically tuned blending weight $\lambda$.

  • Eviction: Retain the top $k$ tokens by blended score $s$. This operation is performed independently per head and per layer.
  • Context-calibrated compression: Empirically fit a function $f$ modeling the NLL gain at a given compression rate $r$ for a given context, by token-level evaluation of the NLL difference between the compressed and full caches. For a user-specified gain threshold $\tau$, solve for the corresponding $r$ and set $k=rT$ in the eviction stage (Chari et al., 10 Jul 2025).
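The stages above can be sketched for a single attention head as follows. Several details are assumptions made for illustration: the attention score of a token is taken as the column sum of its block's attention matrix, the blend is a convex combination with weight `lam`, and calibration is simplified to a lookup over measured (rate, NLL gain) pairs instead of a fitted function. All function names are hypothetical.

```python
import numpy as np


def block_attention_scores(Q, K, block):
    """Attention mass each key receives from queries in its own block (non-causal)."""
    T, d = K.shape
    scores = np.empty(T)
    for start in range(0, T, block):
        q, k = Q[start:start + block], K[start:start + block]
        logits = q @ k.T / np.sqrt(d)
        logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True)         # each query row sums to 1
        scores[start:start + block] = attn.sum(axis=0)  # column sums per key
    return scores


def zscore(x):
    """Normalize to zero mean and unit variance."""
    return (x - x.mean()) / (x.std() + 1e-12)


def select_tokens(outlier, attention, r, lam=0.5):
    """Keep the top k = floor(r * T) tokens by blended score (assumed convex blend)."""
    s = lam * zscore(outlier) + (1.0 - lam) * zscore(attention)
    k = max(1, int(r * len(s)))
    keep = np.argpartition(s, -k)[-k:]   # indices of the k largest scores
    return np.sort(keep)                 # restore positional order


def calibrate_retention(rates, nll_gain, tau):
    """Smallest measured retention rate whose NLL gain does not exceed tau."""
    ok = [r for r, g in sorted(zip(rates, nll_gain)) if g <= tau]
    return ok[0] if ok else 1.0


rng = np.random.default_rng(0)
T, d = 256, 32
Q, K = rng.standard_normal((T, d)), rng.standard_normal((T, d))
attention = block_attention_scores(Q, K, block=64)
outlier = rng.random(T)                  # stand-in for approximate leverage scores
keep = select_tokens(outlier, attention, r=0.25)
K_S = K[keep]                            # compressed keys; values are indexed the same way
```

In a full pipeline this selection would run independently for every head and layer, with `r` supplied by the calibration step.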

4. Theoretical Properties and Computational Complexity

Leverage-score sampling is proven to preserve the covariance matrix $K^\top K$ up to $(1\pm\varepsilon)$ factors when $O(d\log d/\varepsilon^2)$ rows are sampled proportionally to $\ell_i$. For randomized sketching, the approximations to $\ell_i$ are sufficiently tight at practical sketch dimensions, ensuring that the selection procedure yields a submatrix whose spectral properties are preserved up to condition-number-dependent scaling.
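The covariance-preservation property can be checked numerically. The sketch below uses the classical randomized scheme (rows sampled with replacement proportionally to leverage and rescaled), which is the setting of the theoretical guarantee rather than the deterministic top-$k$ eviction used in practice; the sizes and function names are illustrative assumptions.

```python
import numpy as np


def leverage_sample(K, s, rng):
    """Sample s rows with probability proportional to leverage, then rescale."""
    U, _, _ = np.linalg.svd(K, full_matrices=False)
    lev = np.sum(U**2, axis=1)
    p = lev / lev.sum()
    idx = rng.choice(K.shape[0], size=s, replace=True, p=p)
    # Rescaling by 1/sqrt(s * p_i) makes K_s^T K_s an unbiased estimate of K^T K.
    return K[idx] / np.sqrt(s * p[idx])[:, None]


def relative_spectral_error(K, K_s):
    """Smallest eps with (1 - eps) K^T K <= K_s^T K_s <= (1 + eps) K^T K."""
    G = K.T @ K
    L_inv = np.linalg.inv(np.linalg.cholesky(G))   # whitening transform
    E = L_inv @ (K_s.T @ K_s) @ L_inv.T - np.eye(G.shape[0])
    return np.linalg.norm(E, 2)


rng = np.random.default_rng(0)
K = rng.standard_normal((2000, 8))
K_s = leverage_sample(K, s=800, rng=rng)
eps = relative_spectral_error(K, K_s)
print(f"relative spectral error: {eps:.3f}")
```

Increasing `s` shrinks the error roughly as $1/\sqrt{s}$, which matches the $\varepsilon^{-2}$ dependence in the sample-size bound.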

Computationally, CompAct’s dominant costs per cache compression event are:

  • $O(Tdm)$ for computing the sketch product $K\Phi$, where $m$ is the sketch dimension,
  • $O(Tm^2)$ for the SVD of $K\Phi$,
  • $O(Tbd)$ for attention block scoring with block size $b$,
  • $O(T\log T)$ for top-$k$ selection (or $O(T)$ via linear-time selection).

For $m, b \ll T$, the overhead is negligible compared to a full attention forward pass and sublinear compared to the cost of windowed baselines at large $T$ (Chari et al., 10 Jul 2025).
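A rough operation count makes the comparison with full attention tangible. The sketch assumes a sketch dimension `m` and block size `b`, with leading-order terms $Tdm$ (sketch product), $Tm^2$ (SVD of the sketch), and $Tbd$ (block attention), against $T^2 d$ for full attention. Constants are dropped and the configuration is hypothetical.

```python
def compact_cost(T, d, m, b):
    """Leading-order operation counts per compression event (constants dropped)."""
    return {
        "sketch": T * d * m,           # forming the (T, m) sketch of the keys
        "svd": T * m * m,              # thin SVD of the (T, m) sketch
        "block_attention": T * b * d,  # T/b blocks, each costing b * b * d
    }


def full_attention_cost(T, d):
    """Leading-order cost of one full (T x T) attention score computation."""
    return T * T * d


# Hypothetical configuration (assumed for illustration only).
T, d, m, b = 65_536, 128, 32, 256
costs = compact_cost(T, d, m, b)
total = sum(costs.values())
ratio = total / full_attention_cost(T, d)
print(f"compression overhead relative to full attention: {ratio:.4f}")
```

Every term is linear in $T$, so the ratio shrinks as the context grows.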

5. Empirical Performance and Benchmarks

CompAct demonstrates substantial KV cache savings with minimal loss in generative performance across synthetic (RULER) and real-world (LongBench) benchmarks:

  • On RULER with Llama3.1, results are reported as a fraction of full-KV performance at 75%, 50%, and 10% retention.
  • In LongBench, at 50% retention, CompAct matches or improves upon full-KV scores across task families; at 25% retention, SnapKV and PyramidKV fall further below the full-KV baseline than CompAct does.
  • Context-calibrated compression is reported for Llama3.1 in both zero-shot and finetuned settings at a fixed gain threshold, with negligible NLL increase.
  • Ablations confirm that removing attention scores leads to a significant drop in quality, while replacing approximate with exact leverage computation yields only small changes. Removal of either scoring mechanism significantly degrades quality.
  • Memory savings are linear in the KV compression ratio, and latency overhead is competitive with or better than prior query-agnostic methods at long context lengths (Chari et al., 10 Jul 2025).

6. Connections, Distinctions, and Domain Scope

The CompAct label is also used for GPU activation compression via random projections (Shamshoum et al., 2024). That method, in contrast, replaces activations $X$ with low-dimensional sketches $XP$, where $P$ is a random projection matrix with far fewer columns than $X$, substantially reducing training memory footprint (25–30% in pretraining, 50% in finetuning) while incurring only minimal approximation error due to Johnson–Lindenstrauss and subspace embedding theorems. The connection is that both paradigms utilize randomized sketching for rank reduction and subspace preservation, but the LLM-activation variant compresses compute graphs during training, not inference-time KV caches.
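The activation-sketching idea can be illustrated with a linear layer. This is a minimal sketch under assumptions, not the published method: activations $X$ are stored as $XP$ with a Gaussian $P$ scaled so that $\mathbb{E}[PP^\top]=I$, and the weight gradient $X^\top G$ is estimated from the sketch. Function names and sizes are hypothetical.

```python
import numpy as np


def compress_activations(X, r, rng):
    """Store X @ P instead of X, with P of shape (d, r) and E[P P^T] = I_d."""
    d = X.shape[1]
    P = rng.standard_normal((d, r)) / np.sqrt(r)
    return X @ P, P


def approx_weight_grad(X_sketch, P, G):
    """Estimate X^T G from the sketch: (X P P^T)^T G = P (X P)^T G."""
    return P @ (X_sketch.T @ G)


rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))   # batch of 64 activations, width 256
G = rng.standard_normal((64, 32))    # upstream gradient for a 256 -> 32 layer

X_sketch, P = compress_activations(X, r=64, rng=rng)
grad = approx_weight_grad(X_sketch, P, G)
memory_fraction = X_sketch.nbytes / X.nbytes
```

In practice $P$ would be regenerated from a stored random seed rather than kept in memory, so only the sketch contributes to the activation footprint.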

The original usage under the title "CompaCT" refers to lossless medical image compression, leveraging fractal Hilbert-curve ordering, 4×4 block segmentation plus selective meshing, delta-plus-entropy coding, and DEFLATE. In that setting, "CompaCT" achieves an average compression ratio 37% better than DICOM JPEG2000-lossless, with provably zero RMSE and full invertibility (Khan, 2023).
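Two of these stages, delta coding followed by DEFLATE, are enough to show why the scheme is exactly invertible. The sketch below is a simplification under stated assumptions: pixels are traversed in row-major order rather than along a Hilbert curve, block segmentation and meshing are omitted, and the function names and test image are hypothetical.

```python
import zlib

import numpy as np


def encode(img):
    """Delta-code the flattened pixels, then DEFLATE the residuals."""
    flat = img.astype(np.int32).ravel()
    delta = np.diff(flat, prepend=0)   # first residual equals the first pixel
    return zlib.compress(delta.astype("<i4").tobytes(), level=9)


def decode(blob, shape, dtype):
    """Invert DEFLATE, then undo delta coding with a cumulative sum."""
    delta = np.frombuffer(zlib.decompress(blob), dtype="<i4")
    return np.cumsum(delta).astype(dtype).reshape(shape)


# Synthetic smooth 16-bit image standing in for a medical scan slice.
img = np.add.outer(np.arange(64), np.arange(64)).astype(np.uint16)

blob = encode(img)
restored = decode(blob, img.shape, img.dtype)
print(f"{img.nbytes} bytes -> {len(blob)} bytes")
```

Because every stage is a bijection on integer data, the round trip reproduces the input bit for bit, which is what zero RMSE means in this setting.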

7. Summary Table: CompAct Family Overview

| CompAct Variant | Domain | Core Mechanism |
| --- | --- | --- |
| CompAct (KV cache) | LLM inference, context window | Approx. leverage & global attn. |
| CompAct (activations) | LLM training | Random projection sketching |
| CompaCT (medical) | DICOM lossless imaging | Hilbert fractal, delta, block meshing |

The schemes described under the CompAct and CompaCT labels share the goal of domain-adapted compression with negligible or quantifiable loss in downstream fidelity, but they differ in mechanism: the two LLM variants rely on randomized sketching with subspace or spectral preservation guarantees, whereas CompaCT is a deterministic, lossless image codec. CompAct KV compression is distinguished by its query-agnostic robustness, theoretical spectral guarantees, empirical superiority at high compression ratios, and minimal runtime overhead, making it suitable for long-context LLM deployment at scale (Chari et al., 10 Jul 2025).
