ComprExIT: Soft Context Compression in LLMs
- ComprExIT is a framework for soft context compression in LLMs that transmits explicit information over frozen hidden states to optimize context representation.
- It employs a two-stage mechanism: depth-wise aggregation across layers and width-wise optimal transport for coordinated slot allocation.
- ComprExIT achieves state-of-the-art long-context QA performance with <1% additional parameters, reducing computational complexity compared to full self-attention.
ComprExIT is a framework for soft context compression in LLMs, formulated as explicit information transmission over frozen LLM hidden states. It addresses key limitations of prior LLM-as-compressor paradigms—namely, representation overwriting across layers and uncoordinated allocation of compression capacity over tokens—by introducing a two-stage, globally coordinated transmission scheme: depth-wise transmission (across layers) and width-wise transmission (across token anchors to compression slots) using optimal transport. ComprExIT achieves state-of-the-art performance in long-context question answering under aggressive compression ratios, with negligible (<1%) additional parameters and without fine-tuning the base LLM (Ye et al., 3 Feb 2026).
1. Motivation and Problem Setting
LLMs incur time and memory costs in self-attention that grow quadratically with the input sequence length, which creates practical inference bottlenecks and motivates context compression. Traditional context compression paradigms include:
- Hard compression: Token selection or pruning; efficient but yields significant information loss under high compression.
- Soft compression: Insertion of continuous “gist” or “memory” tokens, relying on trainable attention to summarize the context within the LLM.
Prior “LLM-as-compressor” approaches exhibit two intertwined structural deficits. First, compression tokens progressively overwrite their state at each layer, leading to drift in representation and suboptimal context aggregation for the decoder. Second, independent token-wise aggregation leads to redundancy or omission—the allocation of compression bandwidth is not globally optimized, and information is neither uniquely nor sufficiently assigned to each compression token.
ComprExIT reframes the compression problem as explicit information transmission over cached hidden states from a frozen LLM backbone, decoupling compression from the internal self-attention mechanism and allowing precise, coordinated context selection and aggregation.
2. The ComprExIT Framework
ComprExIT operates in two coordinated stages:
2.1 Depth-wise Transmission: Layer-to-Anchor Aggregation
Given the intermediate hidden states of every token at every layer of the frozen LLM, the goal is to aggregate information across layers for each position into a single, information-rich token anchor.
Formally, each anchor is a weighted sum of that token's hidden states across layers. The softmax-normalized layer weights are computed via gating over projected layer features and learnable layer embeddings, with a temperature that controls how sharply the weights concentrate. The scoring function uses both a structural mixture of all layers and explicit layer features. This mechanism allows the anchor vector to select the layer best aligned with decoder expectations and mitigates cumulative information loss from repeated layer updates.
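The depth-wise step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the score here is simply a projected layer feature dotted with a learnable layer embedding, which omits the structural mixture term, and the names `w_proj`, `layer_emb`, and `tau` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_anchors(hidden, w_proj, layer_emb, tau=1.0):
    """hidden: (L, n, d) frozen hidden states. Returns anchors (n, d) and weights (n, L)."""
    feats = hidden @ w_proj                              # projected layer features, (L, n, p)
    # Assumed scoring rule: feature dotted with that layer's embedding.
    scores = np.einsum("lnp,lp->nl", feats, layer_emb)   # (n, L)
    alpha = softmax(scores / tau, axis=1)                # per-token weights over layers
    anchors = np.einsum("nl,lnd->nd", alpha, hidden)     # weighted sum across layers
    return anchors, alpha

rng = np.random.default_rng(0)
L, n, d, p = 4, 6, 8, 5                                  # toy sizes, chosen arbitrarily
hidden = rng.normal(size=(L, n, d))
w_proj = rng.normal(size=(d, p)) * 0.1
layer_emb = rng.normal(size=(L, p))
anchors, alpha = depthwise_anchors(hidden, w_proj, layer_emb, tau=0.5)
print(anchors.shape, alpha.sum(axis=1))
```

Lowering `tau` pushes each token's weights toward a single layer, while raising it moves the anchor toward a plain average over layers.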
2.2 Width-wise Transmission: Anchor-to-Slot Allocation via Optimal Transport
The token anchors are mapped to a smaller set of compression slots using a globally optimized transport plan. Anchor-to-slot assignment is posed as an instance of entropy-regularized optimal transport, where utility is given by the cosine similarity between projected token and slot features. Each anchor is assigned a sender capacity (via learned gating), and the receivers (slots) share uniform capacity. The Sinkhorn-Knopp algorithm yields the soft assignment subject to the marginal constraints.
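A generic Sinkhorn-Knopp iteration for this kind of allocation might look like the sketch below. It is a textbook entropy-regularized solver under stated assumptions, not the paper's code: `slot_queries` stands in for learnable slot features, the sigmoid `gate` stands in for the learned sender gating, and `eps` and `n_iters` are arbitrary choices.

```python
import numpy as np

def cosine_utility(anchors, slots):
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    s = slots / np.linalg.norm(slots, axis=1, keepdims=True)
    return a @ s.T                                   # (n, k), values in [-1, 1]

def sinkhorn_plan(utility, sender, receiver, eps=0.5, n_iters=300):
    """Soft plan P with row sums `sender` and column sums `receiver`."""
    K = np.exp((utility - utility.max()) / eps)      # Gibbs kernel; the shift only rescales K
    u = np.ones_like(sender)
    v = np.ones_like(receiver)
    for _ in range(n_iters):
        u = sender / (K @ v)                         # match the row (sender) marginals
        v = receiver / (K.T @ u)                     # match the column (receiver) marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n, k, d = 12, 3, 8                                   # toy sizes
anchors = rng.normal(size=(n, d))
slot_queries = rng.normal(size=(k, d))               # hypothetical slot features
gate = 1.0 / (1.0 + np.exp(-rng.normal(size=n)))     # stand-in for learned gating
sender = gate / gate.sum()                           # non-uniform sender capacities
receiver = np.full(k, 1.0 / k)                       # uniform slot capacities
P = sinkhorn_plan(cosine_utility(anchors, slot_queries), sender, receiver)
print(P.shape, P.sum(axis=0))
```

Because the column marginals are fixed and uniform, no slot can be left empty or absorb the whole context, which is the coordination property the text attributes to the transport formulation.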
Final slot vectors are computed as capacity-weighted linear projections of anchors, followed by an MLP for dimensionality alignment. This ensures a coordinated allocation of context, avoiding redundancy and omission, and supports a plug-in, architecture-agnostic compression module.
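Given a transport plan, slot construction could be sketched as below. The column normalization, the ReLU activation, and the names `pool_slots`, `mlp`, and `w_val` are assumptions made for illustration; the source only states that slots are capacity-weighted projections of anchors followed by an MLP.

```python
import numpy as np

def pool_slots(P, values):
    """Average projected anchors into slots, weighted by transported mass."""
    weights = P / P.sum(axis=0, keepdims=True)       # each slot's incoming mass sums to 1
    return weights.T @ values                        # (k, p)

def mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP mapping pooled features to the decoder width."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(0)
n, k, d, p, d_model = 4, 2, 6, 5, 8                  # toy sizes
anchors = rng.normal(size=(n, d))
w_val = rng.normal(size=(d, p))                      # hypothetical value projection
values = anchors @ w_val

# Hand-written plan: anchors 0-1 feed slot 0, anchors 2-3 feed slot 1.
P = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
pooled = pool_slots(P, values)
slots = mlp(pooled,
            rng.normal(size=(p, 16)), np.zeros(16),
            rng.normal(size=(16, d_model)), np.zeros(d_model))
print(pooled.shape, slots.shape)
```

With this hard plan each pooled row is just the mean of the two anchors routed to that slot; a soft plan from Sinkhorn blends all anchors in proportion to the mass they send.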
3. Computational Complexity and Parameter Efficiency
ComprExIT introduces minimal parameter overhead: less than 1% additional parameters relative to the frozen base LLM. Compression cost scales linearly with context length, substantially below the quadratic cost of full self-attention, and is also lower than the cost incurred by baseline gist-token approaches in the regime considered by the authors. Furthermore, all transmission and aggregation steps act on precomputed hidden states, supporting efficient implementation and integration (Ye et al., 3 Feb 2026).
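To give a feel for the gap between quadratic and linear scaling, the sketch below counts pairwise interactions under a simplifying assumption: full self-attention touches every token pair, while anchor-to-slot transport touches every anchor-slot pair. The slot count of 16 and the context lengths are hypothetical, and constant factors, layer counts, and Sinkhorn iterations are ignored.

```python
def pairwise_interactions(n_tokens, n_slots):
    """Return (token-token pairs, anchor-slot pairs) for one pass."""
    full_attention = n_tokens * n_tokens     # quadratic in context length
    anchor_to_slot = n_tokens * n_slots      # linear in context length for fixed slots
    return full_attention, anchor_to_slot

for n in (1_000, 8_000, 64_000):
    full, transport = pairwise_interactions(n, n_slots=16)
    print(f"n={n:>6}  n^2={full:.2e}  n*k={transport:.2e}  ratio={full / transport:.0f}x")
```

Under this counting the ratio is simply the context length divided by the slot count, so the advantage widens as contexts get longer.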
4. Implementation and Training Protocol
- Base models: Llama-3.2-1B and Llama-3.2-3B (frozen weights).
- Transmission module settings: the projection dimension and the number of compression slots.
- Optimization:
  - Phase 1: Next-token prediction on 1B tokens (SlimPajama corpus), updating only the transmission modules (batch size 2048, 1 epoch).
  - Phase 2: Supervised fine-tuning on MRQA QA benchmarks (SQuAD, NewsQA, TriviaQA, SearchQA, HotpotQA, NaturalQuestions).
- Integration:
  - Compute all hidden states in one forward pass through the frozen LLM.
  - Sequentially apply depth-wise and width-wise transmission to generate the compression slots.
  - Prepend the compression slots to the decoder prompt or key/value cache for downstream QA inference.
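The integration steps above can be sketched as a shape-level data flow. Everything here is a stand-in: random arrays replace the frozen LLM's hidden states and the question embeddings, and `mean_over_layers` and `chunk_average` are deliberately trivial placeholders for the learned depth-wise and width-wise modules, used only to show how the pieces connect.

```python
import numpy as np

def compress_context(hidden, depthwise, widthwise):
    """hidden: (layers, tokens, width) -> compression slots (k, width)."""
    anchors = depthwise(hidden)      # one anchor per token, (tokens, width)
    return widthwise(anchors)        # a few slots, (k, width)

def mean_over_layers(hidden):
    # Placeholder for learned depth-wise gating: uniform layer weights.
    return hidden.mean(axis=0)

def chunk_average(anchors, k=4):
    # Placeholder for optimal-transport allocation: contiguous chunk means.
    return np.stack([chunk.mean(axis=0) for chunk in np.array_split(anchors, k)])

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 32, 16))        # stand-in for cached hidden states
question_emb = rng.normal(size=(7, 16))      # stand-in for embedded question tokens

slots = compress_context(hidden, mean_over_layers, chunk_average)
decoder_input = np.concatenate([slots, question_emb], axis=0)   # slots come first
print(slots.shape, decoder_input.shape)
```

In this toy setup the decoder sees 4 slot vectors plus 7 question vectors instead of the original 32 context tokens.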
5. Empirical Performance
ComprExIT was benchmarked on six in-domain QA datasets under aggressive compression. It outperformed prior context compression methods on every dataset and approached (or occasionally surpassed) the best uncompressed prompt-tuning results.
| Backbone | ICAE EM/F1 | 500× EM/F1 | Beacon EM/F1 | ComprExIT EM/F1 | Prompt-tune EM/F1 |
|---|---|---|---|---|---|
| Llama-3.2-1B | 44.2/56.2 | 10.2/17.5 | 14.8/25.1 | 52.3/66.6 | 55.7/66.7 |
| Llama-3.2-3B | 53.2/65.9 | 51.9/64.5 | 29.8/39.6 | 59.0/72.9 | 62.3/73.4 |
Ablation studies show that depth-wise aggregation is vital for performance (removing it costs 16 EM points), while globally coordinated width-wise transport contributes an additional 4.6 EM points over local attention heuristics.
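For readers unfamiliar with the EM and F1 columns, the sketch below computes both for a single prediction, assuming the common SQuAD-style definitions (lowercasing, stripping punctuation and English articles, then token overlap). The source does not spell out its scoring script, so the normalization details are an assumption.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))   # 1.0
print(f1_score("in Paris France", "Paris"))               # 0.5
```

Benchmark scores are typically these per-example values averaged over the dataset and reported as percentages, with the maximum taken over multiple gold answers where available.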
6. Insights, Strengths, and Limitations
- Freezing the LLM backbone and operating exclusively on cached hidden states removes the distribution mismatch and progressive overwrite found in LLM-as-compressor approaches.
- Depth-wise gating targets the most decodable features for each token, empirically favoring early/mid layers.
- Global optimal transport allocation enforces a coordinated assignment of information to slots, addressing both redundancy and omission.
- Lightweight design with linear complexity enables deployment on longer contexts and larger models.
- Empirical robustness: Even under aggressive compression, ComprExIT matches or exceeds uncompressed prompt-tuning on several benchmarks.
The current framework uses a fixed compression ratio and a fixed OT window, and is demonstrated only up to moderate model scales. Variable compression ratios, scaling to much longer input sequences, and extension to multimodal or retrieval-augmented settings are identified as promising directions for future work. Practical deployment is straightforward due to parameter efficiency and modularity; no fine-tuning of LLM weights is required (Ye et al., 3 Feb 2026).
7. Significance in Context Compression Research
ComprExIT defines an explicit, globally coordinated paradigm for soft context compression in LLMs. It demonstrates empirically that precise allocation and aggregation of hidden-state information, decoupled from LLM self-attention, can outperform both token-selection and gist-token methods under strong compression constraints. This establishes a new benchmark for accuracy, efficiency, and architectural modularity in the design of context compression systems for LLM-based NLP pipelines (Ye et al., 3 Feb 2026).