
ComprExIT: Soft Context Compression in LLMs

Updated 2 July 2026
  • ComprExIT is a framework for soft context compression in LLMs that transmits explicit information over frozen hidden states to optimize context representation.
  • It employs a two-stage mechanism: depth-wise aggregation across layers and width-wise optimal transport for coordinated slot allocation.
  • ComprExIT achieves state-of-the-art long-context QA performance with <1% additional parameters, reducing computational complexity compared to full self-attention.

ComprExIT is a framework for soft context compression in LLMs, formulated as explicit information transmission over frozen LLM hidden states. It addresses key limitations of prior LLM-as-compressor paradigms—namely, representation overwriting across layers and uncoordinated allocation of compression capacity over tokens—by introducing a two-stage, globally coordinated transmission scheme: depth-wise transmission (across layers) and width-wise transmission (across token anchors to compression slots) using optimal transport. ComprExIT achieves state-of-the-art performance in long-context question answering under aggressive compression ratios, with negligible (<1%) additional parameters and without fine-tuning the base LLM (Ye et al., 3 Feb 2026).

1. Motivation and Problem Setting

LLMs incur quadratic time and memory complexity in self-attention over long input sequences of length $N$, imposing practical bottlenecks in inference and prompting the need for context compression. Traditional context compression paradigms include:

  • Hard compression: Token selection or pruning; efficient but yields significant information loss under high compression.
  • Soft compression: Insertion of continuous “gist” or “memory” tokens, relying on trainable attention to summarize the context within the LLM.

Prior “LLM-as-compressor” approaches exhibit two intertwined structural deficits. First, compression tokens progressively overwrite their state at each layer, leading to drift in representation and suboptimal context aggregation for the decoder. Second, independent token-wise aggregation leads to redundancy or omission—the allocation of compression bandwidth is not globally optimized, and information is neither uniquely nor sufficiently assigned to each compression token.

ComprExIT reframes the compression problem as explicit information transmission over cached hidden states from a frozen LLM backbone, decoupling compression from the internal self-attention mechanism and allowing precise, coordinated context selection and aggregation.

2. The ComprExIT Framework

ComprExIT operates in two coordinated stages:

2.1 Depth-wise Transmission: Layer-to-Anchor Aggregation

Given all intermediate hidden states $\{\bm h_t^{(\ell)}\}$ for each token $t$ and layer $\ell$, the goal is to aggregate information across the $L$ layers for each position $t$ into a single, information-rich token anchor $\tilde{\bm h}_t$.

Formally,

$$\tilde{\bm h}_t = \sum_{\ell=1}^{L} \alpha_{t,\ell} W_a \bm h_t^{(\ell)}$$

where softmax-normalized layer attentions $\alpha_{t,\ell}$ are computed via gating over projected layer features and learnable layer embeddings, with temperature $\tau$ for control. The scoring function uses both a structural mixture of all layers and explicit layer features. This mechanism allows the anchor vector to select the layer best aligned with decoder expectations and mitigates cumulative information loss from repeated layer updates.
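
The aggregation formula above can be illustrated with a minimal sketch in plain Python. It assumes the per-layer scores are already available: `layer_scores` is a hypothetical stand-in for the gating output, whose exact form is not reproduced here. The helper names (`softmax`, `matvec`, `depthwise_anchor`) are illustrative and not taken from the paper.

```python
import math

def softmax(scores, tau=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def depthwise_anchor(layer_states, layer_scores, W_a, tau=1.0):
    """Aggregate one token's L layer states into a single anchor.

    layer_states: L vectors h_t^(l) for a fixed token t.
    layer_scores: L raw scores (assumed given; stand-in for the gating).
    W_a: shared projection matrix applied to every layer state.
    """
    alphas = softmax(layer_scores, tau)
    anchor = [0.0] * len(W_a)
    for alpha, h in zip(alphas, layer_states):
        projected = matvec(W_a, h)
        anchor = [a + alpha * p for a, p in zip(anchor, projected)]
    return anchor, alphas

# Toy example: 3 layers, hidden size 2, identity projection.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform layer weights
W_a = [[1.0, 0.0], [0.0, 1.0]]
anchor, alphas = depthwise_anchor(states, scores, W_a, tau=0.5)
```

With equal scores the anchor is simply the mean of the projected layer states. Lowering `tau` sharpens the weights toward the highest-scoring layer.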

2.2 Width-wise Transmission: Anchor-to-Slot Allocation via Optimal Transport

The $N$ token anchors $\{\tilde{\bm h}_t\}$ are mapped to a reduced set of compression slots using a globally optimized transport plan. Anchor-to-slot assignment is posed as an instance of entropy-regularized optimal transport, where utility is given by cosine similarity between projected token and slot features. Each anchor $\tilde{\bm h}_t$ is assigned a sender capacity (via learned gating), and receivers (slots) share uniform capacity. The Sinkhorn-Knopp algorithm yields the soft assignment subject to marginal constraints.
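
As a rough illustration of the transport step, the sketch below runs Sinkhorn-Knopp iterations on a cosine-similarity utility in plain Python. Several things are assumptions made for the example: the anchor and slot features are passed in directly rather than produced by learned projections, the sender capacities are supplied by hand rather than by a gate, and the regularization strength `eps` and iteration count are arbitrary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def sinkhorn_plan(anchors, slots, sender_cap, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan from N anchors to K slots.

    sender_cap: N non-negative capacities summing to 1 (row marginals).
    Slots share a uniform capacity of 1/K (column marginals).
    """
    n, k = len(anchors), len(slots)
    recv_cap = [1.0 / k] * k
    # Gibbs kernel: a higher cosine utility gives a larger kernel entry.
    kernel = [[math.exp(cosine(a, s) / eps) for s in slots] for a in anchors]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [sender_cap[i] / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [recv_cap[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

# Toy example: 4 anchors, 2 slots, hand-picked sender capacities.
anchors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slot_feats = [[1.0, 0.0], [0.0, 1.0]]
caps = [0.4, 0.1, 0.4, 0.1]
plan = sinkhorn_plan(anchors, slot_feats, caps)
```

Each row of `plan` sums to that anchor's sender capacity and each column sums to the shared slot capacity, so no slot can be starved or overloaded regardless of how similar the anchors are to one another.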

Final slot vectors are computed as capacity-weighted linear projections of anchors, followed by an MLP for dimensionality alignment. This ensures a coordinated allocation of context, avoiding redundancy and omission, and supports a plug-in, architecture-agnostic compression module.
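
A minimal sketch of the final mixing step, assuming a transport plan is already available as an N x K nested list. Normalizing each slot by its column mass is an assumption of this example, and the linear projection and MLP mentioned above are omitted. The function name `aggregate_slots` is hypothetical.

```python
def aggregate_slots(plan, anchors):
    """Mix anchors into slots using a transport plan of shape N x K.

    Each slot is the plan-weighted average of the anchors routed to it.
    Dividing by the column mass is an assumption of this sketch.
    """
    n, k = len(plan), len(plan[0])
    dim = len(anchors[0])
    slots = []
    for j in range(k):
        mass = sum(plan[i][j] for i in range(n))
        slot = [sum(plan[i][j] * anchors[i][d] for i in range(n)) / mass
                for d in range(dim)]
        slots.append(slot)
    return slots

# Toy plan: anchor 0 goes entirely to slot 0; anchors 1 and 2 share slot 1.
plan = [[0.5, 0.0], [0.0, 0.25], [0.0, 0.25]]
anchors = [[2.0, 0.0], [0.0, 4.0], [0.0, 8.0]]
slots = aggregate_slots(plan, anchors)
```

Here slot 0 reproduces anchor 0, while slot 1 is the equal-weight average of anchors 1 and 2.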

3. Computational Complexity and Parameter Efficiency

ComprExIT introduces minimal parameter overhead: only ≈1% additional parameters compared to the frozen base LLM. Compression cost scales linearly with the context length $N$, substantially below the $O(N^2)$ of full self-attention, and, in the regime the authors identify, below the cost incurred by baseline gist-token approaches. Furthermore, all transmission and aggregation steps act on precomputed hidden states, supporting efficient implementation and integration (Ye et al., 3 Feb 2026).

4. Implementation and Training Protocol

  • Base models: Llama-3.2-1B and Llama-3.2-3B (frozen weights).
  • Transmission module dimensions: […]; […] slots.
  • Optimization:
    • Phase 1: Next-token prediction on 1B tokens (SlimPajama corpus), updating only transmission modules (batch size 2048, lr = […], 1 epoch).
    • Phase 2: Supervised fine-tuning on MRQA QA benchmarks (SQuAD, NewsQA, TriviaQA, SearchQA, HotpotQA, NaturalQuestions).
  • Integration:
    • Compute all hidden states ($\{\bm h_t^{(\ell)}\}$) in one forward pass through the frozen LLM.
    • Sequentially apply depth-wise and width-wise transmission to generate the compression slots.
    • Prepend the compression slots to the decoder prompt or key/value cache for downstream QA inference.
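
The integration steps above can be wired together as in the following toy sketch, which only illustrates the data flow and the resulting sequence length. Both transmission stages are replaced by deliberately simple stand-ins (a uniform mean over layers and a mean over contiguous chunks), not the learned gating and optimal transport described earlier. The hidden states are hand-written nested lists rather than the output of a real LLM, and all function names are hypothetical.

```python
def depthwise_stub(hidden):
    """Stand-in for depth-wise transmission: uniform mean over layers.

    hidden is indexed as hidden[layer][token][dim].
    """
    n_layers, n_tokens, dim = len(hidden), len(hidden[0]), len(hidden[0][0])
    return [[sum(hidden[l][t][d] for l in range(n_layers)) / n_layers
             for d in range(dim)] for t in range(n_tokens)]

def widthwise_stub(anchors, n_slots):
    """Stand-in for width-wise transmission: mean over contiguous chunks.

    Assumes the number of anchors is divisible by n_slots.
    """
    chunk = len(anchors) // n_slots
    dim = len(anchors[0])
    slots = []
    for k in range(n_slots):
        group = anchors[k * chunk:(k + 1) * chunk]
        slots.append([sum(a[d] for a in group) / len(group)
                      for d in range(dim)])
    return slots

def compress_and_prepend(hidden, question_embeddings, n_slots):
    """Compress cached context states and prepend them to the prompt."""
    anchors = depthwise_stub(hidden)          # one anchor per context token
    slots = widthwise_stub(anchors, n_slots)  # N anchors -> n_slots vectors
    return slots + question_embeddings        # decoder sees slots first

# Toy input: 2 layers, 4 context tokens, hidden size 1.
hidden = [
    [[0.0], [2.0], [4.0], [6.0]],  # layer 0
    [[2.0], [4.0], [6.0], [8.0]],  # layer 1
]
question = [[9.0]]
decoder_input = compress_and_prepend(hidden, question, n_slots=2)
```

The decoder input now holds two slot vectors in place of the four context tokens, followed by the question embedding.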

5. Empirical Performance

ComprExIT was benchmarked on six in-domain QA datasets under aggressive compression. In all cases, ComprExIT outperformed prior context compression methods and even approached (or occasionally surpassed) the best uncompressed prompt-tuning results.

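| Backbone | ICAE (EM/F1) | 500× (EM/F1) | Beacon (EM/F1) | ComprExIT (EM/F1) | Prompt-tune (EM/F1) |
|----------|--------------|--------------|----------------|-------------------|---------------------|
| Llama-1B | 44.2/56.2 | 10.2/17.5 | 14.8/25.1 | 52.3/66.6 | 55.7/66.7 |
| Llama-3B | 53.2/65.9 | 51.9/64.5 | 29.8/39.6 | 59.0/72.9 | 62.3/73.4 |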
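
For readers unfamiliar with the two metrics in the table, the sketch below computes exact match (EM) and token-level F1 for a single prediction. It assumes the common SQuAD-style definitions (lowercasing, stripping punctuation and English articles, then comparing tokens). The source does not state which evaluation script was used, so treat this as an illustration of the metrics rather than the authors' exact procedure.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower.", "Eiffel Tower")
f1 = token_f1("Eiffel Tower in Paris", "Eiffel Tower")
```

EM rewards only answers that match after normalization, while F1 gives partial credit for overlapping tokens, which is why the F1 figures in the table are consistently higher than the EM figures.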

Ablation studies show that depth-wise aggregation is vital for performance (removing it costs 16 EM points), while globally coordinated width-wise transport contributes an additional +4.6 EM over local attention heuristics.

6. Insights, Strengths, and Limitations

  • Freezing the LLM backbone and operating exclusively on cached hidden states removes the distribution mismatch and progressive overwrite found in LLM-as-compressor approaches.
  • Depth-wise gating targets the most decodable features for each token, empirically favoring early/mid layers.
  • Global optimal transport allocation enforces a coordinated assignment of information to slots, addressing both redundancy and omission.
  • Lightweight design with linear complexity enables deployment on longer contexts and larger models.
  • Empirical robustness: Even under aggressive compression, ComprExIT either matches or exceeds uncompressed prompt-tuning on several benchmarks.

The current framework uses fixed settings for compression and the OT window, and is demonstrated only up to moderate model scales. Variable compression ratios, scaling to much longer input sequences, and extension to multimodal or retrieval-augmented settings are identified as promising directions for future work. Practical deployment is straightforward due to parameter efficiency and modularity; no fine-tuning of LLM weights is required (Ye et al., 3 Feb 2026).

7. Significance in Context Compression Research

ComprExIT defines an explicit, globally-coordinated paradigm for soft context compression in LLMs, empirically demonstrating that precise allocation and aggregation of hidden state information—decoupled from LLM self-attention—can outperform both token-selection and gist-token methods under strong compression constraints. This establishes a new benchmark for accuracy, efficiency, and architectural modularity in the design of context compression systems for LLM-based NLP pipelines (Ye et al., 3 Feb 2026).
