CompLLM: Efficient Context Compression
- CompLLM is a context compression method that segments long inputs into smaller blocks, converting tokens into Concept Embeddings for efficient processing.
- It applies a LoRA-based neural compressor per segment to reduce computational complexity from quadratic to linear, enhancing speed and cutting memory usage.
- The method enables caching for overlapping contexts while maintaining high accuracy in long-context Q&A, making it ideal for real-world retrieval-augmented tasks.
CompLLM is a context compression method designed to address the computational bottlenecks inherent in LLMs when processing long input sequences. Traditional self-attention in LLMs scales quadratically with context length, rendering direct inference on extremely long contexts computationally expensive or altogether infeasible. CompLLM overcomes this by segment-wise soft compression, yielding a linear scaling approach that preserves performance while substantially reducing both latency and memory requirements for long-context question answering.
1. Motivation and Architectural Overview
The computational challenge of long-context LLM inference derives from the self-attention operation, which has $O(n^2)$ time and memory complexity for an input of $n$ tokens. While prior soft compression schemes aim to distill the input into a more compact latent representation, they typically operate on the context as a single block. This “holistic” approach not only inherits quadratic complexity in the compression step but also precludes reusing computation when queries share overlapping context.
CompLLM divides an input context of $n$ tokens into contiguous segments of $s$ tokens each (with $s$ typically set to 20). Each segment is independently compressed into a smaller set of “Concept Embeddings” by a lightweight neural compressor module (a Low-Rank Adaptation (LoRA) extension of the base LLM plus a linear projection). This per-segment design yields linear scaling in overall computational complexity, allows compressed segments to be cached for reuse across queries, and lets models trained solely on moderately short sequences generalize to contexts of length $100$k and beyond.
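The overall flow can be sketched as follows; this is a minimal illustration, assuming a hypothetical `compressor` module and a Hugging-Face-style `llm(inputs_embeds=...)` interface, with names and shapes chosen for clarity rather than taken from the paper's released code.

```python
import torch

SEGMENT_LEN = 20  # s: tokens per segment, as described above


def compress_context(token_embs: torch.Tensor, compressor) -> torch.Tensor:
    """(n, d) context token embeddings -> (~n/2, d) concept embeddings (assuming a 2x rate)."""
    segments = token_embs.split(SEGMENT_LEN, dim=0)        # independent s-token blocks
    concept_embs = [compressor(seg) for seg in segments]   # each block compressed alone
    return torch.cat(concept_embs, dim=0)                  # same latent space as tokens


def answer(llm, embed, compressor, context_ids, question_ids):
    ctx_embs = embed(context_ids)   # (n, d) token embeddings of the long context
    q_embs = embed(question_ids)    # the question is left uncompressed here (assumption)
    inputs = torch.cat([compress_context(ctx_embs, compressor), q_embs], dim=0)
    # Concept embeddings are drop-in replacements for token embeddings, so the
    # frozen LLM consumes them directly through its embedding interface.
    return llm(inputs_embeds=inputs.unsqueeze(0))
```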
2. Technical Methodology
Given $n$ input tokens segmented into $n/s$ non-overlapping blocks of length $s$, each block is compressed via an adapted neural module on top of the base LLM. Because the compressor's self-attention operates only within a block, the computational cost per block is $O(s^2)$. Across all $n/s$ blocks, the total cost is

$$\frac{n}{s} \cdot O(s^2) = O(n \cdot s),$$

which is linear in $n$ for constant $s$.
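For concreteness, a back-of-the-envelope comparison at $n = 100$k tokens and $s = 20$ (illustrative arithmetic, not a benchmark from the paper):

$$n^2 = 10^{10} \qquad \text{vs.} \qquad n \cdot s = 2 \times 10^{6},$$

i.e., the compression step performs roughly $5{,}000\times$ fewer pairwise attention interactions than a single holistic pass over the full context.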
Compression per block involves projecting the block's $s$ Token Embeddings into a smaller number of Concept Embeddings using the LoRA-adapted compressor and a final dense layer. Importantly, these Concept Embeddings live in the same latent space as the Token Embeddings, allowing direct interchangeability and eliminating the need for further fine-tuning of the base model.
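A compact sketch of what such a compressor could look like; this is an assumption-laden stand-in, since the actual CompLLM compressor applies LoRA inside the base LLM's own layers, whereas here a single adapted linear layer and a simple pairwise pooling step play that role, with `rank`, `d_model`, and the pooling scheme chosen purely for illustration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (hand-rolled LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the base LLM weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T  # W x + B A x


class SegmentCompressor(nn.Module):
    """Maps one s-token segment of token embeddings to s // 2 concept embeddings."""

    def __init__(self, d_model: int = 4096, seg_len: int = 20):
        super().__init__()
        self.adapter = LoRALinear(nn.Linear(d_model, d_model))
        self.out_proj = nn.Linear(d_model, d_model)  # final dense layer
        self.seg_len = seg_len

    def forward(self, seg: torch.Tensor) -> torch.Tensor:  # seg: (s, d_model)
        assert seg.shape[0] == self.seg_len  # padding of a short last segment omitted
        h = self.adapter(seg)
        # Illustrative 2x downsampling: merge adjacent positions, then project back
        # into the token-embedding space so the frozen LLM can consume the output.
        h = h.reshape(self.seg_len // 2, 2, -1).mean(dim=1)
        return self.out_proj(h)
```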
The compressor is trained by distilling hidden activations from standard (uncompressed) inference. For a set of answer token indices $\mathcal{A}$ and model depth $D$, the hidden states $h_i^{(l)}$ of the teacher (full context) and $\tilde{h}_i^{(l)}$ of the student (compressed context) are compared at every layer $l$. The per-layer loss is normalized by the standard deviation of the teacher activations:

$$\mathcal{L} = \frac{1}{D} \sum_{l=1}^{D} \mathcal{L}^{(l)},$$

with

$$\mathcal{L}^{(l)} = \frac{1}{|\mathcal{A}|} \sum_{i \in \mathcal{A}} \frac{\big\lVert h_i^{(l)} - \tilde{h}_i^{(l)} \big\rVert_2}{\sigma^{(l)}}$$

and

$$\sigma^{(l)} = \operatorname{std}\!\big(\{\, h_i^{(l)} : i \in \mathcal{A} \,\}\big).$$
This objective enforces local correspondence between compressed and uncompressed representations, preserving essential information for downstream reasoning.
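A minimal PyTorch sketch of this objective, assuming the teacher and student hidden states have already been gathered at the answer-token positions; the L2 distance mirrors the reconstruction above and should be read as an assumption rather than the paper's exact formula.

```python
import torch


def distillation_loss(teacher_h, student_h, eps: float = 1e-6) -> torch.Tensor:
    """teacher_h, student_h: lists of (|A|, d) hidden states, one tensor per layer,
    restricted to the answer-token positions. Per-layer distances are normalized
    by the standard deviation of the teacher activations, then averaged."""
    per_layer = []
    for h_t, h_s in zip(teacher_h, student_h):  # one iteration per layer l = 1..D
        sigma = h_t.std() + eps                 # scale of the teacher activations
        dist = (h_t - h_s).norm(dim=-1)         # per-token L2 distance (assumption)
        per_layer.append((dist / sigma).mean())
    return torch.stack(per_layer).mean()        # average over the D layers
```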
3. Key Properties: Efficiency, Scalability, and Reusability
Efficiency: Segment-wise compression restricts the compressor's self-attention window to each $s$-length segment, yielding $O(n \cdot s)$ complexity across all segments and dramatically accelerating inference, especially during context prefill and early generation (Time To First Token, TTFT).
Scalability: Because training only ever sees short segments, models can process sequences of $100$k or more tokens at inference time, generalizing well in long-context regimes without retraining for large $n$. This generalization is enabled by the local, context-agnostic compressor, which operates on each segment independently of all others.
Reusability: As each segment is compressed independently, its Concept Embedding can be cached. In retrieval-augmented scenarios or code assistants working across overlapping contexts, this eliminates the need to re-compute segment compressions for shared sub-contexts, yielding further efficiency gains in batch and interactive workloads.
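A minimal sketch of such segment-level reuse, assuming segments are keyed by their token IDs; the cache backend and eviction policy are left open, as in the text, and all names here are illustrative.

```python
import torch


class SegmentCache:
    """Memoizes the concept embeddings of previously seen segments (illustrative)."""

    def __init__(self, compressor):
        self.compressor = compressor
        self._store: dict[tuple[int, ...], torch.Tensor] = {}

    def compress(self, seg_ids: tuple[int, ...], seg_embs: torch.Tensor) -> torch.Tensor:
        if seg_ids not in self._store:  # only unseen segments pay the compression cost
            self._store[seg_ids] = self.compressor(seg_embs).detach()
        return self._store[seg_ids]
```

In a retrieval-augmented pipeline, two queries that retrieve the same document would reuse its cached concept embeddings, so only genuinely new segments are compressed.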
4. Experimental Results and Comparative Performance
With a typical compression rate of $2\times$, CompLLM achieves the following for long contexts:
- Speedup in TTFT: Up to $4\times$ compared to uncompressed inference, owing to the reduced quadratic cost of the key-value (KV) cache prefill (see the back-of-the-envelope estimate after this list).
- KV Cache Reduction: A $50\%$ decrease, since half as many embeddings are stored per context window.
- Accuracy: Matching or exceeding the performance of the original LLM on long-context Q&A, with improvements observed for very long contexts. The compressed model maintains fidelity in answer generation thanks to the careful alignment of hidden representations during training.
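The relationship between the first two figures follows from a back-of-the-envelope argument (an illustrative estimate, not a derivation taken from the paper): at a $2\times$ compression rate the LLM prefills roughly $n/2$ embeddings instead of $n$, and since prefill attention scales quadratically,

$$\frac{\text{TTFT}_{\text{uncompressed}}}{\text{TTFT}_{\text{compressed}}} \approx \frac{n^{2}}{(n/2)^{2}} = 4,$$

while the KV cache stores one key-value pair per embedding, which yields the $50\%$ reduction directly.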
When compared with established methods such as LLMLingua-2 (a hard prompt compression baseline), CompLLM yields better or comparable accuracy at moderate context lengths, and its efficiency and caching features provide practical advantages at scale.
5. Practical Applications
The design of CompLLM enables efficient deployment in a variety of real-world LLM workloads:
- Retrieval-Augmented Generation: Pre-cached compressed representations of documents enable rapid multi-document retrieval and aggregation without repeated computation.
- Massive Contextual Search/QA: Legal document analysis, codebase exploration, and any scenario relying on extremely long source materials benefit from both the reduced computational cost and maintained context fidelity.
- Production LLM Systems: By integrating segment-wise compression and caching, existing LLM architectures can serve longer context windows on limited hardware, broadening their deployment viability.
6. Limitations and Future Directions
Current compression rates (e.g., $2\times$) reflect a trade-off between speed, memory, and information retention. Extreme compression or very small segment lengths could degrade the preservation of inter-segment dependencies, particularly when strong cross-segment context is required. While the current approach uses a LoRA-based compressor with a simple linear output, future work may examine non-linear or cross-segment-aware compressors to further improve scalability without loss. Cache management policy and integration into complex pipeline systems remain open engineering challenges for maximizing CompLLM’s practical utility.
CompLLM introduces an efficient, scalable, and reusable framework for context compression in long-context LLM Q&A (Berton et al., 23 Sep 2025). By employing segment-wise independent compression with straightforward training objectives, it enables substantial improvements in both computational efficiency and maximum context length, suggesting a clear direction for practical, production-grade deployment of LLMs on tasks that demand high-fidelity processing of long unstructured documents.