Conversation Compression Strategies

Updated 27 November 2025
  • Conversation compression strategies are techniques that reduce extended dialogue histories into compact representations while preserving key semantic information.
  • They utilize algorithms such as sink-based aggregation, segmentation, and conditional LoRA to balance computational efficiency with effective memory retention.
  • Recent approaches demonstrate significant improvements in inference speed and memory reduction while maintaining high performance in tasks like QA and context reconstruction.

Conversation compression strategies comprise a set of methodologies that reduce the length and redundancy of dialogue histories in order to enable scalable, efficient, and semantically faithful processing by LLMs. These strategies manage context growth, computational complexity, and semantic retention in multi-turn interactions, typically leveraging specialized algorithms, architectural adaptations, or prompt engineering protocols. Recent approaches emphasize memory unit granularity, lossy versus lossless semantic preservation, and adaptive mechanisms targeted for diverse downstream conversational tasks.

1. Fundamental Concepts and Formal Definitions

Conversation compression operates by transforming extended conversational transcripts into compact representations while maintaining informativeness for downstream tasks. Memory units may be structured at varying granularities: turn-level (individual utterances), session-level (full dialogues), summary-based (abstracted event summaries), or segment-level (topically coherent sub-dialogues) (Pan et al., 8 Feb 2025). Compression methodologies frequently employ lossy mechanisms that prioritize semantic preservation over exact textual fidelity, enabling LLMs to process contexts far exceeding raw token window limitations (Gilbert et al., 2023, Li et al., 25 Feb 2024).
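
As a purely illustrative sketch of how these granularities might be represented in a memory store, the following fragment defines a hypothetical `MemoryUnit` container; the class and field names are assumptions for illustration, not constructs from the cited papers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Granularity(Enum):
    TURN = "turn"          # individual utterance
    SESSION = "session"    # full dialogue
    SUMMARY = "summary"    # abstracted event summary
    SEGMENT = "segment"    # topically coherent sub-dialogue

@dataclass
class MemoryUnit:
    granularity: Granularity
    text: str                # compact textual content of the unit
    source_turns: List[int]  # indices of the original turns the unit covers

def turn_level_memory(dialogue: List[str]) -> List[MemoryUnit]:
    """Wrap each utterance as its own memory unit (turn-level granularity)."""
    return [MemoryUnit(Granularity.TURN, utt, [i]) for i, utt in enumerate(dialogue)]
```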

Semantic compression is formalized by a map $f_{\text{comp}}: \bm{m} \mapsto \tilde{\bm{m}}$ with $|\tilde{\bm{m}}| \ll |\bm{m}|$, and evaluated via metrics such as Semantic Reconstruction Effectiveness (SRE) and Exact Reconstructive Effectiveness (ERE), which trade off compression ratio, edit distance, and embedding-based semantic similarity (Gilbert et al., 2023).
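
The exact SRE/ERE formulas are defined in the cited work; the sketch below only computes the three underlying quantities they trade off (compression ratio, edit-distance similarity, and embedding-based semantic similarity), with `embed` standing in for any sentence-embedding function (an assumed interface, not one specified by the paper).

```python
from difflib import SequenceMatcher
from typing import Callable, List

def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of the original character length retained after compression."""
    return len(compressed) / max(len(original), 1)

def edit_similarity(a: str, b: str) -> float:
    """Normalized edit-distance-style similarity (1.0 means identical strings)."""
    return SequenceMatcher(None, a, b).ratio()

def semantic_similarity(a: str, b: str,
                        embed: Callable[[str], List[float]]) -> float:
    """Cosine similarity between sentence embeddings of the two texts."""
    va, vb = embed(a), embed(b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = (sum(x * x for x in va) ** 0.5) * (sum(y * y for y in vb) ** 0.5)
    return dot / norm if norm else 0.0
```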

2. Architectural and Algorithmic Strategies

Conversation Attention Sinks and Sink-Based Key/Value Aggregation

The StreamingDialogue methodology identifies special tokens (End-of-Utterance, EoU) as "conversational attention sinks" (conv-attn sinks) (Li et al., 13 Mar 2024). Each sink aggregates the preceding utterance's content within a single key/value vector, enabling cache-efficient streaming that reduces computational complexity from $O((N \cdot L)^2)$ for full token contexts to $O(N^2)$ when attending only over utterance sinks, where $N$ is the number of utterances and $L$ the average token count per utterance. Reconstruction learning mechanisms include Short-Memory Reconstruction (SMR), which forces reconstruction of an utterance from its local sink, and Long-Memory Reactivation (LMR), which enables recovery from distant sinks through masked attention.
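
The cache policy can be pictured as keeping only the sink positions plus a short recent window of raw tokens. The following single-layer PyTorch sketch shows one way to prune a key/value cache accordingly; `eou_id` and `recent_window` are illustrative parameters, and this is not the authors' implementation.

```python
import torch

def prune_kv_cache(keys: torch.Tensor, values: torch.Tensor,
                   token_ids: torch.Tensor, eou_id: int,
                   recent_window: int = 64):
    """Keep conv-attn sinks (EoU positions) plus the most recent tokens.

    keys / values: (seq_len, num_heads, head_dim) for a single layer.
    token_ids:     (seq_len,) token ids aligned with the cache entries.
    """
    keep = token_ids.eq(eou_id)            # True at every EoU (sink) position
    keep[-recent_window:] = True           # always retain the most recent span
    idx = keep.nonzero(as_tuple=True)[0]   # positions to keep, in order
    return keys[idx], values[idx]
```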

Memory Compression in Unified Dialogue Models

The COMEDY framework integrates session-level summarization, memory compression, and response generation within a single LLM (the "One-for-All" paradigm), eschewing external retrieval modules and databases (Chen et al., 19 Feb 2024). Dialogue histories are first distilled into memory units (objective session summaries), which are then compressed into concatenated fields covering historical events, the user profile, and the user-bot relationship. Training combines supervised fine-tuning and Direct Preference Optimization to align generation with the compressive memory. At inference, the compressive memory is concatenated with the current dialogue context for prediction, streamlining all steps within a single call.
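
A minimal sketch of the inference-time concatenation step, assuming the compressive memory has already been produced; the field names and prompt template below are hypothetical rather than COMEDY's actual format.

```python
from typing import Dict, List

def build_one_for_all_prompt(memory: Dict[str, str],
                             current_dialogue: List[str]) -> str:
    """Concatenate compressive memory fields with the current context
    so that a single LLM call can generate the next response."""
    memory_block = "\n".join(f"[{field}] {content}" for field, content in memory.items())
    dialogue_block = "\n".join(current_dialogue)
    return (
        "### Compressive memory\n" + memory_block + "\n\n"
        "### Current dialogue\n" + dialogue_block + "\n\n"
        "### Assistant response:"
    )

prompt = build_one_for_all_prompt(
    {
        "historical events": "User planned a spring trip to Kyoto.",
        "user profile": "Enjoys photography; vegetarian.",
        "user-bot relationship": "Friendly, informal tone.",
    },
    ["User: Any dinner ideas for my trip?"],
)
```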

Prompt Compression via LLMLingua-2 and Segment-Level Memory

The SeCom method advances compression by applying a conversation segmentation model to partition sessions into topically coherent segments, with the LLMLingua-2 compressor acting as a denoiser over these segment units (Pan et al., 8 Feb 2025). The compression objective incorporates both a brevity term $\lambda \frac{|\bm{y}|}{|\bm{x}|}$ and a fidelity term $-\log P_\theta(\bm{y} \mid \bm{x})$. Empirical results show segment-level compressed memory supports superior retrieval and QA accuracy compared to turn- or session-level alternatives.
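
A high-level sketch of the segment-then-compress construction described above; `segment` and `compress` are assumed callables standing in for the LLM-based segmentation model and the LLMLingua-2 compressor, and `lam` is an illustrative default for the brevity weight rather than a value from the paper.

```python
from typing import Callable, List

def build_segment_memory(session: List[str],
                         segment: Callable[[List[str]], List[List[str]]],
                         compress: Callable[[str], str]) -> List[str]:
    """Partition a session into topically coherent segments, then compress
    (denoise) each segment before storing it as one memory unit."""
    memory_bank = []
    for seg in segment(session):              # list of topical sub-dialogues
        seg_text = "\n".join(seg)
        memory_bank.append(compress(seg_text))
    return memory_bank

def compression_objective(neg_log_p_y_given_x: float,
                          len_y: int, len_x: int, lam: float = 0.1) -> float:
    """Fidelity term (-log P(y|x)) plus brevity term (lambda * |y| / |x|)."""
    return neg_log_p_y_given_x + lam * len_y / max(len_x, 1)
```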

Compressed Context Memory with Conditional LoRA

CCM continually compresses attention key/value pairs into compact memory by deploying a lightweight conditional LoRA adapter, activated at special COMP tokens appended to each utterance (Kim et al., 2023). This adapter applies rank-$k$ feed-forward updates only on COMP tokens, keeping the base model weights frozen. Two update schemes are used, "concat" (growing memory) and "merge" (fixed-size memory), with the approach showing near-full-context model performance using 5–100× less memory.
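
A minimal PyTorch sketch of the conditional-adapter idea, assuming a single linear projection: the frozen base weight is augmented by a rank-$k$ update applied only where a boolean mask marks COMP tokens. This illustrates the mechanism; it is not the CCM codebase.

```python
import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    """Frozen base projection plus a rank-k LoRA update gated by COMP positions."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)
        self.base.weight.requires_grad_(False)       # base model weights stay frozen
        self.lora_a = nn.Linear(d_model, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor, comp_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); comp_mask: (batch, seq), True at COMP tokens
        delta = self.lora_b(self.lora_a(x))           # rank-k feed-forward update
        gate = comp_mask.unsqueeze(-1).to(x.dtype)    # apply the update only at COMP tokens
        return self.base(x) + delta * gate
```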

Style-Conditional Compression

Style-Compress introduces adaptive compression, wherein a small compression model (LM_compr) samples compressed prompt variants under different style instructions (extractive, abstractive, format-aware), ranks them with a large evaluation model (LM_eval), and retains the best through comparative advantage metrics. Compressed prompts are selected according to task-specific needs (e.g., retrieval, QA, reasoning), and the method consistently outperforms static baselines across ROUGE, BERTScore, EM, F1, and accuracy, even at extreme compression ratios of 10–25% (Pu et al., 17 Oct 2024).
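
The sample-and-rank loop can be sketched as follows; `compress_lm(prompt, instruction)` and `eval_score(original, candidate)` are assumed interfaces for the small compression model and the larger evaluator, and the style instructions are paraphrased for illustration.

```python
from typing import Callable, Dict

STYLE_INSTRUCTIONS = {
    "extractive": "Keep only the most important original sentences.",
    "abstractive": "Rewrite the prompt as a short paraphrase.",
    "format-aware": "Preserve lists, code, and structure; drop filler text.",
}

def style_compress(prompt: str,
                   compress_lm: Callable[[str, str], str],
                   eval_score: Callable[[str, str], float]) -> str:
    """Sample one compressed candidate per style instruction, score each with
    the evaluator, and keep the highest-scoring candidate."""
    candidates: Dict[str, str] = {
        style: compress_lm(prompt, instruction)
        for style, instruction in STYLE_INSTRUCTIONS.items()
    }
    return max(candidates.values(), key=lambda cand: eval_score(prompt, cand))
```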

3. Training and Inference Pipelines

Several pipelines have been proposed to operationalize compression:

  • StreamingDialogue: Pre-train using SMR and LMR, then fine-tune with streaming attention masks. At inference, maintain a cache of conv-attn sinks; generate each turn using only the sinks plus the most recent utterances (Li et al., 13 Mar 2024).
  • COMEDY: Jointly train memory extraction, compression, and response generation via annotated datasets. Inference involves compressing session summaries into $\tilde{M}$ and passing it to the LLM for utterance generation (Chen et al., 19 Feb 2024).
  • SeCom: Apply LLM-based segmentation for topic boundaries; compress each segment via LLMLingua-2; populate a segment-level memory bank for retrieval-augmented generation (Pan et al., 8 Feb 2025); a retrieval sketch follows this list.
  • CCM: During training, employ masked attention and single-pass LoRA updates; at inference, recursively compress each turn into COMP features and update memory for next input (Kim et al., 2023).
  • Style-Compress: Adapt small demo pool via style variation and in-context learning; select best variants for use in compressing incoming prompts (Pu et al., 17 Oct 2024).
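
As referenced in the SeCom item above, a retrieval step selects which compressed segment units enter the prompt at generation time. The sketch below uses plain word-overlap scoring as a stand-in for the BM25 or dense retrievers used in the cited work.

```python
from typing import List

def retrieve_segments(query: str, memory_bank: List[str], top_k: int = 3) -> List[str]:
    """Rank compressed segment units against the current query and return the top-k."""
    q_tokens = set(query.lower().split())
    return sorted(
        memory_bank,
        key=lambda seg: len(q_tokens & set(seg.lower().split())),
        reverse=True,
    )[:top_k]
```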

4. Quantitative Evaluation and Comparative Results

Compression methodologies are evaluated on perplexity, BLEU, ROUGE, BERTScore, accuracy, retrieval recall, and human judgment metrics:

| Approach | Compression Ratio | Metric | Result (Sample) |
|---|---|---|---|
| StreamingDialogue | n/a | PPL, BLEU, ROUGE | PPL = 7.99, BLEU = 19.33, 4× speedup, 18× memory reduction (Li et al., 13 Mar 2024) |
| SeCom (BM25) | ~0.5–0.75 | GPT4Score | 71.57 vs. 65.58 for turn-level memory (Pan et al., 8 Feb 2025) |
| CCM-concat | 0.16 (memory) | Accuracy | 70.0% vs. 70.8% with full context (Kim et al., 2023) |
| COMEDY-13B DPO | n/a | Human Coherence | +0.2 over best retriever; Top-1 in 29.9% of cases (Chen et al., 19 Feb 2024) |
| Style-Compress | 0.1–0.5 | ROUGE-L/BERTScore | 0.876/0.231 at r = 0.1, often matching the full prompt (Pu et al., 17 Oct 2024) |
| Gist-COCO | 0.01–0.05 | QA Accuracy | +20% over prior baselines (Li et al., 25 Feb 2024) |

Results indicate that compression ratios of 5×–100×—often with a small adapter or learned module—sustain near-full-context performance at dramatically reduced computational cost. Some evidence points to improved downstream results after denoising and style adaptation.

5. Constraints, Limitations, and Extension Opportunities

Compression approaches invariably involve trade-offs:

  • Lossy compression may introduce semantic drift or occasional misunderstanding, compromising verbatim fidelity—lossless variants or guided prompts are recommended for regulatory scenarios (Gilbert et al., 2023).
  • Memory update policies (e.g., CCM-merge) may oversmooth heterogeneous dialogues; concatenation better preserves fine-grained history but at higher cost (Kim et al., 2023).
  • Selective sink or memory unit dropping can further reduce overhead, but risks omitting infrequent yet critical historical detail (Li et al., 13 Mar 2024, Chen et al., 19 Feb 2024).
  • Style-conditional and denoising compression mechanisms require continuous adaptation and monitoring for performance drift across tasks (Pu et al., 17 Oct 2024).
  • Ethical considerations involve annotating and filtering chat data to mitigate bias and privacy risks; auditing is warranted during deployment (Chen et al., 19 Feb 2024).

Extension avenues include hierarchical or adaptive compression rates, dynamic weighting in memory merging, integration with retrieval modules for ultra-long contexts, and multi-modal memory compression for agents handling mixed text/image input.

6. Position in the Research Landscape

Conversation compression is situated at the intersection of efficient transformer architectures (Longformer, BigBird), memory augmentation (Transformer-XL, MemGPT), retrieval-augmented generation, and prompt engineering. Recent innovations achieve a practical middle ground—streamlined attention spans, high semantic retention, and minimal architectural disruption—by leveraging natural dialogue structure (EoU sinks), plug-in compression modules (LoRA, gist-encoders), and adaptive style or segment-based strategies (Li et al., 13 Mar 2024, Li et al., 25 Feb 2024, Pu et al., 17 Oct 2024). These methods generalize beyond mechanical token reduction, enabling LLMs to sustain coherent, long-term reasoning and personalization with scalable context windows previously unattainable on standard hardware.
