
Prompt Compression for LLMs

Updated 25 February 2026
  • Prompt compression is a suite of techniques that reduce long, information-rich prompts into compact forms while retaining task-critical content.
  • It uses methods like token pruning, coarse-to-fine scoring, and soft embedding compression to balance compression rate and output fidelity.
  • Applications include cost-efficient multi-document QA and retrieval, demonstrating significant runtime, storage, and performance benefits.

Prompt compression for LLMs comprises a suite of algorithmic strategies that transform long, information-rich prompts into significantly shorter representations, while striving to preserve essential task-relevant information and maximize downstream task accuracy. Rising context lengths in applications such as retrieval-augmented generation, multi-document QA, and complex reasoning exacerbate computational costs—transformer memory and runtime scale quadratically with input length—thus motivating prompt compression as a core area of research (Li et al., 2024, Zhang et al., 24 Apr 2025). Approaches span token- and sentence-level pruning, context- and query-aware scoring, coarse-to-fine procedures, soft continuous embedding compression, and denoising-inspired iterative schemes, often balancing trade-offs among compression rate, information retention, output fidelity, and system-level efficiency.

1. Formal Foundations and Objectives

Prompt compression is formally defined as a map $x \mapsto x'$, where $x = (x_1, \ldots, x_N)$ is the full prompt and $x' = (x'_1, \ldots, x'_M)$, with $M \ll N$, is the compressed prompt. The objective is to minimize prompt length, measured in tokens (compression ratio $r = N/M$), while ensuring that the downstream LLM loss $L_{\mathrm{LM}}(x'; \theta)$ does not increase beyond an acceptable threshold:

$$\min_{x'} \; \Bigl\{ L_{\mathrm{LM}}(x'; \theta) \;+\; \lambda \cdot |x'| \Bigr\}$$

where $\lambda$ trades off fidelity against brevity (Jiang et al., 2023, Łajewska et al., 24 Mar 2025, Li et al., 2024). Modern frameworks often cast this as an information-theoretic rate–distortion problem, with the achievable lower bound on distortion at any fixed compression rate $R$ characterized via linear programming (Nagle et al., 2024).
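The trade-off can be made concrete with a toy sketch; `lm_loss` below is a hypothetical stand-in for the downstream model's loss, and the token lists are invented:

```python
# Sketch of the prompt-compression objective: trade off a (hypothetical)
# downstream loss against compressed length, as in min L_LM(x') + lambda*|x'|.

def compression_ratio(original_tokens, compressed_tokens):
    """r = N / M, the factor by which the prompt shrank."""
    return len(original_tokens) / len(compressed_tokens)

def objective(lm_loss, compressed_tokens, lam=0.01):
    """Penalized objective: downstream loss plus a length penalty."""
    return lm_loss + lam * len(compressed_tokens)

original = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
compressed = ["quick", "fox", "jumps", "lazy", "dog"]  # pruned prompt

r = compression_ratio(original, compressed)            # 9 / 5 = 1.8
score = objective(2.0, compressed, lam=0.01)           # 2.0 + 0.01 * 5 = 2.05
```

In practice `lm_loss` would be measured by running the target LLM on the compressed prompt; the length penalty is what pushes the optimizer toward shorter prompts.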

2. Taxonomy and Methodological Landscape

Current methods divide into two principal categories (Li et al., 2024, Zhang et al., 24 Apr 2025):

  • Hard prompt compression: Truncates, prunes, or paraphrases tokens from the input, retaining only those deemed most informative. Representative algorithms include Selective-Context (token-level self-information scoring), LLMLingua/LongLLMLingua (coarse-to-fine pruning with question-aware redistribution and iterative token compression), RL-based token deletion (SCRL, KiS, DCP), and graph-based subgraph extraction (Prompt-SAW) (Jiang et al., 2023, Jiang et al., 2023, Hu et al., 15 Apr 2025, Ali et al., 2024).
  • Soft prompt compression: Encodes the prompt into a learned continuous embedding or set of synthetic soft tokens (GIST, xRAG, 500xCompressor), relying on frozen or lightly fine-tuned decoders. These methods can enable up to 480× compression but may present significant information bottlenecks and require nontrivial architectural modifications (Li et al., 2024, Łajewska et al., 24 Mar 2025).

Hybrid strategies (e.g., CompactPrompt) combine hard token pruning, phrase grouping, and data- or quantization-compressed attachments, supporting broad workflow integration and moderate to high compression with low accuracy drift (Choi et al., 20 Oct 2025).
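A minimal sketch of hard, token-level pruning in the spirit of Selective-Context, assuming a toy unigram probability table in place of a real language model (all probabilities here are invented for illustration):

```python
import math

# Toy unigram "language model": hypothetical token probabilities.
UNIGRAM_P = {"the": 0.30, "a": 0.20, "of": 0.15, "is": 0.10,
             "property": 0.01, "quantum": 0.001, "entanglement": 0.001}

def self_information(token):
    """I(t) = -log2 p(t); rarer tokens carry more information."""
    p = UNIGRAM_P.get(token, 0.005)  # back-off probability for unseen tokens
    return -math.log2(p)

def prune(tokens, keep_fraction=0.5):
    """Keep the highest-information tokens, preserving original order."""
    k = max(1, int(len(tokens) * keep_fraction))
    ranked = sorted(tokens, key=self_information, reverse=True)
    keep = set(ranked[:k])
    return [t for t in tokens if t in keep]

prompt = ["the", "quantum", "entanglement", "is", "a", "property"]
compressed = prune(prompt, keep_fraction=0.5)
# High-information content words survive; function words are dropped.
```

A real implementation would score tokens with a causal LM's per-token log-probabilities; the rank-and-filter structure is the same.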

Category | Key Examples | Characteristic Compression Ratios | Main Tradeoff
Hard, discrete | LLMLingua, DCP, Prompt-SAW | 2×–20× (lossless below ~20×) | Simpler pipelines and interpretability, but limited extreme compression
Soft, continuous | GIST, 500xCompressor | 10×–480× (lossy degradation at the high end) | Extreme compression, but less explainable and requires model modification
Hybrid/pipeline | CompactPrompt, LoPace | 2×–20× (lossless or near-lossless) | Storage/runtime savings; combines strengths of both

3. Key Algorithms and Architectural Innovations

Coarse-to-Fine and Multi-Signal Scoring

Hard token pruning methods have evolved from naive self-information thresholds (Selective-Context; Li et al., 2024) to multi-stage architectures exemplified by LLMLingua and LongLLMLingua (Jiang et al., 2023, Jiang et al., 2023). The pipeline typically includes:

  • Budget controllers that allocate per-component token quotas (e.g., instruction/demonstration/question) based on dynamic or static priorities.
  • Coarse-grained (document/sentence) filtering, ranking using question-conditioned perplexity, semantic similarity, or retrieval models.
  • Fine-grained (token-level) scoring, leveraging iterative contrastive perplexity drops, attention attribution, loss difference between reference and base models (DSPC), or task-conditional marginalization (Jiang et al., 2023, Gao et al., 17 Sep 2025).
  • Dynamic allocation and reordering, adjusting compression ratios by local information density and mitigating position bias ("lost in the middle") via statistical or attention-based reordering (Jiang et al., 2023, Tang et al., 2024).
  • Iterative or denoising-inspired schemes, wherein aggressive ratios (≥10×) are achieved by multi-step token removal informed by progressive salience recalculation (JPPO++, DCP) (You et al., 2024, Hu et al., 15 Apr 2025).
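The coarse-to-fine flow above can be sketched end to end; the relevance score here is a toy word-overlap proxy standing in for question-conditioned perplexity, and the stopword list is a crude stand-in for fine-grained token scoring:

```python
# Coarse stage: allocate a token budget and keep the most query-relevant
# sentences. Fine stage: prune low-information tokens inside survivors.

def relevance(sentence, query):
    """Coarse score: fraction of query words appearing in the sentence."""
    s, q = set(sentence.lower().split()), set(query.lower().split())
    return len(s & q) / max(1, len(q))

def coarse_filter(sentences, query, budget):
    """Keep highest-relevance sentences until the token budget is spent."""
    kept, used = [], 0
    for sent in sorted(sentences, key=lambda s: relevance(s, query), reverse=True):
        n = len(sent.split())
        if used + n <= budget:
            kept.append(sent)
            used += n
    # Restore original order to avoid scrambling the context.
    return [s for s in sentences if s in kept]

def fine_prune(sentence, stopwords=("the", "a", "of", "is", "to")):
    """Fine stage: drop low-information tokens (stopwords as a toy proxy)."""
    return " ".join(w for w in sentence.split() if w.lower() not in stopwords)

docs = ["The capital of France is Paris",
        "Bananas are a popular fruit",
        "Paris hosts the Louvre museum"]
query = "What is the capital of France"
compressed = [fine_prune(s) for s in coarse_filter(docs, query, budget=12)]
```

Production systems replace both scoring functions with model-based signals (perplexity drops, attention attribution), but the budget-then-rank-then-prune structure carries over.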

Reinforcement Learning and Distribution Alignment

RL-based frameworks formulate pruning as a sequential decision process, maximizing reward functions that balance compression, key-content retention, and divergence from the full-context LLM output. Distribution alignment via instruction-tuned small models calibrates surrogate token scoring to match black-box LLM behavior, improving retention (Hu et al., 15 Apr 2025, Jiang et al., 2023).
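A toy reward of this shape might look as follows; the weights and the divergence term are illustrative assumptions, not values from any cited paper:

```python
# Toy RL reward for pruning: reward compression and retention of key content,
# penalize divergence from the full-context output. Weights are invented.

def reward(original_tokens, kept_tokens, key_tokens, divergence,
           w_compress=1.0, w_keep=2.0, w_div=3.0):
    compression = 1.0 - len(kept_tokens) / len(original_tokens)
    retention = len(set(kept_tokens) & set(key_tokens)) / max(1, len(key_tokens))
    return w_compress * compression + w_keep * retention - w_div * divergence

orig = ["answer", "the", "question", "about", "Paris", "landmarks"]
kept = ["answer", "question", "Paris", "landmarks"]
# divergence would come from comparing compressed- vs. full-context outputs.
r = reward(orig, kept, key_tokens=["Paris", "landmarks"], divergence=0.1)
```

An RL policy would emit keep/drop actions per token and be trained against this scalar; the key design choice is how heavily divergence is penalized relative to brevity.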

Graph-Based and Sentence-Level Methods

Relation-aware frameworks construct semantic graphs (nodes as entities, edges as relations) and extract coverage-maximizing subgraphs under token budgets to preserve semantic fidelity and readability (Ali et al., 2024). Sentence-level compressors train context-aware encoders under contrastive objectives, greedily selecting top-scoring sentences relevant to downstream queries, improving inference speed and coherence at moderate compression (Liskavets et al., 2024).
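The subgraph-extraction idea reduces, in sketch form, to greedy budgeted coverage; the text units and entity sets below are invented examples:

```python
# Greedy coverage-maximizing selection in the spirit of relation-aware
# compression: each unit (e.g. a triple rendered as text) covers a set of
# entities; greedily pick units adding the most new entities per token
# until the budget is exhausted.

def greedy_cover(units, budget):
    """units: list of (text, entity_set). Returns selected texts."""
    covered, selected, used = set(), [], 0
    remaining = list(units)
    while remaining:
        # Best marginal gain: new entities per token spent.
        best = max(remaining, key=lambda u: len(u[1] - covered) / len(u[0].split()))
        text, ents = best
        cost = len(text.split())
        remaining.remove(best)
        if not (ents - covered) or used + cost > budget:
            continue  # nothing new, or over budget: skip this unit
        selected.append(text)
        covered |= ents
        used += cost
    return selected

units = [
    ("Marie Curie won two Nobel Prizes", {"Marie Curie", "Nobel Prize"}),
    ("Curie was born in Warsaw", {"Marie Curie", "Warsaw"}),
    ("Warsaw is the capital of Poland", {"Warsaw", "Poland"}),
]
picked = greedy_cover(units, budget=12)
```

Real systems build the units from extracted relation triples and may re-linearize the selected subgraph into fluent text afterwards.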

Soft/Embedding Compression

Autoencoder-based, LoRA-augmented, and prefix-tuning frameworks map entire prompt sequences to a small set of learned vectors or K/V cache values, maintaining end-to-end task fidelity up to compression rates of 480× in specialized settings (Li et al., 2024). Techniques such as Two-Step PT+FT (sentence to multi-sentence chunk pretraining) significantly boost factual and entity preservation compared to baseline soft-prompting (Łajewska et al., 24 Mar 2025).
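Conceptually, soft compression replaces $N$ token embeddings with $K$ learned vectors. The sketch below fakes the learned mapping with chunked mean-pooling, purely to show the N×d to K×d shape change; real systems (GIST, 500xCompressor) train this mapping end to end:

```python
# Collapse a sequence of token embeddings into K "soft tokens" by chunked
# mean-pooling. This is NOT how trained compressors work internally; it only
# illustrates the interface: many vectors in, few vectors out, same dim.

def compress_soft(embs, k):
    """Collapse len(embs) vectors into k soft vectors by chunked mean-pooling."""
    size = -(-len(embs) // k)  # ceil(N / k) tokens per chunk
    chunks = [embs[i:i + size] for i in range(0, len(embs), size)]
    return [[sum(col) / len(col) for col in zip(*chunk)] for chunk in chunks]

# 8 toy 2-d "token embeddings" -> 2 soft tokens (4x compression)
token_embs = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0],
              [2.0, 2.0], [2.0, 2.0], [4.0, 0.0], [0.0, 4.0]]
soft_prompt = compress_soft(token_embs, k=2)  # [[1.0, 2.0], [2.0, 2.0]]
```

A trained compressor would feed `soft_prompt` to the frozen decoder as prefix embeddings or K/V cache entries in place of the original token sequence.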

4. Empirical Results, Tradeoffs, and Limitations

Experimental benchmarks consistently demonstrate that (1) moderate compression (2×–6×) typically preserves or even enhances LLM performance on long-context tasks due to increased information density and reduced noise (e.g., NaturalQuestions: 64.1%→75.0% accuracy with 4× compression, 94% cost reduction) (Jiang et al., 2023, Li et al., 2024, Zhang et al., 24 Apr 2025), while (2) extremely aggressive ratios (>10×) incur pronounced information loss and sometimes increase hallucination rates (Li et al., 2024, Zhang et al., 24 Apr 2025).

Ablation studies highlight that question-/task-awareness (Q-aware scoring, retriever-guided selection, or MDP reward shaping) yields marked gains: disabling Q-aware steps in LongLLMLingua or Perception Compressor drops QA accuracy by 35–45 percentage points (Jiang et al., 2023, Tang et al., 2024). Dynamic thresholding per input and status-conditional pruning approach the information-theoretically achievable distortion–rate curves (LLMLingua-2.Dynamic, AdaptiveQuerySelect) (Nagle et al., 2024).
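The cost side of these results follows directly from token counts. A back-of-the-envelope sketch with hypothetical per-1k-token prices (input-token cost shrinks with the compression ratio; output cost is unchanged):

```python
# Back-of-the-envelope API cost under prompt compression. The per-1k-token
# prices are invented illustration values, not any real provider's rates.

def request_cost(input_tokens, output_tokens,
                 in_price_per_1k=0.01, out_price_per_1k=0.03):
    return (input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k)

full = request_cost(20_000, 500)       # long multi-document prompt
comp = request_cost(20_000 // 4, 500)  # same prompt at 4x compression
savings = 1 - comp / full              # ~70% on this toy price schedule
```

Reported savings above 90% additionally reflect caching, batching, and reduced retries; compression of input tokens alone bounds the saving by the input share of total cost.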

Notable limitations include:

  • Aggressive compression (>10×) incurs pronounced information loss and can raise hallucination rates, especially for entity-dense prompts.
  • A measurable gap persists between practical compressors and the information-theoretic distortion–rate optimum.
  • Soft-prompt methods require architectural modification or fine-tuning and offer limited interpretability.
  • Robustness to domain shift, multimodal inputs, and varying formats remains largely untested.

5. Practical Deployment and System Integration

Prompt compression can be integrated in both online and offline LLM workflows:

  • Pre-inference compression: Run lightweight, question/condition-aware pruning on local or edge devices, then transmit compressed prompts to cloud LLMs (crucial for mobile or limited-bandwidth settings) (You et al., 2024).
  • Agent pipeline integration: Unified pipelines such as CompactPrompt merge prompt pruning with abbreviation and quantization of data attachments, yielding up to 60% token/cost savings with marginal accuracy drift (<5%) (Choi et al., 20 Oct 2025).
  • Storage optimization: Lossless compressors (LoPace) combine BPE tokenization, binary packing, and entropy coding to enable 72.2% space savings in real-world prompt databases with sub-millisecond decompression latency (Ulla, 4 Feb 2026).
  • Toolkit and benchmarking: Tools such as PCToolkit offer plug-and-play benchmarking and deployment of major compressors, promoting reproducibility and rapid experimentation (Li et al., 2024).
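For the storage setting, any lossless codec illustrates the contract; the sketch below uses zlib's DEFLATE as a stand-in for LoPace's BPE-plus-entropy-coding pipeline. The round trip must be byte-exact, so savings come purely from redundancy in the stored prompts:

```python
import zlib

# Lossless storage compression for a prompt database: exact round trip,
# space savings proportional to redundancy across stored prompts.

def store(prompt: str) -> bytes:
    return zlib.compress(prompt.encode("utf-8"), level=9)

def load(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")

prompt = ("You are a helpful assistant. Answer the question using the "
          "context below.\n" + "Context: Paris is the capital of France. " * 40)
blob = store(prompt)
restored = load(blob)                  # must equal the original exactly
saving = 1 - len(blob) / len(prompt.encode("utf-8"))
```

A prompt-specific codec like LoPace improves on a general-purpose one by tokenizing with the LLM's own BPE vocabulary before entropy coding, which is where the reported 72.2% savings come from.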

Method | Compression Ratio | Performance Characteristics
LongLLMLingua | 2×–6× | +10pp QA accuracy, cost ↓94%
JPPO++/DSPC | 2×–16× | Service time ↓46%, fidelity tradeoff
500xCompressor | 6×–480× | 62–73% capability retention (QA tasks)
CompactPrompt/LoPace (lossless) | 2×–5× | 50–72% storage/inference reduction
Context-aware sentence methods | 3×–5× | 10× speedup, improved long-context QA

6. Evaluation Metrics and Holistic Assessment

Assessment of prompt compression methods must consider not only raw compression ratio, but also a holistic set of metrics (Łajewska et al., 24 Mar 2025, Li et al., 2024):

  • Downstream task performance: Exact match, F1, BERTScore, pass@1 (code) as applicable.
  • Grounding: Alignment of LLM outputs with the original context (e.g., BERTScore-F1 between output and input).
  • Information preservation: Entity retention scores, n-gram overlap, faithfulness measures (reconstruction fidelity for soft prompts).
  • Latency and cost: End-to-end inference time, API cost per sample, real-time system throughput.
  • Hallucination rates: Prevalence of information loss/semantic drift.
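Two of the information-preservation metrics above, in toy form (the entity list and strings are invented; real evaluations would use NER and proper tokenization):

```python
# Toy implementations of entity retention (fraction of gold entities that
# survive compression) and n-gram overlap between original and compressed
# prompts, two of the information-preservation metrics listed above.

def entity_retention(compressed: str, entities: list[str]) -> float:
    return sum(e in compressed for e in entities) / max(1, len(entities))

def ngram_overlap(original: str, compressed: str, n: int = 2) -> float:
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    orig, comp = ngrams(original), ngrams(compressed)
    return len(orig & comp) / max(1, len(orig))

original = "Marie Curie won the Nobel Prize in Physics in 1903"
compressed = "Marie Curie won Nobel Prize Physics 1903"
er = entity_retention(compressed, ["Marie Curie", "Nobel Prize", "1903"])
ov = ngram_overlap(original, compressed, n=2)
```

Note how the two metrics can diverge: all gold entities survive here while bigram overlap drops, which is exactly why a holistic metric suite is needed.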

Adoption of these metrics enables rigorous comparison across heterogeneous approaches, tasks, and operational constraints.

7. Outlook, Open Problems, and Best Practices

Despite substantial advances, the field faces persistent challenges:

  • Attaining rate-distortion limits: There remains a significant, empirically demonstrable gap between heuristic/practical compressors and the theoretical optimum; query-aware, variable-rate methods partially close this (Nagle et al., 2024).
  • Extreme compression: Only specialized soft/KV-based or multi-step denoising schemes achieve >10× ratios with usable fidelity; further research into hybrid and modular strategies is warranted (Li et al., 2024, You et al., 2024).
  • Generalization and multimodal prompts: Robustness to domain shift, mixed media, and varying format inputs is open (Liskavets et al., 2024, Liang et al., 2024).
  • Unified/composable toolkits: Modular frameworks that combine retrieval, pruning, paraphrasing, and soft/embedding compression remain limited (Choi et al., 20 Oct 2025, Li et al., 2024).
  • Error/budget control: Methods such as Cmprsr’s GRPO ensure strict adherence to compression budgets, a critical feature in production (Zakazov et al., 15 Nov 2025).

Best practices include leveraging query-aware, multi-signal scoring, carefully controlling compression thresholds via feedback/validation curves, and deploying lossless methods in storage and conversation-history settings where information loss is intolerable (Jiang et al., 2023, Ulla, 4 Feb 2026). Monitoring for hallucination and accuracy drift under compression is essential for reliable system behavior (Zhang et al., 24 Apr 2025, Łajewska et al., 24 Mar 2025).


In summary, prompt compression for LLMs constitutes a multi-faceted research domain characterized by the design and evaluation of algorithms that map information-rich, lengthy prompts into compact representations with minimal loss of task utility, achieving substantial computational and economic savings. Progress continues to be driven by advances in query-aware dynamic compression, integration of multi-level and multi-modal signals, and principled evaluation against information-theoretic bounds (Jiang et al., 2023, Li et al., 2024, Łajewska et al., 24 Mar 2025, Nagle et al., 2024, Gao et al., 17 Sep 2025, Zakazov et al., 15 Nov 2025, Choi et al., 20 Oct 2025, Ulla, 4 Feb 2026).

