LLMLingua-2: Efficient Prompt Compression for LLMs

Updated 16 December 2025
  • LLMLingua-2 is an extractive prompt compression method that uses GPT-4 distilled supervision to achieve high compression ratios without sacrificing downstream task performance.
  • It leverages a bidirectional Transformer encoder to classify tokens, ensuring only crucial content is retained and preserving the original prompt's integrity.
  • Empirical results demonstrate 2×–5× compression and significant inference speedups while maintaining accuracy across tasks such as QA, summarization, and mathematical reasoning.

LLMLingua-2 is a method for efficient, faithful, and highly generalizable task-agnostic prompt compression, specifically designed for LLM inference under context window and latency constraints. LLMLingua-2 formulates compression as an extractive token classification problem, moving beyond information-entropy-based trimming and leveraging a distilled supervision signal from GPT-4. The approach ensures that every retained token comes from the original prompt's content, provides substantial compression ratios (2×–5×), and delivers faster compression and end-to-end inference than LLaMA- or GPT-based compressors, while preserving performance across a wide range of downstream tasks and domains (Pan et al., 19 Mar 2024).

1. Problem Formulation and Motivation

LLMLingua-2 targets the task-agnostic prompt compression problem: given an input prompt $x$ of $N$ tokens, the goal is to learn a function $C(x, \tau)$ that selects a subset $\tilde{x}$ of $\tilde{N} \approx \tau N$ tokens ($0 < \tau < 1$) without access to downstream task labels. Prior art typically removes tokens with low information entropy $H(x_i)$, as computed from a small causal LM such as LLaMA-7B, but this introduces two critical limitations:

  • Limited Context Utilization: Unidirectional context ignores dependencies and information only captured by bidirectional models.
  • Objective Misalignment: Entropy-based heuristics are not optimized for faithfulness to downstream task utility, so preservation of high-entropy tokens does not guarantee critical content remains.

The objectives of LLMLingua-2 are threefold:

  • Faithfulness: Preserve all essential content without introducing new/hallucinated tokens.
  • Efficiency: Replace heavyweight compressive LMs with a compact Transformer encoder (e.g., XLM-RoBERTa-large or mBERT).
  • Generalization: Train on a task-agnostic extractive distillation dataset and demonstrate consistent transfer across multiple LLM architectures (e.g., GPT-3.5, Mistral-7B) and tasks (summarization, QA, mathematical reasoning, etc.).
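
As a concrete reading of this formulation, a task-agnostic compressor reduces to: score every token, keep roughly the $\tau N$ highest-scoring tokens, and emit them in their original order. The following minimal sketch illustrates that interface only; the function name, whitespace tokenization, and toy scores are assumptions for illustration, not the paper's implementation.

```python
from typing import List

def compress(tokens: List[str], tau: float, keep_scores: List[float]) -> List[str]:
    """Task-agnostic extractive compression: keep the ~tau*N highest-scoring
    tokens and return them in their original order (a subsequence of the input)."""
    n_keep = max(1, int(round(tau * len(tokens))))
    # indices of the n_keep highest-scoring tokens
    top = sorted(range(len(tokens)), key=lambda i: keep_scores[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top)]  # restore original order

# toy usage: scores would come from an entropy heuristic or a learned classifier
tokens = "the meeting was adjourned at five pm after the budget vote".split()
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.9, 0.9, 0.2, 0.1, 0.95, 0.9]
print(compress(tokens, tau=0.5, keep_scores=scores))
```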

2. Data Distillation and Supervision Signal

LLMLingua-2 introduces a novel supervised dataset distillation procedure:

  • Extractive Data Construction: MeetingBank transcripts ($5{,}169$ transcripts split into $41{,}746$ chunks of $\leq 512$ tokens) are compressed using GPT-4 with controlled instructions permitting only token removal (strictly prohibiting reordering, rewriting, or hallucination) while maximizing compression and retaining all "crucial information."
  • Alignment and Annotation: The compressed and original texts are aligned via fuzzy matching and a sliding-window search, yielding per-token binary labels $y_i \in \{0, 1\}$, where $y_i = 1$ indicates "preserve."
  • Optimization Objective: Supervised learning minimizes a cross-entropy token classification loss:

$$\mathcal{L}_{\mathrm{cls}}(x, y) = -\sum_{i=1}^{N} \bigl[\, y_i \log p(y_i = 1 \mid x) + (1 - y_i) \log p(y_i = 0 \mid x) \,\bigr]$$

  • Distillation Quality Control:

    • Variation Rate (VR): Fraction of compressed words absent from the original,

      $$\mathrm{VR} = \frac{1}{|\mathcal{S}_{\mathrm{comp}}|} \sum_{w \in \mathcal{S}_{\mathrm{comp}}} \mathbf{1}\bigl[w \notin \mathcal{S}_{\mathrm{ori}}\bigr]$$

    • Alignment Gap (AG): Difference between the "hitting rate" (HR) and the "matching rate" (MR),

      $$\mathrm{MR} = \frac{1}{|\mathcal{S}_{\mathrm{ori}}|} \sum_{w \in \mathcal{S}_{\mathrm{ori}}} \mathbf{1}\bigl[l(w) = 1\bigr], \qquad \mathrm{HR} = \frac{1}{|\mathcal{S}_{\mathrm{comp}}|} \sum_{w \in \mathcal{S}_{\mathrm{comp}}} \mathbf{1}\bigl[w \in \mathcal{S}_{\mathrm{ori}}\bigr], \qquad \mathrm{AG} = \mathrm{HR} - \mathrm{MR}$$

    • Filtering: The top $5\%$ of samples by VR and the top $10\%$ by AG are removed to ensure extraction "faithfulness" (the sketch below shows how these statistics are computed).
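
For concreteness, the quality-control statistics above can be computed per chunk as in the following minimal sketch; the whitespace word splitting and function name are assumptions for illustration, not the released pipeline.

```python
from typing import Dict, List

def distillation_quality(original: str, compressed: str, labels: List[int]) -> Dict[str, float]:
    """Compute Variation Rate, Matching Rate, Hitting Rate, and Alignment Gap
    for one chunk, using simple whitespace word sets."""
    s_ori = original.split()
    s_comp = compressed.split()
    ori_set = set(s_ori)
    # VR: fraction of compressed words not found in the original (hallucination check)
    vr = sum(w not in ori_set for w in s_comp) / len(s_comp)
    # MR: fraction of original words labeled "preserve" (l(w) = 1)
    mr = sum(labels) / len(s_ori)
    # HR: fraction of compressed words that do appear in the original
    hr = sum(w in ori_set for w in s_comp) / len(s_comp)
    return {"VR": vr, "MR": mr, "HR": hr, "AG": hr - mr}

# toy usage: labels align with the original words (1 = preserve)
orig = "the council approved the annual budget after a long debate"
comp = "council approved annual budget"
labels = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(distillation_quality(orig, comp, labels))
```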

This distillation procedure enforces the extractive, faithful, and redundancy-minimizing nature of the compression (Pan et al., 19 Mar 2024).

3. Model Architecture and Compression Workflow

Prompt compression is formulated as a per-token binary classification:

  • Backbone: A bidirectional Transformer encoder $f_\theta$ (XLM-RoBERTa-large for the default model, mBERT for the smaller variant) produces a contextual embedding $h_i$ for each word in $x$.
  • Prediction: For each token, compute:

$$p_i = \mathrm{softmax}(W h_i + b) \in \mathbb{R}^2$$

where $p_i[1] = P(\text{preserve} \mid x_i)$.

  • Extraction: Select the top $\tilde{N} = \tau N$ tokens by $p_i[1]$, maintaining their original order:

$$\tilde{x} = \mathrm{sort}_{\mathrm{orig\ order}}\bigl\{\, x_i : p_i[1] \text{ in top } \tilde{N} \,\bigr\}$$

This approach guarantees output sequences are always subsequences of the source (extractive), ensuring faithfulness and preventing reordering or hallucination.
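
A minimal sketch of this workflow using the Hugging Face transformers library is shown below. It assumes a token-classification head over XLM-RoBERTa that has already been fine-tuned on the distilled labels (loading the plain `xlm-roberta-large` checkpoint, as here, leaves the head randomly initialized); the released implementation additionally operates at the word level, handles chunking, and can force retention of specific tokens, none of which is reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Base checkpoint used only to illustrate the pattern; a real run should load
# a head fine-tuned on the distilled "preserve"/"drop" labels.
CKPT = "xlm-roberta-large"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForTokenClassification.from_pretrained(CKPT, num_labels=2)
model.eval()

def compress(prompt: str, tau: float = 0.33) -> str:
    """Score every token with p(preserve | x), keep the top tau*N tokens,
    and return them in their original order (an extractive subsequence)."""
    enc = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**enc).logits[0]              # (seq_len, 2)
    p_preserve = logits.softmax(dim=-1)[:, 1]        # p_i[1] for each token
    ids = enc["input_ids"][0]
    n_keep = max(1, int(tau * ids.shape[0]))
    keep = torch.topk(p_preserve, n_keep).indices.sort().values  # restore original order
    return tokenizer.decode(ids[keep], skip_special_tokens=True)

print(compress("Item 7: the committee voted 6 to 1 to approve the downtown transit budget.", tau=0.4))
```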

4. Training Regimen and Implementation

LLMLingua-2 experiments with two main model configurations:

  • LLMLingua-2: Uses XLM-RoBERTa-large (355M parameters).
  • LLMLingua-2-small: Uses mBERT (110M parameters).

Training utilizes the Adam optimizer (learning rate $1 \times 10^{-5}$, batch size 10, 10 epochs) on the MeetingBank distillation dataset. Inference is performed with greedy decoding (temperature $0$), which ensures determinism for downstream LLM prompting.
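
A compact sketch of this fine-tuning recipe is given below, using the reported optimizer, learning rate, batch size, and epoch count; the toy data, word-to-subword label alignment, and collation details are assumptions for illustration rather than the authors' exact training code.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # learning rate from the reported setup

# toy stand-in for the distilled MeetingBank data: (words, per-word 0/1 "preserve" labels)
train_pairs = [
    (["item", "7", "budget", "approved", "by", "the", "committee"], [0, 1, 1, 1, 0, 0, 1]),
    (["the", "meeting", "adjourned", "at", "5", "pm"], [0, 1, 1, 0, 1, 1]),
]

def collate(batch):
    """Tokenize word-split inputs and project word-level labels onto subword tokens;
    special/padding positions get -100 so the loss ignores them."""
    words = [w for w, _ in batch]
    enc = tokenizer(words, is_split_into_words=True, padding=True,
                    truncation=True, max_length=512, return_tensors="pt")
    labels = torch.full(enc["input_ids"].shape, -100)
    for i, (_, word_labels) in enumerate(batch):
        for j, word_id in enumerate(enc.word_ids(batch_index=i)):
            if word_id is not None:
                labels[i, j] = word_labels[word_id]
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=10, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(10):  # 10 epochs, as reported
    for batch in loader:
        out = model(**batch)   # token-level cross-entropy computed internally
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```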

Resource requirements are modest (2.1 GB peak GPU memory on V100-class hardware), contrasting with 16–26 GB for alternative LLaMA-based compressors.

5. Empirical Evaluation and Benchmarks

LLMLingua-2 is evaluated on both in-domain and out-of-domain corpora, including MeetingBank (QA and summarization), LongBench, ZeroScrolls, GSM8K, and BBH. Core evaluation metrics include:

  • Compression Ratio: $N/\tilde{N}$ (ranges $2\times$–$5\times$ on the target benchmarks).
  • End-to-End Latency: Sum of compressor and LLM inference time.
  • Speedup: Ratio of inference times with and without compression.
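
The arithmetic behind these metrics is simple; the snippet below spells it out with hypothetical timings (not the paper's measurements).

```python
def compression_ratio(n_original: int, n_compressed: int) -> float:
    """Compression ratio N / N-tilde: 3.0 means the prompt shrank to a third of its length."""
    return n_original / n_compressed

def end_to_end_speedup(t_llm_full: float, t_compressor: float, t_llm_compressed: float) -> float:
    """Speedup = latency without compression / (compressor latency + LLM latency on the compressed prompt)."""
    return t_llm_full / (t_compressor + t_llm_compressed)

# hypothetical timings, for illustration only
print(compression_ratio(3000, 1000))                  # 3.0
print(round(end_to_end_speedup(15.0, 0.5, 7.0), 2))   # 2.0
```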

Quantitative Results

| Dataset & Metric | LLMLingua (LLaMA) | LLMLingua-2-small | LLMLingua-2 | Original Prompt |
|---|---|---|---|---|
| MeetingBank QA F1 | 67.5 | 85.8 | 86.9 | 87.8 |
| MeetingBank ROUGE-1 | 38.0 | 48.3 | 48.6 | 47.3 |
| LongBench avg. (5×) | 34.6 | 38.2 | 39.1 | – |
| GSM8K 1-shot EM (5×) | – | – | 79.08 | 79.08 |
| BBH 1-shot acc. (3×) | – | – | 70.11 | 70.02 |

(– indicates a value not reported in this summary.)

Compressing MeetingBank QA prompts at 3× reduces end-to-end latency from $14.9\,\mathrm{s}$ (no compression) to $7.5\,\mathrm{s}$ (a 2.1× speedup), with compressor time of $0.4\,\mathrm{s}$ (vs. $2.1\,\mathrm{s}$ for LLMLingua). Overall, this corresponds to 3×–6× faster prompt compression and a 1.6×–2.9× end-to-end LLM inference speedup, with only a minor performance gap versus the original full-length prompt (Pan et al., 19 Mar 2024).
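
In practice, such compression can be reproduced with the open-source llmlingua package; the snippet below is a usage sketch assuming the package and the published LLMLingua-2 MeetingBank checkpoint are available, with the exact model identifier and argument defaults to be verified against the library's documentation.

```python
# pip install llmlingua  (assumed; interface per the project's documentation)
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",  # released LLMLingua-2 checkpoint
    use_llmlingua2=True,
)

long_prompt = open("meeting_transcript.txt").read()  # hypothetical input file
result = compressor.compress_prompt(
    long_prompt,
    rate=0.33,                      # keep roughly one third of the tokens (about 3x compression)
    force_tokens=["\n", "?", "!"],  # tokens that should never be dropped
)
print(result["compressed_prompt"])
```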

6. Analysis, Ablations, and Generalization

Ablation studies indicate:

  • Chunk-wise compression (as opposed to concatenated or one-shot compression) and the precise GPT-4 instruction are critical; altering these increases VR ($9$–$13\%$) and reduces QA F1 by $8$–$17$ points.
  • Cross-lingual generalization is robust: English-trained models transfer well to the Chinese LongBench-Zh benchmark (ROUGE-1 of $38.1$ vs. a baseline of $28.6$ at $5\times$ compression).
  • Compressed prompts enhance Mistral-7B LLM performance relative to uncompressed originals, suggesting that reducing input length and redundancy can facilitate LLM reasoning and output quality.

7. Limitations and Prospective Extensions

LLMLingua-2's supervision is currently derived from MeetingBank meeting transcripts; extending distillation to other domains such as news or encyclopedic text produces only marginal gains, indicating that redundancy patterns may already generalize. Still, broader coverage and refined sampling could close the remaining faithfulness gap, and richer multi-domain extractive supervision may yield further, if modest, improvements.

Allowing sample-wise dynamic compression ratios under a global token budget can improve downstream performance by $4$–$5\%$; a simple illustration of such budget allocation follows below. Future work may also incorporate light task-aware signals, such as retrieval-based or saliency scores, without sacrificing model compactness or efficiency.
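
Purely as an illustration of what sample-wise ratios under a shared budget could look like (a hypothetical heuristic, not the paper's procedure), one could allocate the global budget in proportion to each prompt's predicted keep-probability mass:

```python
from typing import List

def allocate_budget(keep_probs: List[List[float]], total_budget: int) -> List[int]:
    """Split a global token budget across prompts in proportion to each prompt's
    total predicted 'preserve' probability mass (illustrative heuristic only)."""
    masses = [sum(p) for p in keep_probs]
    total_mass = sum(masses)
    return [min(len(p), round(total_budget * m / total_mass))
            for p, m in zip(keep_probs, masses)]

# toy example: three prompts with per-token p(preserve) scores, 12 tokens allowed in total
probs = [[0.9, 0.8, 0.7, 0.9], [0.2, 0.3, 0.1], [0.6, 0.9, 0.8, 0.7, 0.5]]
print(allocate_budget(probs, total_budget=12))
```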

In summary, LLMLingua-2 operationalizes prompt compression as a distilled, supervised extractive classification task using compact Transformer encoders, consistently outperforming entropy-based approaches in speed, faithfulness, and generalization across tasks and target LLMs (Pan et al., 19 Mar 2024).
