LLMLingua-2: Efficient Prompt Compression for LLMs
- LLMLingua-2 is an extractive prompt compression method that uses GPT-4 distilled supervision to achieve high compression ratios without sacrificing downstream task performance.
- It leverages a bidirectional Transformer encoder to classify tokens, ensuring only crucial content is retained and preserving the original prompt's integrity.
- Empirical results demonstrate 2×–5× compression and significant inference speedups while maintaining accuracy across tasks such as QA, summarization, and mathematical reasoning.
LLMLingua-2 is a method for efficient, faithful, and highly generalizable task-agnostic prompt compression, designed for LLM inference under context-window and latency constraints. It formulates compression as an extractive token classification problem, moving beyond information-entropy-based trimming by leveraging a distilled supervision signal from GPT-4. The approach ensures that every retained token is justified with respect to the original prompt’s content, delivers substantial compression ratios (2×–5×), and compresses prompts and completes end-to-end inference faster than LLaMA- or GPT-based approaches, while preserving performance across a wide range of downstream tasks and domains (Pan et al., 19 Mar 2024).
1. Problem Formulation and Motivation
LLMLingua-2 targets the task-agnostic prompt compression problem: given an input prompt $\boldsymbol{x} = (x_1, \dots, x_N)$ of $N$ tokens, the goal is to learn a function that selects a subset of $\lceil \tau N \rceil$ tokens for a target retention rate $\tau \in (0, 1)$, without access to downstream task labels (the selection step is sketched at the end of this section). Prior art typically removes tokens with low information entropy, as computed by a small causal LM such as LLaMA-7B, but this introduces two critical limitations:
- Limited Context Utilization: A causal LM scores each token from left-to-right context only, ignoring dependencies and information that only a bidirectional model can capture.
- Objective Misalignment: Entropy-based heuristics are not optimized for faithfulness or downstream task utility, so preserving high-entropy tokens does not guarantee that critical content remains.
The objectives of LLMLingua-2 are threefold:
- Faithfulness: Preserve all essential content without introducing new/hallucinated tokens.
- Efficiency: Replace heavyweight compressive LMs with a compact Transformer encoder (e.g., XLM-RoBERTa-large or mBERT).
- Generalization: Train on a task-agnostic extractive distillation dataset and demonstrate consistent transfer across multiple LLM architectures (e.g., GPT-3.5, Mistral-7B) and tasks (summarization, QA, mathematical reasoning, etc.).
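The selection step referenced above can be written compactly as a budgeted token-selection problem. This is a hedged restatement in the notation of this section, with $p(x_i)$ denoting the per-token retention score defined in Section 3; it is a sketch, not the paper's verbatim objective:

```latex
% Extractive, task-agnostic compression as budgeted token selection (sketch).
% m_i = 1 keeps token x_i; exactly ceil(tau * N) tokens are retained in order.
\begin{aligned}
\max_{m \in \{0,1\}^N} \quad & \sum_{i=1}^{N} m_i \, p(x_i) \\
\text{s.t.} \quad & \sum_{i=1}^{N} m_i = \lceil \tau N \rceil, \qquad
\tilde{\boldsymbol{x}} = (x_i)_{\, i \,:\, m_i = 1}.
\end{aligned}
```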
2. Data Distillation and Supervision Signal
LLMLingua-2 introduces a novel supervised dataset distillation procedure:
- Extractive Data Construction: MeetingBank transcripts, segmented into chunks, are compressed by GPT-4 under controlled instructions that permit only token removals (no reordering, rewriting, or hallucination) while maximizing compression and retaining all “crucial information.”
- Alignment and Annotation: The compressed and original texts are aligned via fuzzy matching and a sliding-window search, yielding per-token binary labels $y_i \in \{0, 1\}$, where $y_i = 1$ indicates “preserve.”
- Optimization Objective: Supervised learning minimizes a token-level cross-entropy classification loss, $\mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CE}\big(y_i, \, p(x_i; \Theta)\big)$.
- Distillation Quality Control:
- Variation Rate (VR): Fraction of words in the compressed text that do not appear in the original text.
- Alignment Gap (AG): Difference between the hitting rate (HR) and the matching rate (MR), i.e. $\mathrm{AG} = \mathrm{HR} - \mathrm{MR}$.
- Samples with the highest VR and AG are discarded to keep the extracted supervision faithful (see the sketch below).
This distillation procedure enforces the extractive, faithful, and redundancy-minimizing nature of the compression (Pan et al., 19 Mar 2024).
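A minimal sketch of this quality-control filtering, assuming whitespace word splitting and a pre-computed word-level alignment; the function names and the exact HR/MR denominators are illustrative choices, not taken from the released code:

```python
from collections import Counter

def variation_rate(original_words, compressed_words):
    """VR: fraction of compressed words that never occur in the original text."""
    orig = Counter(original_words)
    novel = sum(1 for w in compressed_words if orig[w] == 0)
    return novel / max(len(compressed_words), 1)

def alignment_gap(original_words, compressed_words, preserve_labels):
    """AG = HR - MR (one plausible reading of the definitions above).

    HR: fraction of compressed words that can be found in the original text.
    MR: fraction of compressed words that the fuzzy-matching step mapped back
        to an original word labeled 'preserve' (preserve_labels is 0/1 per
        original word).
    A large AG means many compressed words exist in the original yet received
    no 'preserve' label, i.e. the alignment (and hence the labels) is unreliable.
    """
    orig = Counter(original_words)
    n = max(len(compressed_words), 1)
    hr = sum(1 for w in compressed_words if orig[w] > 0) / n
    mr = sum(preserve_labels) / n
    return hr - mr

# Samples in the top percentiles of VR and AG are dropped before training.
```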
3. Model Architecture and Compression Workflow
Prompt compression is formulated as a per-token binary classification:
- Backbone: A bidirectional Transformer encoder (XLM-RoBERTa-large for the default model, mBERT for the smaller variant) produces a contextual embedding $h_i$ for each word $x_i$ in the prompt $\boldsymbol{x}$.
- Prediction: A linear classification head computes, for each token, the probability of keeping it, $p(x_i) = \mathrm{softmax}(W h_i + b)$, where $p(x_i) \in \mathbb{R}^2$ is a distribution over the labels {discard, preserve}.
- Extraction: Select the $\lceil \tau N \rceil$ tokens with the highest preserve probability $p(x_i)$ and emit them in their original order.
This approach guarantees output sequences are always subsequences of the source (extractive), ensuring faithfulness and preventing reordering or hallucination.
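The classify-then-extract workflow can be sketched with Hugging Face `transformers`, here using a plain `xlm-roberta-large` token-classification head (untrained, for illustration only; in practice the distilled LLMLingua-2 checkpoint would be loaded, and word-level selection is simplified to subword level):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_NAME = "xlm-roberta-large"  # stand-in; the released LLMLingua-2 weights would go here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def compress(prompt: str, rate: float = 0.33) -> str:
    """Keep roughly `rate` of the tokens, ranked by p(preserve), in original order."""
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]          # [seq_len, 2]
    p_keep = logits.softmax(dim=-1)[:, 1]        # probability of the "preserve" label

    n_keep = max(1, int(round(rate * p_keep.numel())))
    keep_idx = p_keep.topk(n_keep).indices.sort().values  # restore original order

    kept_ids = enc["input_ids"][0][keep_idx]
    # Decoding the kept subword pieces back to text approximates the word-level
    # extraction described above.
    return tokenizer.decode(kept_ids, skip_special_tokens=True)

print(compress("Councilmember Smith moved to approve the transit budget, seconded by ..."))
```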
4. Training Regimen and Implementation
LLMLingua-2 experiments with two main model configurations:
- LLMLingua-2: Uses XLM-RoBERTa-large (~550M parameters).
- LLMLingua-2-small: Uses mBERT (~180M parameters).
Training uses the Adam optimizer with a batch size of 10 for 10 epochs on the MeetingBank dataset. Downstream inference uses greedy decoding (temperature 0), which keeps target-LLM outputs deterministic.
Resource requirements are modest (2.1 GB peak GPU memory on V100-class hardware), contrasting with 16–26 GB for alternative LLaMA-based compressors.
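A compressed sketch of this training regimen (toy data stands in for the distilled MeetingBank labels, and the learning rate is a placeholder, since the exact value is not restated here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_NAME = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy stand-in for the distilled data: chunks of text plus per-token
# {0: drop, 1: preserve} labels (random here, for shape only).
enc = tokenizer(["the council approved the transit budget"] * 20,
                padding=True, truncation=True, return_tensors="pt")
labels = torch.randint(0, 2, enc["input_ids"].shape)
labels[enc["attention_mask"] == 0] = -100  # ignore padding positions in the loss

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr is a placeholder value
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=10, shuffle=True)

model.train()
for epoch in range(10):  # 10 epochs, batch size 10, Adam, as described above
    for input_ids, attention_mask, y in loader:
        # The token-classification head applies cross-entropy over the
        # {drop, preserve} labels whenever `labels` is passed to the forward call.
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```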
5. Empirical Evaluation and Benchmarks
LLMLingua-2 is evaluated on both in-domain and out-of-domain corpora, including MeetingBank (QA and summarization), LongBench, ZeroScrolls, GSM8K, and BBH. Core evaluation metrics include:
- Compression Ratio: Length of the original prompt divided by the length of the compressed prompt (2×–5× on the target benchmarks).
- End-to-End Latency: Sum of compressor and LLM inference time.
- Speedup: Ratio of inference times with and without compression.
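These metrics reduce to simple ratios; a small helper sketch (variable names are illustrative):

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """E.g., 3000 original tokens reduced to 600 -> 5x compression."""
    return original_tokens / compressed_tokens

def end_to_end_speedup(full_prompt_latency: float,
                       compressor_latency: float,
                       llm_latency_compressed: float) -> float:
    """Speedup = latency on the full prompt / (compression time + LLM time on the shorter prompt)."""
    return full_prompt_latency / (compressor_latency + llm_latency_compressed)
```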
Quantitative Results
| Dataset & Metric | LLMLingua (LLaMA) | LLMLingua-2-small | LLMLingua-2 | Original Prompt |
|---|---|---|---|---|
| MeetingBank QA F1 | 67.5 | 85.8 | 86.9 | 87.8 |
| MeetingBank ROUGE-1 | 38.0 | 48.3 | 48.6 | 47.3 |
| LongBench avg. (5×) | 34.6 | 38.2 | 39.1 | — |
| GSM8K 1-shot EM (5×) | 79.08 | — | 79.08 | — |
| BBH 1-shot acc. (3×) | 70.11 | — | 70.02 | — |
Compressing MeetingBank QA prompts at 3× reduces end-to-end latency by roughly 2.1× relative to the uncompressed prompt, with the compressor itself accounting for only a small fraction of total time (far less than LLMLingua's LLaMA-based compressor). Overall, LLMLingua-2 delivers 3×–6× faster prompt compression and a 1.6×–2.9× end-to-end inference speedup, with only a minor performance gap versus the original full-length prompt (Pan et al., 19 Mar 2024).
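In practice, this pipeline is exposed through the open-source `llmlingua` package; the snippet below follows the usage documented in the project README at the time of writing (the checkpoint name, argument names, and returned keys should be treated as assumptions if the API has since changed):

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# LLMLingua-2 checkpoint distilled from GPT-4-compressed MeetingBank data.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

transcript = "Item 7: The committee reviewed the proposed transit budget. ..."  # original prompt
result = compressor.compress_prompt(
    transcript,
    rate=0.33,                  # keep roughly one third of the tokens (~3x compression)
    force_tokens=["\n", "?"],   # tokens that must always be preserved
)
print(result["compressed_prompt"])
```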
6. Analysis, Ablations, and Generalization
Ablation studies indicate:
- Chunk-wise compression (as opposed to compressing the full concatenated text in one pass) and the precise GPT-4 instruction are both critical; altering either substantially increases VR and reduces QA F1 by 8–17 points.
- Cross-lingual generalization is robust: the English-trained model transfers well to the Chinese LongBench-Zh subset (ROUGE-1 of 38.1 vs. a baseline of 28.6).
- Compressed prompts can even improve Mistral-7B performance relative to the uncompressed originals, suggesting that trimming length and redundancy makes it easier for the target LLM to reason over the remaining content.
7. Limitations and Prospective Extensions
LLMLingua-2’s supervision is currently derived from MeetingBank meeting transcripts; extending distillation to other domains such as news or encyclopedic text yields only marginal gains, suggesting that textual redundancy patterns largely generalize. Even so, broader domain coverage and refined sampling could close the remaining faithfulness gap, and richer multi-domain extractive supervision may still bring incremental improvements.
Allowing sample-wise dynamic compression ratios under a global token budget can further improve downstream performance by several points. Future work could also incorporate light task-aware signals, such as retrieval-based or saliency scores, without sacrificing the compressor's compactness or efficiency.
In summary, LLMLingua-2 operationalizes prompt compression as a distilled, supervised extractive classification task using compact Transformer encoders, rigorously outperforming entropy-based approaches on speed, faithfulness, and broad generalizability (Pan et al., 19 Mar 2024).