LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression (2403.12968v2)

Published 19 Mar 2024 in cs.CL and cs.LG

Abstract: This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal LLM such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x. Our code is available at https://aka.ms/LLMLingua-2.

Overview

This document provides an overview and an in-depth analysis of the research in "LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression." The paper tackles prompt compression for LLMs, presenting LLMLingua-2, a novel approach that makes compression more efficient while ensuring the retention of critical information.

Introduction

Recent advancements in prompting techniques for LLMs, such as Chain-of-Thought (CoT), In-Context Learning (ICL), and Retrieval-Augmented Generation (RAG), have significantly extended the capabilities of these models. However, the resulting lengthy prompts, though rich in information, impose substantial computational and financial costs. Prompt compression aims to mitigate these costs by reducing prompt length without losing essential information.

LLMLingua-2 departs from existing methods that rely on information entropy, instead introducing a data distillation procedure that draws knowledge from an LLM to achieve more efficient and faithful prompt compression. The method is task-agnostic, which improves generalizability and efficiency across applications.

Methodology

Data Distillation Procedure

LLMLingua-2's data distillation procedure uses GPT-4 to build a text compression dataset of original and compressed text pairs: GPT-4 is instructed to compress each text by discarding redundant tokens while retaining all crucial information, keeping the compression extractive. Prompt compression is then reframed as a token classification problem, which lets a Transformer encoder draw on the full bidirectional context when deciding which tokens to keep.
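
To make the annotation step concrete, here is a minimal sketch, in Python with illustrative names that are not from the released code, of how an (original, compressed) pair produced by GPT-4 could be turned into per-token keep/discard labels for classifier training. The paper's annotation algorithm handles subtleties such as ambiguous or repeated matches that this simple greedy version ignores.

```python
# Hypothetical sketch: convert an (original, compressed) pair from the
# distillation step into per-token keep/discard labels. Names are
# illustrative, not taken from the released code.

def assign_labels(original_tokens: list[str], compressed_tokens: list[str]) -> list[int]:
    """Greedy left-to-right alignment: label a token 1 ("keep") if it matches
    the next unconsumed token of the compressed text, else 0 ("discard")."""
    labels = [0] * len(original_tokens)
    j = 0  # cursor into the compressed token sequence
    for i, tok in enumerate(original_tokens):
        if j < len(compressed_tokens) and tok.lower() == compressed_tokens[j].lower():
            labels[i] = 1
            j += 1
    return labels

original = "Item 15, report from City Manager recommendation to adopt the resolution".split()
compressed = "Item 15, report recommendation adopt resolution".split()
print(assign_labels(original, compressed))
# -> [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
```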

Extractive Text Compression Dataset

The dataset comprises original texts from MeetingBank and their compressed counterparts, annotated to indicate whether each token should be preserved or discarded. Quality control metrics, such as Variation Rate (VR) and Alignment Gap (AG), ensure the fidelity and effectiveness of the annotation process.
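
As an illustration, the snippet below sketches a Variation-Rate-style filter under the assumption that VR measures the fraction of words in the compressed text that do not appear in the original (a sign the compression was not faithfully extractive); the exact definitions of VR and AG are given in the paper.

```python
# Hedged sketch of a Variation Rate-style check. High values suggest the
# GPT-4 compression introduced words not present in the original, so the
# example should be filtered out or re-annotated. The paper's exact formula
# may differ from this simplification.

def variation_rate(original: str, compressed: str) -> float:
    original_vocab = set(original.lower().split())
    compressed_words = compressed.lower().split()
    if not compressed_words:
        return 0.0
    novel = sum(1 for w in compressed_words if w not in original_vocab)
    return novel / len(compressed_words)

print(variation_rate("the meeting was adjourned at noon",
                     "meeting adjourned noon"))    # 0.0 -> faithful, keep
print(variation_rate("the meeting was adjourned at noon",
                     "meeting ended early"))       # ~0.67 -> flag for filtering
```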

Model Architecture and Training

The token classification model employs a Transformer encoder as the feature extractor, followed by a linear classification layer to predict token retention probabilities. The model is trained on the MeetingBank compression dataset using cross-entropy loss. Crucially, this approach guarantees the faithfulness of the compressed prompts by maintaining the original token sequence and leveraging bidirectional context.
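
A minimal sketch of this architecture is shown below, assuming the Hugging Face transformers interface for a token classification head on top of XLM-RoBERTa-large; hyperparameters, data handling, and the dummy labels are illustrative rather than the paper's exact setup.

```python
# Minimal sketch of the classifier described above: a bidirectional encoder
# plus a linear keep/discard head trained with cross-entropy. The real
# training pipeline (label alignment to subwords, hyperparameters) follows
# the paper and released code; this only illustrates the architecture.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2  # 0 = discard, 1 = keep
)

text = "Item 15, report from City Manager recommendation to adopt the resolution"
enc = tokenizer(text, return_tensors="pt", truncation=True)

# Dummy per-token labels just to show the training signal; real labels come
# from the GPT-4-distilled extractive compression dataset.
labels = torch.randint(0, 2, enc["input_ids"].shape)

out = model(**enc, labels=labels)   # cross-entropy loss over token labels
out.loss.backward()                 # one illustrative backward pass
print(out.loss.item(), out.logits.shape)  # logits: (1, seq_len, 2)
```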

Results

In-Domain Evaluation

The model's performance was evaluated on both QA and summarization tasks within the MeetingBank dataset. LLMLingua-2 demonstrated significant improvements over existing baselines, including Selective-Context and the original LLMLingua. Notably, although its compressor is far smaller than the LLaMA-2-7B used by those baselines, LLMLingua-2 outperformed them on QA F1 scores and on summarization metrics such as BLEU, ROUGE, and BERTScore.

Out-of-Domain Evaluation

The robustness of LLMLingua-2 was further tested on long-context datasets such as LongBench and ZeroSCROLLS, as well as reasoning benchmarks like GSM8K and BBH. The results underscored LLMLingua-2's superior generalizability, achieving higher performance compared to task-agnostic baselines. Even the smaller LLMLingua-2 model (based on multilingual-BERT) surpassed the performance of LLaMA-2-7B-based models.

Efficiency and Latency

LLMLingua-2's compact model size translates into significant reductions in latency and GPU memory usage. The compressor itself runs 3x-6x faster than existing prompt compression methods and accelerates end-to-end latency by 1.6x to 2.9x at compression ratios of 2x to 5x, a compelling advantage in practical deployments. Its peak GPU memory usage is also considerably lower than that of comparable methods, further enhancing its applicability in resource-constrained environments.
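
For practical use, the released package wraps these models in a compressor object. The snippet below follows the usage documented in the LLMLingua repository at the time of writing; the class name, arguments, and checkpoint identifier should be treated as assumptions to verify against the current README.

```python
# Hedged usage sketch based on the LLMLingua repository's documented interface;
# checkpoint name and arguments are assumptions to verify against the README.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,  # select the LLMLingua-2 token-classification compressor
)

prompt = "<a long meeting transcript or retrieved context goes here>"
result = compressor.compress_prompt(prompt, rate=0.33)  # keep ~1/3 of tokens (~3x compression)

# The call returns the compressed prompt plus token statistics (per the README).
print(result["compressed_prompt"])
```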

Implications and Future Directions

LLMLingua-2's approach to prompt compression represents a significant stride in improving the efficiency and reliability of LLM applications. The task-agnostic nature of the model ensures broad applicability, while the data distillation procedure guarantees high-quality compression without sacrificing essential information. Future research could explore extending the dataset to cover a wider range of domains, enhancing the model's generalizability further.

Overall, LLMLingua-2 sets a new standard for prompt compression in LLMs, balancing efficiency and fidelity to meet the demands of diverse real-world applications. The model's integration with existing compression frameworks and potential for expansion points to a promising trajectory for ongoing advancements in AI-driven language processing.

References (35)
  1. LongBench: A bilingual, multitask benchmark for long context understanding. ArXiv preprint, abs/2308.14508.
  2. BIG bench authors. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
  3. Walking down the memory maze: Beyond context limit through interactive reading. ArXiv preprint, abs/2310.05029.
  4. Adapting language models to compress contexts. ArXiv preprint, abs/2305.14788.
  5. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168.
  6. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  7. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  8. A survey for in-context learning. ArXiv preprint, abs/2301.00234.
  9. Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1481–1491, Seattle, Washington, USA. Association for Computational Linguistics.
  10. In-context autoencoder for context compression in a large language model. ArXiv preprint, abs/2307.06945.
  11. MeetingBank: A benchmark dataset for meeting summarization. ArXiv preprint, abs/2305.17529.
  12. Boosting LLM reasoning: Push the limits of few-shot learning with reinforced in-context pruning. ArXiv preprint, abs/2312.08901.
  13. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore. Association for Computational Linguistics.
  14. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. ArXiv preprint, abs/2310.06839.
  15. Hoyoun Jung and Kyung-Joong Kim. 2023. Discrete prompt compression with reinforcement learning. ArXiv preprint, abs/2308.08758.
  16. Abstractive summarization of Reddit posts with multi-level memory networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2519–2531, Minneapolis, Minnesota. Association for Computational Linguistics.
  17. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  18. Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. ArXiv preprint, abs/1810.09305.
  19. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  20. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, Singapore. Association for Computational Linguistics.
  21. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  22. Lost in the middle: How language models use long contexts. ArXiv preprint, abs/2307.03172.
  23. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Thirty-seventh Conference on Neural Information Processing Systems.
  24. Learning to compress prompts with gist tokens. In Thirty-seventh Conference on Neural Information Processing Systems.
  25. MemGPT: Towards LLMs as operating systems. ArXiv preprint, abs/2310.08560.
  26. Allen Roush and Arvind Balaji. 2020. DebateSum: A large-scale argument mining and summarization dataset. In Proceedings of the 7th Workshop on Argument Mining, pages 1–7, Online. Association for Computational Linguistics.
  27. ZeroSCROLLS: A zero-shot benchmark for long text understanding. ArXiv preprint, abs/2305.14196.
  28. Claude E. Shannon. 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64.
  29. A dataset and evaluation metrics for abstractive compression of sentences and short paragraphs. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 340–350, Austin, Texas. Association for Computational Linguistics.
  30. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  31. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621–5634, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  32. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations.
  33. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations.
  34. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Thirty-seventh Conference on Neural Information Processing Systems.
  35. Reducing quantity hallucinations in abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2237–2249, Online. Association for Computational Linguistics.
Authors (13)
  1. Zhuoshi Pan
  2. Qianhui Wu
  3. Huiqiang Jiang
  4. Menglin Xia
  5. Xufang Luo
  6. Jue Zhang
  7. Qingwei Lin
  8. Victor Rühle
  9. Yuqing Yang
  10. Chin-Yew Lin
  11. H. Vicky Zhao
  12. Lili Qiu
  13. Dongmei Zhang