LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (2310.05736v2)

Published 9 Oct 2023 in cs.CL and cs.LG

Abstract: LLMs have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between LLMs. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.

Overview of the Paper “LLMLingua: Compressing Prompts for Accelerated Inference of LLMs”

Introduction

The paper authored by Huiqiang Jiang et al., titled “LLMLingua: Compressing Prompts for Accelerated Inference of LLMs,” addresses a vital challenge in the efficient utilization of LLMs: the computational demand associated with lengthy prompts. In modern AI applications, prompts can exceed tens of thousands of tokens, making computational efficiency a crucial factor. The authors propose LLMLingua, a comprehensive prompt compression strategy aimed at improving inference speed and reducing costs without compromising performance.

Methodology

LLMLingua is structured around three key components: a budget controller, a token-level iterative compression algorithm, and an instruction tuning method for distribution alignment.
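The three stages are packaged in the authors' released implementation (https://aka.ms/LLMLingua). The snippet below is a minimal usage sketch, assuming the published llmlingua Python package; the class and parameter names (PromptCompressor, compress_prompt, target_token) reflect that package at the time of writing and may differ between versions, so the repository should be treated as the authoritative interface.

```python
# Minimal usage sketch of the released llmlingua package (pip install llmlingua).
# The small scoring model loaded by PromptCompressor and the parameter names
# below are assumptions about the public interface, not part of the paper text.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small causal LM used for scoring

demonstrations = [
    "Q: Tom has 3 apples and buys 2 more. How many apples does he have?\n"
    "A: He starts with 3 and adds 2, so 3 + 2 = 5. The answer is 5.",
    "Q: A shelf holds 4 rows of 6 books. How many books is that?\n"
    "A: There are 4 * 6 = 24 books. The answer is 24.",
]

result = compressor.compress_prompt(
    demonstrations,                       # context (demonstrations) to compress
    instruction="Solve the math word problem step by step.",
    question="Q: A box holds 12 pencils and 5 boxes are bought. How many pencils in total?",
    target_token=100,                     # overall budget for the compressed prompt
)
print(result["compressed_prompt"])
```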

Budget Controller

The budget controller allocates different compression ratios to the different parts of a prompt (the instruction, the demonstrations, and the question), applying milder compression to the instruction and question so that task-critical information is retained. Demonstrations are then compressed at a coarse, demonstration-level granularity: each demonstration is scored by its perplexity under a small LLM, and demonstrations are ranked and selectively retained until the allocated token budget is exhausted.
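As a concrete illustration, this coarse-grained step can be approximated with a few lines of Hugging Face code: rank demonstrations by their perplexity under a small causal LM and keep the most informative ones within a token budget. This is a simplified sketch, assuming GPT-2 as the small model and a plain greedy selection rule rather than the paper's full budget-allocation logic.

```python
# Sketch of demonstration-level compression: rank demonstrations by perplexity
# under a small causal LM and keep the most "surprising" (informative) ones
# until a token budget is exhausted. Model choice and the greedy rule are
# simplifying assumptions, not the paper's exact procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean token-level negative log-likelihood
    return torch.exp(loss).item()

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Greedily keep high-perplexity demonstrations within the token budget."""
    kept, used = [], 0
    for demo in sorted(demos, key=perplexity, reverse=True):
        n_tokens = len(tokenizer(demo).input_ids)
        if used + n_tokens > token_budget:
            continue                          # skip demos that no longer fit
        kept.append(demo)
        used += n_tokens
    return kept
```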

Token-Level Iterative Compression

The iterative compression algorithm addresses the interdependence between tokens by dividing the prompt into segments and compressing them sequentially at the token level. Tokens are retained or dropped according to their conditional probabilities given the already-compressed prefix, so each decision accounts for what has survived earlier compression. Unlike single-pass compression, this iterative refinement better preserves the coherence and semantic integrity of the prompt.
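The sketch below illustrates the token-level idea, assuming GPT-2 as the small scoring model, a fixed segment size, and a fixed self-information threshold; the paper's algorithm instead adapts the threshold to meet the target compression ratio.

```python
# Sketch of iterative token-level compression: cut the prompt into segments,
# score each segment conditioned on the tokens kept so far, and retain only
# tokens whose self-information exceeds a threshold. Segment size and
# threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress_tokens(prompt: str, segment_size: int = 64, threshold: float = 2.0) -> str:
    ids = tokenizer(prompt).input_ids
    kept_ids: list[int] = []
    for start in range(0, len(ids), segment_size):
        segment = ids[start:start + segment_size]
        prefix_len = len(kept_ids)
        # Score the segment conditioned on the compressed prefix kept so far.
        context = torch.tensor([kept_ids + segment])
        with torch.no_grad():
            log_probs = torch.log_softmax(model(context).logits, dim=-1)
        for i, tok in enumerate(segment):
            pos = prefix_len + i
            if pos == 0:                      # first token has no prefix to condition on
                kept_ids.append(tok)
                continue
            # Self-information -log p(token | prefix); keep "surprising" tokens.
            if -log_probs[0, pos - 1, tok].item() >= threshold:
                kept_ids.append(tok)
    return tokenizer.decode(kept_ids)
```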

Distribution Alignment

To bridge the discrepancy between the small LLM used for compression and target LLMs, the authors introduce an instruction tuning method. This involves fine-tuning the small LLM using data generated by the target LLM to achieve better alignment in the compression process.
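A rough sketch of this step is shown below, under two assumptions: instruction-response pairs produced by the target LLM have already been collected, and plain causal-LM fine-tuning with the Hugging Face Trainer stands in for the paper's exact instruction-tuning recipe.

```python
# Sketch of aligning the small compression model with the target LLM:
# fine-tune the small LM on (instruction, target-LLM response) pairs.
# The pairs below are placeholders; in practice they come from querying
# the target LLM.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

pairs = [
    {"text": "Instruction: Summarize the report in one sentence.\n"
             "Response: The report finds that prompt length dominates inference cost."},
]

dataset = Dataset.from_list(pairs).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aligned-small-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```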

Experimental Results

The efficacy of LLMLingua was validated on four diverse datasets: GSM8K and BBH for reasoning and in-context learning (ICL), ShareGPT for conversations, and Arxiv-March23 for summarization. The results were impressive, demonstrating state-of-the-art performance with up to 20x compression ratios and minimal performance loss.

  1. GSM8K and BBH: These datasets focus on mathematical and logical reasoning. The experiments showed that LLMLingua managed to maintain reasoning capabilities even at high compression ratios (up to 20x), with performance close to the full-shot prompts.
  2. ShareGPT and Arxiv-March23: For conversational and summarization tasks, the method also performed exceptionally well under different compression constraints, achieving a high BERTScore F1 while substantially reducing prompt length.

The paper also detailed an extensive ablation study validating the significance of each component of the proposed system. Both the iterative token-level compression and the budget controller were shown to substantially improve the performance and robustness of the compression technique.

Implications and Future Directions

The practicality of LLMLingua extends beyond computational savings. By significantly reducing the token length required for prompting, this method allows LLMs to process longer contexts and efficiently handle more extensive inputs, a critical advantage for real-world applications like automated conversation agents and document summarization systems.

Looking ahead, the research proposes avenues like integrating compression mechanisms directly within LLMs and developing adaptive compression ratios that dynamically adjust based on the prompt’s context. Another potential direction is the exploration of guided generation techniques where models can proactively suggest optimal compressed prompts, further enhancing efficiency.

Conclusion

This paper presents a significant advancement in the optimization of LLM inference through the novel technique of prompt compression. LLMLingua systematically maintains the integrity and performance of compressed prompts, demonstrating its utility across various domain-specific tasks. Future advancements could see even more refined approaches to compression, thereby pushing the boundaries of efficient and scalable AI deployment.

Authors (5)
  1. Huiqiang Jiang
  2. Qianhui Wu
  3. Chin-Yew Lin
  4. Yuqing Yang
  5. Lili Qiu