LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
Introduction
The paper "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression" addresses significant challenges faced by LLMs when processing extensive context. These challenges include higher computational and financial costs, longer latency, and degraded performance. Previous studies have noted that the effectiveness of LLMs is influenced by the density and position of key information within the input. Building on these insights, LongLLMLingua proposes a technique for prompt compression to enhance the perception of key information by LLMs, thus alleviating the identified challenges.
Key Contributions
The paper makes several notable contributions:
- Question-Aware Coarse-to-Fine Compression:
- The authors introduce a question-aware coarse-to-fine compression method. The prompt is compressed in two stages, a coarse document-level pass followed by a fine-grained token-level pass, both conditioned on the question, so that information relevant to the question is concentrated in the compressed prompt.
- Document Reordering Mechanism:
- A document reordering mechanism is proposed to counter the "lost in the middle" effect, where information placed mid-context is frequently missed. Documents are reordered by the relevance scores obtained during coarse-grained compression, so that key information lands at positions LLMs process more effectively.
- Dynamic Compression Ratios:
- To control how much compression each document receives, the authors introduce dynamic compression ratios. During fine-grained compression, more relevant documents are assigned lower compression ratios so that they retain more of their original content.
- Post-Compression Subsequence Recovery:
- A subsequence recovery strategy is proposed to restore key information that compression may have distorted. After the LLM responds, entity names and other critical details that were truncated in the compressed prompt are mapped back to their original form from the source prompt.
Methodology
Problem Formulation and Approach
The problem is framed as an optimization: compress the prompt, subject to a token budget, while keeping the target LLM's output distribution as close as possible to the distribution it would produce from the original prompt. The procedure combines document-level selection and reordering with token-level pruning.
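In rough form (notation simplified from the paper, which additionally folds document reordering into the search space), the objective is to find a compressed prompt whose induced output distribution stays close to that of the original prompt under a token budget:

```latex
\min_{\tilde{x}} \; D\bigl( P(y \mid \tilde{x}) \,\Vert\, P(y \mid x) \bigr)
\quad \text{subject to} \quad
\lVert \tilde{x} \rVert_{0} \le \tau \, \lVert x \rVert_{0}
```

Here x is the original prompt (instruction, documents, and question), x̃ is a compressed prompt built from a subset of its tokens, D is a distance between output distributions (e.g., a KL divergence), and τ is the target compression rate.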
- Coarse-Grained Compression:
The authors calculate a relevance score r_k for each document from the perplexity of the question conditioned on that document: documents that make the question more predictable score as more relevant, and irrelevant documents are discarded to reduce noise in the compressed prompt (a minimal sketch of this scoring, the fine-grained token scoring, and the reordering follows this list).
- Fine-Grained Compression:
Within the documents kept after coarse compression, token importance is measured with contrastive perplexity, i.e., the shift in a token's perplexity when the question is prepended to the context. Tokens whose predictability changes most given the question are retained, further compressing the prompt.
- Document Reordering:
Documents are reordered by relevance score, from most to least relevant, so that the most pertinent information sits near the beginning of the prompt, a position where LLMs attend to it more reliably.
- Subsequence Recovery:
After the LLM generates its response, a token-level subsequence recovery step maps key spans (e.g., entity names) that were truncated by compression back to their original form from the source prompt, improving the accuracy and reliability of the output (a simplified sketch of this step also follows the list).
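The following sketch illustrates the two question-aware scoring steps (coarse document relevance and fine-grained contrastive token importance) plus the relevance-based reordering, using a small Hugging Face causal LM as the scoring model. The choice of scoring model, the restrictive-statement text, and the keep ratio are assumptions made for the illustration; this is not the authors' implementation.

```python
# Illustrative sketch of LongLLMLingua-style question-aware scoring using a
# small causal LM from Hugging Face transformers as the scoring model.
# The scoring model, restrictive statement, and keep ratio are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # any small causal LM is enough to illustrate the idea
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# The paper appends a short "restrictive statement" to the question when scoring.
RESTRICT = "We can get the answer to this question in the given documents."


def question_nll(document: str, question: str) -> float:
    """Coarse-grained relevance: average NLL of the question given a document.

    A lower NLL means the document makes the question more predictable,
    which we interpret as higher relevance (the score r_k in the summary).
    """
    ctx_ids = tokenizer(document + "\n", return_tensors="pt").input_ids
    que_ids = tokenizer(question + " " + RESTRICT, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, que_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # only question tokens contribute to the loss
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return loss.item()


def coarse_compress_and_reorder(documents, question, keep_ratio=0.5):
    """Drop the least relevant documents and sort the rest most-relevant-first."""
    scored = sorted((question_nll(d, question), d) for d in documents)
    n_keep = max(1, int(len(scored) * keep_ratio))  # the paper uses a token budget instead
    return [d for _, d in scored[:n_keep]]


def contrastive_token_scores(document: str, question: str):
    """Fine-grained importance: how much the question lowers each token's NLL.

    Returns (token, score) pairs; a larger score means the token is more tied
    to the question and should survive token-level pruning.
    """
    doc_ids = tokenizer(document, return_tensors="pt").input_ids
    que_ids = tokenizer(question + "\n", return_tensors="pt").input_ids

    def per_token_nll(input_ids, n_prefix):
        # NLL of every token that comes after the first n_prefix tokens.
        with torch.no_grad():
            logits = model(input_ids).logits
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = input_ids[:, 1:]
        nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return nll[0, n_prefix - 1:]

    plain = per_token_nll(doc_ids, 1)                                   # p(x_i | x_<i)
    cond = per_token_nll(torch.cat([que_ids, doc_ids], dim=1),
                         que_ids.shape[1] + 1)                          # p(x_i | question, x_<i)
    tokens = tokenizer.convert_ids_to_tokens(doc_ids[0].tolist())[1:]
    return list(zip(tokens, (plain - cond).tolist()))


if __name__ == "__main__":
    docs = [
        "Photosynthesis converts light energy into chemical energy in plants.",
        "The Eiffel Tower is located in Paris and was completed in 1889.",
        "Paris is the capital of France and sits on the Seine.",
    ]
    question = "When was the Eiffel Tower completed?"
    kept = coarse_compress_and_reorder(docs, question)
    print(kept[0])  # most relevant retained document first
    for tok, score in contrastive_token_scores(kept[0], question)[:10]:
        print(f"{tok!r}: {score:+.2f}")
```

In the paper, the keep decision is driven by a token budget rather than a fixed keep ratio, and the dynamic compression ratios described above make each document's token budget a function of its relevance score r_k.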
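The post-generation recovery step can be pictured with a word-level heuristic: a response fragment that also appears, possibly truncated, in the compressed prompt is expanded back to the longest matching word from the original prompt. This deliberately simplifies the paper's token-level procedure, and the example strings below are invented for illustration.

```python
# Simplified sketch of post-compression subsequence recovery: if a fragment of
# the model's response also appears (possibly truncated) in the compressed
# prompt, replace it with the longest word in the ORIGINAL prompt that extends
# it. A word-level approximation of the paper's token-level procedure.
import re


def recover_spans(response: str, compressed_prompt: str, original_prompt: str) -> str:
    recovered = response
    original_words = [w.strip(".,;:!?") for w in original_prompt.split()]
    for match in re.finditer(r"\w[\w-]*", response):
        fragment = match.group(0)
        # Only fragments that survived compression (perhaps truncated) are candidates.
        if fragment not in compressed_prompt:
            continue
        extensions = [w for w in original_words
                      if w.startswith(fragment) and len(w) > len(fragment)]
        if extensions:
            recovered = recovered.replace(fragment, max(extensions, key=len), 1)
    return recovered


if __name__ == "__main__":
    original = "The treaty was signed in Bratislava on 14 March."
    compressed = "treaty signed Bratis 14 March"        # hypothetical compressed prompt
    response = "It was signed in Bratis on 14 March."   # hypothetical LLM output
    print(recover_spans(response, compressed, original))
    # -> It was signed in Bratislava on 14 March.
```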
Experimental Results
The efficacy of LongLLMLingua was evaluated on several benchmarks encompassing various long context scenarios, including multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion. Key results include:
- Performance Improvement:
On the NaturalQuestions benchmark, LongLLMLingua achieved performance gains of up to 17.1% over the original prompt while sending approximately 4x fewer input tokens to GPT-3.5-Turbo.
- Cost Savings:
LongLLMLingua demonstrated substantial financial savings, reducing inference costs by \$28.5 and \$27.4 per 1,000 samples on the LongBench and ZeroSCROLLS benchmarks, respectively.
- Latency Reduction:
When prompts of approximately 10,000 tokens were compressed at rates between 2x and 10x, end-to-end latency dropped by a factor of 1.4x to 3.8x.
Implications and Future Work
Practical Implications
LongLLMLingua has practical implications for efficiently deploying LLMs in cost-sensitive and latency-critical applications, particularly those involving long context scenarios such as extensive document retrieval, legal text analysis, and scientific literature summarization.
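For deployment, the techniques above are available in the authors' open-source llmlingua Python package (https://github.com/microsoft/LLMLingua). The snippet below follows the usage pattern from the project's README as recalled here; parameter names and defaults have shifted across releases, so treat it as a starting point rather than a definitive API reference.

```python
# Sketch of using the released llmlingua package (pip install llmlingua) in the
# LongLLMLingua configuration. Parameter names follow the project's README as
# remembered here and may differ in the version you install.
from llmlingua import PromptCompressor

# The default scoring model is a 7B LLaMA variant and needs a GPU in practice;
# a smaller scoring model can be selected via the constructor if needed.
llm_lingua = PromptCompressor()

context_documents = [
    "Document 1: ...",   # placeholder retrieved passages
    "Document 2: ...",
]
question = "Which document answers the user's query?"

compressed = llm_lingua.compress_prompt(
    context_documents,
    question=question,
    rate=0.5,                                   # keep roughly half of the tokens
    condition_in_question="after_condition",    # question-aware scoring
    reorder_context="sort",                     # document reordering by relevance
    dynamic_context_compression_ratio=0.3,      # dynamic per-document ratios
    condition_compare=True,                     # contrastive perplexity for tokens
    rank_method="longllmlingua",                # coarse-grained ranking method
)
print(compressed["compressed_prompt"])
```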
Theoretical Implications
The work offers insight into how information should be structured within prompts and motivates further exploration of retrieval and prompt-alignment techniques. The proposed methods could also be extended, for instance by using other forms of conditioning when computing relevance scores.
Future Directions
Future research may focus on integrating LongLLMLingua with other LLM frameworks to further improve its applicability and efficiency. Additionally, the development of more sophisticated relevance metrics and advanced sequence recovery techniques could enhance performance further.
Conclusion
LongLLMLingua provides a sophisticated approach to managing long contexts in LLMs, addressing both efficiency and performance issues through innovative prompt compression techniques. The experimental results affirm the method's efficacy, and its broader applicability suggests significant potential for optimizing LLM performance in various real-world applications.