Compressing Context to Enhance Inference Efficiency of Large Language Models (2310.06201v1)

Published 9 Oct 2023 in cs.CL

Abstract: LLMs achieved remarkable performance across various tasks. However, they face challenges in managing long documents and extended conversations, due to significantly increased computational requirements, both in memory and inference time, and potential context truncation when the input exceeds the LLM's fixed context length. This paper proposes a method called Selective Context that enhances the inference efficiency of LLMs by identifying and pruning redundancy in the input context to make the input more compact. We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations, on tasks of summarisation, question answering, and response generation. Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency while maintaining comparable performance compared to that achieved when full context is used. Specifically, we achieve a 50% reduction in context cost, resulting in a 36% reduction in inference memory usage and a 32% reduction in inference time, while observing only a minor drop of .023 in BERTscore and .038 in faithfulness on four downstream applications, indicating that our method strikes a good balance between efficiency and performance.

Enhancing LLM Inference Efficiency through Context Compression

The paper "Compressing Context to Enhance Inference Efficiency of LLMs" presents Selective Context, a method aimed at improving the efficiency of LLMs when processing long documents and sustained conversations. The authors address the challenge of managing extended contexts, which can strain computational resources and lead to context truncation due to fixed input lengths.

Core Contributions

Selective Context is proposed as a novel approach to prune redundancy in the input context, ensuring more efficient use of fixed context windows in LLMs. Key features of this method include:

  1. Redundancy Identification and Pruning: The approach assesses the informativeness of lexical units within the input context using self-information computed by a smaller base causal language model, and systematically removes redundant units to make the input more compact (a minimal sketch of this scoring step appears after this list).
  2. Experimental Evaluation: The method was tested on diverse data sources requiring long contexts, such as arXiv papers, news articles, and long conversations, using summarization, question answering, and response generation tasks to gauge its effectiveness.
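
To make the scoring step concrete, here is a minimal sketch of computing per-token self-information with a small Hugging Face causal model. GPT-2 and the helper name are illustrative assumptions, not the authors' exact implementation; the paper's method additionally aggregates token scores to the chosen lexical-unit level (e.g., phrases or sentences) before pruning.

```python
# Minimal sketch: per-token self-information from a small base causal LM.
# GPT-2 and the helper name are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_self_information(text: str):
    """Return (token, self-information) pairs, where self-information = -log p(token | prefix)."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)     # position i predicts token i+1
    targets = input_ids[0, 1:]
    nll = -log_probs[torch.arange(targets.size(0)), targets]  # per-token self-information (nats)
    return list(zip(tokenizer.convert_ids_to_tokens(targets), nll.tolist()))

for tok, info in token_self_information("The meeting was held on Monday as previously announced."):
    print(f"{tok!r}: {info:.2f} nats")
```

Tokens (or larger units) with low self-information are highly predictable from their prefix and are therefore candidates for removal.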

Results and Performance

The experiments show that Selective Context achieves notable reductions in memory cost and generation latency while maintaining performance comparable to using the full context:

  • A 50% reduction in context cost resulted in a 36% decrease in memory usage and a 32% reduction in inference time.
  • There was only a minor performance drop of .023 in BERTScore and .038 in faithfulness across four downstream tasks.

This balance between efficiency and performance illustrates the potential for practical application without sacrificing output quality.

Theoretical and Practical Implications

This paper introduces a model-agnostic, complementary perspective on enhancing LLM efficiency. Unlike architecture-level optimizations such as sparse or local attention, Selective Context operates purely on the input text, so it can be combined with existing methods to reduce inference cost further.

Methodological Insights:

  • Self-Information Utilization: By leveraging self-information, which quantifies how surprising (and thus informative) a unit is under a base language model, the method identifies and retains only the most pertinent parts of the context.
  • Adaptive Filtering: A percentile-based filtering mechanism retains content dynamically based on the distribution of self-information values in the current input, ensuring flexibility across diverse contexts (see the sketch after this list).
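
A minimal sketch of the percentile-based filtering step is shown below, assuming each lexical unit already carries an aggregated self-information score; the function name, toy scores, and 50% reduction ratio are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: percentile-based filtering of lexical units by self-information.
# Unit granularity, the toy scores, and the reduction ratio are illustrative assumptions.
import numpy as np

def filter_by_percentile(units, scores, reduction_ratio=0.5):
    """Drop the least-informative fraction of units implied by the target reduction ratio."""
    threshold = np.percentile(scores, reduction_ratio * 100)
    return " ".join(u for u, s in zip(units, scores) if s >= threshold)

units  = ["The meeting", "was held", "on Monday", "as previously announced"]
scores = [4.1, 1.2, 3.5, 0.8]   # toy aggregated self-information per unit
print(filter_by_percentile(units, scores))   # keeps the more informative half
```

Because the threshold is drawn from the score distribution of the current input rather than a fixed cutoff, the same reduction ratio adapts to contexts of varying informativeness.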

Future Directions

The research opens avenues for exploring more granular lexical unit refinement techniques, potentially enhancing the procedure's accuracy. Additionally, the integration of context compression with other efficiency-focused strategies could yield further improvements.

The research showcases the utility of efficiently managing context to reduce computational demands in real-world applications, extending the practicality of LLMs in environments demanding the processing of extensive and complex datasets. The publication of the code and data supports replication and further investigation, contributing to ongoing developments in AI efficiency optimization.

Authors (4)
  1. Yucheng Li (31 papers)
  2. Bo Dong (50 papers)
  3. Chenghua Lin (127 papers)
  4. Frank Guerin (30 papers)
Citations (37)