Compressing Context to Enhance Inference Efficiency of Large Language Models
Abstract: Large language models (LLMs) have achieved remarkable performance across various tasks. However, they struggle to manage long documents and extended conversations: computational cost, in both memory and inference time, grows sharply with input length, and input that exceeds the LLM's fixed context length is truncated. This paper proposes Selective Context, a method that enhances the inference efficiency of LLMs by identifying and pruning redundancy in the input context to make the input more compact. We test our approach on common data sources requiring long context processing (arXiv papers, news articles, and long conversations) across the tasks of summarisation, question answering, and response generation. Experimental results show that Selective Context significantly reduces memory cost and generation latency while maintaining performance comparable to that achieved with the full context. Specifically, a 50% reduction in context cost yields a 36% reduction in inference memory usage and a 32% reduction in inference time, with only a minor drop of 0.023 in BERTscore and 0.038 in faithfulness on four downstream applications, indicating that our method strikes a good balance between efficiency and performance.
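The core idea of pruning redundant context can be sketched as scoring each lexical unit by its self-information, I(t) = -log p(t) (Shannon, 1948), and retaining only the most informative fraction. This is a minimal illustrative sketch, not the paper's implementation: the paper scores units with a causal language model, whereas here a hypothetical unigram model estimated from the context itself stands in for the scorer, and `keep_ratio` is an assumed parameter name.

```python
import math
from collections import Counter

def self_information(tokens, probs):
    # I(t) = -log2 p(t): rarer units carry more information (Shannon, 1948).
    return {i: -math.log2(probs[t]) for i, t in enumerate(tokens)}

def selective_context(tokens, keep_ratio=0.5):
    # Hypothetical stand-in for the paper's causal-LM scorer:
    # a unigram model estimated from the context itself.
    counts = Counter(tokens)
    total = len(tokens)
    probs = {t: c / total for t, c in counts.items()}
    info = self_information(tokens, probs)
    # Keep the top keep_ratio fraction of units by self-information.
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(info, key=info.get, reverse=True)[:k])
    # Preserve the original order of the retained units.
    return [t for i, t in enumerate(tokens) if i in keep]
```

With `keep_ratio=0.5`, frequent (low-information) repetitions such as articles are dropped first, while rarer content words survive, mirroring the paper's 50% context-cost reduction setting.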
- Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- Tom Brown, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Sébastien Bubeck, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- Razvan Bunescu and Oseremen O Uduehi. 2022. Distribution-based measures of surprise for creative language: Experiments with humor and metaphor. In Proceedings of the 3rd Workshop on Figurative Language Processing (FLP), pages 68–78.
- Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788.
- Wei-Lin Chiang, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Peter Clark, et al. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- Dan Hendrycks, et al. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Jesse Mu, Xiang Lisa Li, and Noah Goodman. 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Claude E Shannon. 1948. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423.
- Hugo Touvron, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. arXiv preprint arXiv:2004.04228.
- Ernst-Jan Wit and Marie Gillette. 1999. What is linguistic redundancy? Technical report, University of Chicago.
- Susan Zhang, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.