LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (2310.05736v2)
Abstract: Large language models (LLMs) have been applied across a wide range of applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between LLMs. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23, showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.
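The released code lives at https://aka.ms/LLMLingua; the sketch below is not that implementation. It is a minimal, illustrative Python example of the fine-grained idea the abstract describes: score each token with a small causal language model and keep only the most informative ones under a target compression ratio. The `gpt2` scorer, the `compress_prompt` function name, and the `keep_ratio` threshold are assumptions for demonstration, not the LLMLingua API, and the sketch omits the paper's budget controller, iterative compression, and distribution alignment.

```python
# Illustrative sketch only (NOT the released LLMLingua implementation): drop the
# tokens that a small causal LM finds most predictable, keeping roughly
# `keep_ratio` of the prompt. Model choice and thresholding are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any small causal LM can act as the scorer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def compress_prompt(prompt: str, keep_ratio: float = 0.5) -> str:
    """Keep roughly `keep_ratio` of the tokens, preferring high-information ones."""
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Negative log-likelihood of each token given its prefix; the first token has
    # no prefix, so give it an infinite score and always keep it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (1, seq_len-1)
    scores = torch.cat([torch.full((1, 1), float("inf")), nll], dim=1).squeeze(0)

    # Keep the most surprising (informative) tokens, preserving their original order.
    k = max(1, int(keep_ratio * scores.numel()))
    keep_idx = torch.topk(scores, k).indices.sort().values
    return tokenizer.decode(input_ids[0, keep_idx])


if __name__ == "__main__":
    demo = "Question: A store sold 48 clips in April and half as many in May. How many in total?"
    print(compress_prompt(demo, keep_ratio=0.3))
```

Ranking by per-token negative log-likelihood follows the intuition that low-perplexity tokens are largely predictable from context and can be dropped with little information loss.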
- ShareGPT. 2023. https://sharegpt.com/.
- Types of out-of-distribution texts and how to detect them. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10687–10701, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Token merging: Your ViT but faster. In The Eleventh International Conference on Learning Representations.
- Harrison Chase. 2022. LangChain.
- Adapting language models to compress contexts. ArXiv preprint, abs/2305.14788.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168.
- Language modeling is compression. ArXiv preprint, abs/2309.10668.
- GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems.
- Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning.
- OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations.
- Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance. ArXiv preprint, abs/2305.17306.
- Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations.
- Extensible prompts for language models. ArXiv preprint, abs/2212.00616.
- In-context autoencoder for context compression in a large language model. ArXiv preprint, abs/2307.06945.
- Semantic compression with large language models. ArXiv preprint, abs/2304.12512.
- PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 3690–3699. PMLR.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Gyuwan Kim and Kyunghyun Cho. 2021. Length-adaptive transformer: Train once with length drop, use anytime with search. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6501–6511, Online. Association for Computational Linguistics.
- Learned token pruning for transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 784–794.
- Yucheng Li. 2023. Unlocking context constraints of LLMs: Enhancing context efficiency of LLMs with self-information-based content filtering. ArXiv preprint, abs/2304.12102.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- Self-supervised losses for one-class textual anomaly detection. ArXiv preprint, abs/2204.05695.
- AdapLeR: Speeding up inference by adaptive length reduction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–15, Dublin, Ireland. Association for Computational Linguistics.
- Learning to compress prompts with gist tokens. ArXiv preprint, abs/2304.08467.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Richard Clark Pasco. 1976. Source coding algorithms for fast data compression. Ph.D. thesis, Citeseer.
- DynamicViT: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems.
- Jorma J Rissanen. 1976. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203.
- Claude E Shannon. 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64.
- Ilya Sutskever. 2023. A theory of unsupervised learning. https://simons.berkeley.edu/talks/ilya-sutskever-openai-2023-08-14.
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them. ArXiv preprint, abs/2210.09261.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
- Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621–5634, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Multi-level knowledge distillation for out-of-distribution detection in text. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Long Papers).
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning.
- WizardLM: Empowering large language models to follow complex instructions. ArXiv preprint, abs/2304.12244.
- Inference with reference: Lossless acceleration of large language models. ArXiv preprint, abs/2304.04487.
- XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5754–5764.
- MLCopilot: Unleashing the power of large language models in solving machine learning tasks. ArXiv preprint, abs/2304.14979.
- BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- Efficient prompting via dynamic in-context learning. ArXiv preprint, abs/2305.11170.
Authors: Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu