Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs (2405.15208v1)
Abstract: LLMs have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process, posing challenges for real-time applications. This paper introduces Lexical Unit Decoding (LUD), a novel, data-driven decoding methodology that accelerates the decoding process without sacrificing output quality. The core of our approach is the observation that a pre-trained LLM can confidently predict multiple contiguous tokens, which together form a \textit{lexical unit}; the tokens within such a unit can be decoded in parallel. Extensive experiments validate that our method substantially reduces decoding time while maintaining generation quality, achieving a 33\% speed-up on natural language generation with no quality loss and a 30\% speed-up on code generation with a negligible quality loss of 3\%. Distinctively, LUD requires neither auxiliary models nor changes to existing architectures. It can also be integrated with other decoding acceleration methods, yielding an even greater boost in inference efficiency. We posit that the foundational principles of LUD could define a new decoding paradigm for future LLMs, enhancing their applicability across a broader spectrum of applications. All code is publicly available at https://github.com/tjunlp-lab/Lexical-Unit-Decoding-LUD-. Keywords: Parallel Decoding, Lexical Unit Decoding, LLM
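The abstract describes the mechanism only at a high level, so the following is a minimal illustrative sketch of the general idea, confidence-gated acceptance of contiguous proposed tokens, not the paper's exact LUD procedure. The function name `accept_parallel_tokens`, the 0.9 threshold, and the toy distributions are assumptions introduced purely for illustration.

```python
# Minimal sketch: accept the longest contiguous prefix of proposed tokens whose
# top-1 probability clears a confidence threshold; anything past that point
# falls back to ordinary one-token-at-a-time decoding. This is an assumed
# acceptance rule for illustration, not the paper's exact LUD algorithm.
import numpy as np


def accept_parallel_tokens(step_probs: np.ndarray, threshold: float = 0.9):
    """step_probs: per-position next-token distributions proposed in one
    forward pass, shape [num_positions, vocab_size]. Returns the token ids
    accepted in parallel."""
    accepted = []
    for probs in step_probs:
        top_token = int(np.argmax(probs))
        if probs[top_token] < threshold:
            break  # model is not confident enough; stop parallel acceptance here
        accepted.append(top_token)
    # Always emit at least one token so decoding makes progress.
    if not accepted:
        accepted.append(int(np.argmax(step_probs[0])))
    return accepted


if __name__ == "__main__":
    vocab = 10
    # Toy distributions: positions 0 and 1 are confident, position 2 is not.
    flat = np.full(vocab, 1.0 / vocab)                 # uniform, low confidence
    conf_a = np.eye(vocab)[3] * 0.95 + 0.05 / vocab    # peaked at token 3
    conf_b = np.eye(vocab)[7] * 0.92 + 0.08 / vocab    # peaked at token 7
    step_probs = np.stack([conf_a, conf_b, flat])
    print(accept_parallel_tokens(step_probs))          # -> [3, 7]
```

In this toy run, only the confident prefix (tokens 3 and 7) is emitted in one step; the low-confidence third position is deferred, which mirrors the abstract's claim that contiguous, confidently predicted tokens can be decoded in parallel without changing the underlying architecture.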