
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding (2402.02057v1)

Published 3 Feb 2024 in cs.LG and cs.CL

Abstract: Autoregressive decoding of LLMs is memory bandwidth bounded, resulting in high latency and significant wastes of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding often require a draft model (e.g., speculative decoding), which is nontrivial to obtain and unable to generalize. In this paper, we introduce Lookahead decoding, an exact, parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores. It allows trading per-step log(FLOPs) to reduce the number of total decoding steps, is more parallelizable on single or multiple modern accelerators, and is compatible with concurrent memory-efficient attention (e.g., FlashAttention). Our implementation of Lookahead decoding can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks. Our code is available at https://github.com/hao-ai-lab/LookaheadDecoding

Lookahead Decoding: Enhancing LLM Inference

The paper "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding" addresses a fundamental challenge in the deployment of LLMs: the inefficiency of autoregressive decoding. Autoregressive decoding, a prevalent method for generating sequences in LLMs, has traditionally relied on generating one token at a time. This process not only results in high latency but also underutilizes the parallel processing capabilities of modern accelerators, such as GPUs, due to its memory bandwidth-bound nature.

Key Contributions

The paper introduces Lookahead Decoding, a novel algorithm that accelerates LLM decoding by fundamentally rethinking how sequences are generated. Unlike methods such as speculative decoding, which depend on auxiliary draft models, Lookahead Decoding requires no additional models and instead exploits the parallelizable structure of sequence generation itself.

  1. Parallel Decoding through n-grams: Lookahead Decoding formulates decoding as solving a non-linear system via the fixed-point Jacobi iteration method. This allows multiple tokens to be generated in parallel, and several disjoint n-grams can be integrated into the final output within a single step (a simplified sketch of the underlying iteration follows this list).
  2. Efficiency and Compatibility: The algorithm trades additional per-step computation for fewer total decoding steps, yielding up to a 1.8x speedup on chat workloads and up to 4x on code completion tasks with strong scaling across multiple GPUs. It also remains compatible with memory-efficient attention mechanisms such as FlashAttention.
  3. Scalability: Lookahead Decoding exhibits strong scaling behavior: the number of decoding steps decreases roughly linearly with the logarithm of per-step FLOPs. This property is particularly beneficial for latency-sensitive tasks deployed across multiple GPUs.
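As a concrete illustration of the Jacobi-style iteration in item 1, the following is a minimal, greedy-only sketch assuming a Hugging Face GPT-2 model. It is not the authors' implementation: the n-gram pool, lookahead branch, and verification branch of the real algorithm are omitted, and the random initialization and refill of the guess window are arbitrary simplifications.

```python
# Simplified Jacobi-style parallel decoding: guess a window of future tokens,
# refine all of them with one forward pass, and accept the longest prefix of
# guesses that the model's own greedy predictions confirm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

accepted = tokenizer("The quick brown fox", return_tensors="pt").input_ids
window = 8                                                  # guessed future tokens per step
vocab = model.config.vocab_size
guess = torch.randint(0, vocab, (1, window))                # arbitrary initial guesses

with torch.no_grad():
    for _ in range(20):
        # One forward pass over the accepted prefix plus all current guesses.
        ids = torch.cat([accepted, guess], dim=-1)
        logits = model(ids).logits
        # Greedy predictions for every guessed position, computed in parallel.
        preds = logits[:, accepted.shape[1] - 1 : -1, :].argmax(dim=-1)
        # Accept the longest prefix of guesses matching these predictions;
        # that prefix is exactly what sequential greedy decoding would emit.
        match = (preds == guess)[0].int()
        n_accept = int(torch.cumprod(match, dim=0).sum())
        accepted = torch.cat([accepted, preds[:, :n_accept]], dim=-1)
        # Jacobi update: remaining positions take the new predictions as the
        # next guesses; the window is refilled with fresh random guesses.
        guess = torch.cat([preds[:, n_accept:],
                           torch.randint(0, vocab, (1, n_accept))], dim=-1)

print(tokenizer.decode(accepted[0]))
```

In the full algorithm, the same per-step forward pass also maintains a pool of n-grams harvested from past iterations and verifies candidate n-grams, which is where the multi-token-per-step gains come from.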

Numerical Results

The paper presents compelling numerical results. On the MT-Bench multi-turn chat dataset, Lookahead Decoding achieved a speedup of 1.8x, while code completion tasks saw up to a 4x performance increase with Lookahead Parallelism on 8 GPUs.

Implications and Future Directions

The introduction of Lookahead Decoding offers substantial implications for both the theoretical understanding and the practical deployment of LLMs. By eschewing additional models and focusing on inherent parallelization opportunities, the approach significantly reduces latency while preserving the model's output distribution exactly. It paves the way for further exploration of non-sequential decoding strategies that could harness modern hardware architectures more effectively.

Practically, this methodology can be immediately impactful in fields requiring rapid LLM deployment, such as real-time translation or interactive AI applications. Theoretically, it challenges existing paradigms of LLM inference and encourages future research to explore alternative parallel decoding mechanisms that could further diminish reliance on sequential processes.

Future work might investigate extending Lookahead Decoding to other architectures or application domains beyond NLP, potentially enhancing a wide array of sequence generation tasks. Moreover, the integration of Lookahead Decoding with newly emerging hardware accelerators could uncover additional layers of parallelism and efficiency.

In conclusion, Lookahead Decoding represents a significant advancement in the optimization of LLM inference, offering a promising direction for achieving lower latency inference while maximizing computational resources.

Authors (4)
  1. Yichao Fu
  2. Peter Bailis
  3. Ion Stoica
  4. Hao Zhang