AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability (2410.18351v1)
Abstract: Speculative decoding is a powerful technique that attempts to circumvent the autoregressive constraint of modern LLMs. The aim of speculative decoding techniques is to improve the average inference time of a large, target model without sacrificing its accuracy, by using a more efficient draft model to propose draft tokens, which are then verified in parallel. The number of draft tokens produced in each drafting round is referred to as the draft length and is often a static hyperparameter chosen based on the acceptance rate statistics of the draft tokens. However, a static draft length can hurt performance, especially in scenarios where drafting is expensive and the number of accepted tokens varies widely. Adaptive Entropy-based Draft Length (AdaEDL) is a simple, training- and parameter-free criterion that allows for early stopping of the token drafting process by approximating a lower bound on the expected acceptance probability of the drafted token based on the currently observed entropy of the drafted logits. We show that AdaEDL consistently outperforms static-draft-length speculative decoding by 10%-57%, and other training-free draft-stopping techniques by up to 10%, across a variety of settings and datasets. At the same time, we show that AdaEDL is more robust than these techniques and preserves performance in high-sampling-temperature scenarios. Since it is training-free, in contrast to techniques that rely on training dataset-specific draft-stopping predictors, AdaEDL can be seamlessly integrated into a variety of pre-existing LLM systems.
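To make the idea concrete, the sketch below shows how an entropy-based early-stop check could be plugged into a single drafting round. This is a minimal illustration, not the paper's implementation: the lower-bound expression `1 - sqrt(1 - exp(-H))` is an assumed stand-in for the bound derived in the paper (any monotonically decreasing function of the draft entropy H could play this role), and `draft_model`, `max_draft_len`, and `stop_threshold` are hypothetical names and knobs introduced here for illustration.

```python
import torch


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution given raw logits."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def estimated_acceptance_lower_bound(draft_logits: torch.Tensor) -> torch.Tensor:
    """Map the entropy of the drafted logits to a proxy lower bound on the
    probability that the target model will accept the next drafted token.

    NOTE: `1 - sqrt(1 - exp(-H))` is an illustrative, monotonically decreasing
    function of the entropy H, used here as a stand-in; it is not claimed to be
    the exact bound derived in the paper.
    """
    h = token_entropy(draft_logits)
    return 1.0 - torch.sqrt(1.0 - torch.exp(-h))


def draft_tokens_adaptively(draft_model, prefix_ids: torch.Tensor,
                            max_draft_len: int = 8,
                            stop_threshold: float = 0.4) -> torch.Tensor:
    """One drafting round with adaptive early stopping.

    Keeps drafting tokens until either `max_draft_len` is reached or the
    entropy-based acceptance estimate drops below `stop_threshold` (both are
    tunable knobs chosen here for illustration, not values from the paper).
    `draft_model` is assumed to return next-token logits of shape (vocab_size,)
    for a given 1-D sequence of token ids.
    """
    drafted = []
    ids = prefix_ids
    for _ in range(max_draft_len):
        logits = draft_model(ids)  # hypothetical draft-model call
        # Stop drafting if the entropy of the draft distribution suggests the
        # next token is unlikely to survive parallel verification.
        if estimated_acceptance_lower_bound(logits) < stop_threshold:
            break
        next_id = torch.distributions.Categorical(logits=logits).sample()
        drafted.append(next_id)
        ids = torch.cat([ids, next_id.view(1)])
    return torch.stack(drafted) if drafted else prefix_ids.new_empty(0)
```

Because the check reuses logits the draft model already produces, it adds negligible per-token overhead and, unlike learned draft-stopping predictors, requires no training or extra parameters.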