
Block Verification Accelerates Speculative Decoding (2403.10444v2)

Published 15 Mar 2024 in cs.LG, cs.CL, cs.DS, cs.IT, and math.IT

Abstract: Speculative decoding is an effective method for lossless acceleration of LLMs during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups over the standard token verification algorithm of 5%-8% in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.

Improving Speculative Decoding with Block-Level Draft Verification

Introduction

Speculative decoding has become a prominent approach for accelerating LLM inference: a smaller model drafts blocks of tokens, which the larger target model then verifies in parallel. However, the prevalent draft verification algorithm, Spectoken, verifies tokens independently, token by token, and this turns out not to be optimal. In this work, we present a novel formulation of the draft verification step as a block-level optimal transport problem, leading to a more efficient draft verification algorithm that increases the speedup of speculative decoding without incurring extra computational cost.
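To make the baseline concrete, the following is a minimal sketch of the standard token-level verification step described above. The function name, the use of NumPy arrays for per-position vocabulary distributions, and the convention that p_probs carries one extra distribution for the bonus token are assumptions of this sketch, not code from the paper.

```python
import numpy as np

def token_level_verify(draft_tokens, q_probs, p_probs, rng):
    """Standard token-by-token draft verification (the Spectoken baseline).

    draft_tokens: list of L token ids proposed by the small (draft) model.
    q_probs[i]:   draft-model distribution over the vocabulary at position i.
    p_probs[i]:   target-model distribution at position i; p_probs has L + 1
                  entries, the last one used when every draft token is accepted.
    rng:          e.g. np.random.default_rng().
    Returns the verified tokens, distributed as a sample from the target model.
    """
    output = []
    for i, x in enumerate(draft_tokens):
        # Accept x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p_probs[i][x] / q_probs[i][x]):
            output.append(x)
        else:
            # On rejection, sample a correction from the residual distribution
            # proportional to max(p - q, 0), then stop.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            output.append(int(rng.choice(len(residual), p=residual)))
            return output
    # All draft tokens accepted: sample one bonus token from the target model.
    output.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return output
```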

The Optimal Transport Problem Formulation

Our approach hinges on formulating the draft verification process as an optimal transport problem at the block level, with the aim of maximizing the expected number of accepted tokens per draft block, a quantity that directly determines the decoding speed-up. We propose a verification algorithm that achieves the optimal acceptance length for this block-level transport problem, yielding a clear improvement over the previously used token-level verification methods.
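Stated schematically, with notation chosen for this summary rather than taken from the paper, the verifier is a coupling between the drafted block and the emitted tokens whose output marginal must equal the target model's distribution, and the objective is the expected length of the agreeing prefix:

```latex
\max_{\pi \in \Pi(q,\, p)} \;
\mathbb{E}_{(X_{1:L},\, Y) \sim \pi}\!\left[\beta(X_{1:L}, Y)\right],
\qquad
\beta(x, y) = \max\{\, i \in \{0, \dots, L\} : x_{1:i} = y_{1:i} \,\},
```

where Pi(q, p) denotes joint distributions whose first marginal is the draft model's block distribution q and whose second marginal is the target model's distribution p; the lossless guarantee of speculative decoding is precisely the constraint on the second marginal.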

The Proposed Algorithm: Specblock

Our proposed algorithm, Specblock, computes the acceptance length efficiently without requiring additional calls to either the small or the large model. It first attempts a maximal coupling on the entire block. If that fails, it runs a backward induction over shorter prefixes, deciding on partial acceptance based on the remaining and rejected probability masses for continuations of the draft. This backward induction ensures that the accepted tokens and the subsequent correction token follow the distribution defined by the large model, while efficiently computing the conditional distributions from which corrected tokens are sampled.
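As an illustration of the first step only, the sketch below performs the whole-block maximal-coupling test, accepting the entire draft with probability min(1, p(x_{1:L}) / q(x_{1:L})). The backward-induction branch that handles rejection, which is where Specblock departs from token-level verification, is left as a placeholder comment because it requires the residual-mass bookkeeping described above. Names and conventions mirror the earlier sketch and are likewise assumptions, not the paper's code.

```python
import numpy as np

def whole_block_acceptance(draft_tokens, q_probs, p_probs, rng):
    """First step of block-level verification: try to accept the whole draft.

    The acceptance probability min(1, p(x_{1:L}) / q(x_{1:L})) realizes a
    maximal coupling of the two block distributions evaluated on the draft.
    Returns True if the full block is accepted; otherwise the caller falls
    back to the backward-induction step over shorter prefixes (not shown).
    """
    log_ratio = 0.0
    for i, x in enumerate(draft_tokens):
        p_i, q_i = p_probs[i][x], q_probs[i][x]
        if p_i == 0.0:
            return False  # target assigns zero mass to this draft prefix
        # Accumulate log p(x_i | prefix) - log q(x_i | prefix) to avoid underflow.
        log_ratio += np.log(p_i) - np.log(q_i)
    accept_prob = min(1.0, float(np.exp(log_ratio)))
    if rng.random() < accept_prob:
        return True  # emit all L draft tokens, then sample a bonus token from p
    # Rejected: run the backward induction over shorter prefixes (omitted here).
    return False
```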

Experimental Validation

We compared Specblock against the standard Spectoken algorithm across a range of datasets and tasks, including language modeling, reasoning queries, summarization, and translation. Our experiments show consistent improvements in both block efficiency and wall-clock speedup with Specblock. In particular, the gains grow with the draft block length, showcasing the scalability of our approach.

Theoretical Justification

A formal analysis shows that Specblock is optimal with respect to the formulated block-level optimal transport problem: it achieves the maximum expected accepted length, providing a theoretical underpinning for the empirical improvements observed.

Future Directions

Our work opens several avenues for future exploration. Notably, the combination of optimizing the drafting phase with our improved draft verification algorithm presents a promising direction for further enhancing speculative decoding's efficiency. Additionally, exploring the implications of block-level verification beyond speculative decoding in the broader context of accelerating LLMs warrants attention.

Conclusion

Specblock advances efficient speculative decoding by optimizing the draft verification phase through block-level verification. The approach is provably optimal for the block-level formulation and delivers consistent practical speedups across a spectrum of tasks and datasets. As LLMs continue to grow in size and computational demand, improvements like Specblock help make these models more practical for a broader range of applications.

Authors (7)
  1. Ziteng Sun
  2. Jae Hun Ro
  3. Ahmad Beirami
  4. Ananda Theertha Suresh
  5. Uri Mendlovic
  6. Yaniv Leviathan
  7. Asaf Aharoni