Improving Speculative Decoding with Block-Level Draft Verification
Introduction
Speculative decoding has become a prominent approach for accelerating inference in large language models (LLMs): a smaller draft model proposes a block of tokens, and the larger target model verifies them in parallel. However, the prevalent draft verification algorithm, Spectoken, verifies drafted tokens one at a time, which is not necessarily optimal. In this work, we present a novel formulation of the draft verification step as a block-level optimal transport problem, leading to a more efficient verification algorithm that increases the speedup of speculative decoding without incurring extra computational cost.
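For context, the token-level acceptance rule that Spectoken-style verification is built on can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `verify_token` and the list-based distributions `p` (target) and `q` (draft) are our own:

```python
import random

def verify_token(p, q, x):
    """Token-level verification sketch: the draft token x was sampled from
    the draft distribution q; accept it with probability min(1, p[x]/q[x]).
    On rejection, resample from the normalized residual max(p - q, 0),
    which makes the output token exactly distributed according to p."""
    if random.random() < min(1.0, p[x] / q[x]):
        return x, True  # token accepted as-is
    # Rejected: sample a correction from the residual distribution.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    r = random.random() * z
    acc = 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r <= acc:
            return tok, False
    return len(p) - 1, False  # numerical fallback
```

Because each position is verified independently, the first rejection ends the block; the block-level view below asks instead how to maximize the accepted prefix length in expectation.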
The Optimal Transport Problem Formulation
The crux of our approach is the formulation of draft verification as an optimal transport problem at the block level, with the objective of maximizing the expected number of accepted tokens per draft block, a quantity that directly determines the decoding speedup. We propose a verification algorithm that achieves the optimal acceptance length for this block-level transport problem, yielding a clear improvement over previously used token-level verification methods.
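One plausible way to write this objective down (the notation here is ours, for illustration only) is as a coupling problem over length-$L$ blocks:

```latex
% Illustrative formalization; X (draft block), Y (output block),
% q (draft model), p (target model), and LCP are our notation.
\max_{\pi \in \Pi(q,\, p)} \;
  \mathbb{E}_{(X_{1:L},\, Y_{1:L}) \sim \pi}
  \bigl[\, \lvert \mathrm{LCP}(X_{1:L},\, Y_{1:L}) \rvert \,\bigr]
```

where $\Pi(q, p)$ is the set of couplings whose marginals are the draft and target block distributions, and LCP denotes the longest common prefix, so the objective counts the drafted tokens that survive verification. The marginal constraint on $Y$ is what keeps the output distribution exactly that of the large model.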
The Proposed Algorithm: Specblock
Our proposed algorithm, Specblock, computes the acceptance decision efficiently, without any additional calls to either the small or the large model. It first attempts a maximal coupling on the entire block. If that fails, it runs a backward induction, deciding how long a prefix of the draft to accept based on the remaining and rejected probability masses of the draft's continuations. This backward induction ensures that the accepted tokens and the subsequent correction align with the distribution defined by the large model, while the conditional distributions for corrected tokens are computed efficiently.
Experimental Validation
We compared Specblock against the standard Spectoken algorithm across a range of datasets and tasks, including language modeling, reasoning queries, summarization, and translation. Our experiments show consistent improvements in both block efficiency and wall-clock speedup with Specblock. In particular, the gains grow with the block length, demonstrating the scalability of our approach.
Theoretical Justification
A formal analysis shows that Specblock is optimal with respect to the formulated block-level optimal transport problem: it achieves the maximum expected accepted length, providing a theoretical underpinning for the empirical improvements we observe.
Future Directions
Our work opens several avenues for future exploration. Notably, combining an optimized drafting phase with our improved draft verification algorithm is a promising direction for further enhancing the efficiency of speculative decoding. Exploring the implications of block-level verification beyond speculative decoding, in the broader context of accelerating LLMs, also warrants attention.
Conclusion
Specblock represents a significant advancement in the pursuit of efficient speculative decoding by optimizing the draft verification phase through block-level verification. This approach not only achieves theoretical optimality but also demonstrates practical improvements in speedup across a spectrum of tasks and datasets. As LLMs continue to grow in size and computational demand, innovations like Specblock will be vital in making these models more accessible and practical for a broader range of applications.