
Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism (2506.01979v1)

Published 16 May 2025 in cs.DC and cs.AI

Abstract: Speculative decoding (SD) has recently emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance and validating them in parallel with the large target model. However, existing SD methods remain fundamentally constrained by serialized execution, which creates mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from branch prediction in modern processors and propose SpecBranch, a novel framework that unlocks branch parallelism in SD. Specifically, we first conduct an in-depth analysis of the potential of branch parallelism in SD and identify that the key challenge lies in the trade-off between parallelization and token rollback. Based on this analysis, we strategically introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to enhance parallelism, we jointly orchestrate adaptive draft lengths with a hybrid combination of implicit draft-model confidence and explicit reuse of target-model features. Extensive experiments across various models and benchmarks show that SpecBranch achieves 1.8× to 4.5× speedups over auto-regressive decoding and reduces rollback tokens by 50% for poorly aligned models, demonstrating its applicability to real-world deployments.
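The baseline loop the abstract builds on — a small draft model proposing tokens that the target model then verifies in one parallel forward pass, with drafting cut short when the draft model's confidence drops — can be sketched as below. This is a minimal illustration of serialized SD with greedy verification and a confidence-based adaptive draft length, not the paper's SpecBranch implementation: it omits the parallel speculative branches and the explicit reuse of target-model features, and the HuggingFace-style `.logits` model interface, threshold value, and function names are assumptions made for the sketch.

```python
import torch


@torch.no_grad()
def speculative_decode_step(target_model, draft_model, input_ids,
                            max_draft_len=8, confidence_threshold=0.5):
    """One serialized speculative-decoding step (greedy verification).

    The draft model proposes tokens until its top-1 probability falls below
    `confidence_threshold` (a simple adaptive draft length), then the target
    model verifies the entire draft in a single parallel forward pass.
    Assumes batch size 1 and HuggingFace-style causal LMs returning `.logits`.
    """
    draft_ids = input_ids
    draft_tokens = []

    # Draft phase: stop early when the draft model becomes unsure.
    for _ in range(max_draft_len):
        logits = draft_model(draft_ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        conf, token = probs.max(dim=-1)          # shapes: (1,), (1,)
        draft_tokens.append(token)
        draft_ids = torch.cat([draft_ids, token.unsqueeze(-1)], dim=-1)
        if conf.item() < confidence_threshold:
            break

    # Verification phase: one parallel pass of the target model over the
    # prefix plus all drafted tokens.
    target_logits = target_model(draft_ids).logits
    accepted = []
    for i, token in enumerate(draft_tokens):
        # Logits at position p predict the token at position p + 1, so the
        # i-th drafted token is checked against position len(prefix) - 1 + i.
        pos = input_ids.shape[-1] - 1 + i
        target_token = target_logits[:, pos, :].argmax(dim=-1)
        if torch.equal(target_token, token):
            accepted.append(token)
        else:
            # First mismatch: discard (roll back) the remaining draft tokens
            # and keep the target model's own prediction instead.
            accepted.append(target_token)
            break
    else:
        # Every drafted token was accepted; append the target's next token.
        accepted.append(target_logits[:, -1, :].argmax(dim=-1))

    new_tokens = torch.stack(accepted, dim=-1)   # (1, num_accepted)
    return torch.cat([input_ids, new_tokens], dim=-1)
```

Relative to this serialized sketch, SpecBranch overlaps the two phases: while the target model verifies one draft, additional speculative branches are drafted in parallel as hedges against likely rejections, with the confidence and feature-reuse signals steering draft lengths so that rollback stays small.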

Authors (6)
  1. Yuhao Shen (1 paper)
  2. Junyi Shen (12 papers)
  3. Quan Kong (20 papers)
  4. Tianyu Liu (177 papers)
  5. Yao Lu (212 papers)
  6. Cong Wang (310 papers)