
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding (2410.05589v1)

Published 8 Oct 2024 in cs.CL and cs.LG

Abstract: Speculative decoding has proven to be an efficient solution to LLM inference: a small drafter predicts future tokens at low cost, and the target model verifies them in parallel. However, most existing works still draft tokens auto-regressively to maintain the sequential dependency of language modeling, which we consider a heavy computational burden in speculative decoding. We present ParallelSpec, an alternative to the auto-regressive drafting strategies used in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model. ParallelSpec learns to predict multiple future tokens in parallel using a single model, and it can be integrated, at minimal training cost, into any speculative decoding framework that requires aligning the output distributions of the drafter and the target model. Experimental results show that ParallelSpec reduces the latency of baseline methods by up to 62% on text generation benchmarks from different domains, and it achieves a 2.84x overall speedup on the Llama-2-13B model under third-party evaluation criteria.
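To make the draft-then-verify loop in the abstract concrete, here is a minimal sketch of one speculative-decoding step in a greedy (deterministic) setting. The `draft_fn` and `target_fn` callables are hypothetical stand-ins, not the paper's trained parallel drafter or its distribution-alignment training; the key contrast ParallelSpec draws is that `draft_fn` proposes all k tokens in a single call rather than k auto-regressive drafter calls.

```python
# Toy sketch of one speculative decoding step (greedy variant).
# draft_fn and target_fn are hypothetical stand-ins for illustration only.

def speculative_step(prefix, draft_fn, target_fn, k=4):
    """One draft-then-verify step.

    draft_fn(prefix, k) -> k proposed next tokens in a single call
                           (parallel drafting, vs. k sequential calls).
    target_fn(prefix)   -> the target model's greedy next token.
    Returns the tokens accepted this step (always at least one).
    """
    drafts = draft_fn(prefix, k)
    accepted = []
    for tok in drafts:
        # The target verifies each draft position; in a real system all
        # k positions are scored in one batched forward pass.
        expected = target_fn(prefix + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
    # All k drafts accepted; the target contributes one bonus token.
    accepted.append(target_fn(prefix + accepted))
    return accepted
```

A quick usage example with a toy target that always continues the sequence `a b c d e f`: a perfect drafter yields k accepted tokens plus one bonus token per step, while a drafter that errs at position 2 still yields the correct two-token continuation.

```python
TARGET = ["a", "b", "c", "d", "e", "f"]
target = lambda p: TARGET[len(p)]                 # toy greedy target model
good_draft = lambda p, k: TARGET[len(p):len(p)+k] # drafter always right
bad_draft = lambda p, k: ["a", "x", "y"][:k]      # drafter wrong at step 2

print(speculative_step([], good_draft, target, k=3))  # ['a', 'b', 'c', 'd']
print(speculative_step([], bad_draft, target, k=3))   # ['a', 'b']
```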

Authors (6)
  1. Zilin Xiao (9 papers)
  2. Hongming Zhang (111 papers)
  3. Tao Ge (53 papers)
  4. Siru Ouyang (22 papers)
  5. Vicente Ordonez (52 papers)
  6. Dong Yu (329 papers)
Citations (1)
