
DistillSpec: Improving Speculative Decoding via Knowledge Distillation (2310.08461v2)

Published 12 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Speculative decoding (SD) accelerates LLM inference by employing a faster draft model to generate multiple tokens, which are then verified in parallel by the larger target model, so that the generated text still follows the target model's distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec, which uses knowledge distillation to better align the draft model with the target model before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10-45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
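The verification step the abstract describes can be illustrated with the standard speculative sampling accept/reject rule (a minimal sketch of the general SD technique, not the paper's implementation): a drafted token with target probability p and draft probability q is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p - q), which is what guarantees the output matches the target distribution. The toy distributions below are illustrative.

```python
# Minimal sketch of the speculative decoding accept/reject rule.
# Toy vocabulary and probabilities are assumptions for illustration.

def accept_prob(p_tok: float, q_tok: float) -> float:
    """Probability of accepting a drafted token: min(1, p/q)."""
    return min(1.0, p_tok / q_tok)

def residual_dist(p: dict, q: dict) -> dict:
    """On rejection, resample from the normalized residual max(0, p - q)."""
    resid = {t: max(0.0, p[t] - q[t]) for t in p}
    z = sum(resid.values())
    return {t: v / z for t, v in resid.items()}

# Target (p) and draft (q) distributions over a 3-token vocabulary.
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.3, "b": 0.5, "c": 0.2}

print(accept_prob(p["b"], q["b"]))  # draft over-proposes "b": accepted w.p. 0.6
print(residual_dist(p, q))          # rejected drafts fall back to "a" here
```

The better aligned the draft is with the target (the goal of DistillSpec's distillation), the closer min(1, p/q) is to 1 on drafted tokens, so more speculated tokens survive verification and decoding gets faster.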

Authors (8)
  1. Yongchao Zhou (7 papers)
  2. Kaifeng Lyu (28 papers)
  3. Ankit Singh Rawat (64 papers)
  4. Aditya Krishna Menon (56 papers)
  5. Afshin Rostamizadeh (35 papers)
  6. Sanjiv Kumar (123 papers)
  7. Jean-François Kagy (4 papers)
  8. Rishabh Agarwal (47 papers)
Citations (64)