Training Domain Draft Models for Speculative Decoding: Best Practices and Insights (2503.07807v2)

Published 10 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Speculative decoding is an effective method for accelerating inference of LLMs by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
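
For context on the acceptance rate discussed above, the following is a minimal sketch (not taken from the paper) of the standard speculative decoding propose-and-verify step. The names `draft_model` and `target_model` are hypothetical callables that map a token prefix to a next-token probability distribution over the vocabulary; a real system would batch the target's verification pass and run on GPU.

```python
# A minimal sketch of the speculative decoding accept/reject loop whose
# acceptance rate domain draft-model training aims to improve.
# `draft_model` and `target_model` are hypothetical callables returning a
# next-token probability distribution (numpy array) for a given prefix.
import numpy as np

def speculative_step(prefix, draft_model, target_model, gamma=4, rng=None):
    """Propose `gamma` tokens with the draft model, verify them with the
    target model, and return the accepted continuation."""
    rng = rng or np.random.default_rng()

    # 1. Draft model proposes gamma tokens autoregressively.
    drafted, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_model(ctx)                     # distribution over vocab
        tok = rng.choice(len(q), p=q)
        drafted.append(tok)
        draft_probs.append(q)
        ctx.append(tok)

    # 2. Target model scores every drafted position (done in one batched
    #    forward pass in practice; shown per position here for clarity).
    target_probs = [target_model(list(prefix) + drafted[:i])
                    for i in range(gamma + 1)]

    # 3. Accept each drafted token with probability min(1, p(x)/q(x));
    #    on the first rejection, resample from the residual distribution.
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):
            accepted.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            return accepted  # stop at the first rejection

    # 4. All drafted tokens accepted: take one bonus token from the target.
    p_last = target_probs[gamma]
    accepted.append(rng.choice(len(p_last), p=p_last))
    return accepted
```

The fraction of drafted tokens that survive step 3 is the acceptance rate; the better the draft model matches the (domain-adapted) target distribution, the more tokens are accepted per target forward pass and the larger the speedup.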

Authors (8)
  1. Fenglu Hong
  2. Ravi Raju
  3. Jonathan Lingjie Li
  4. Bo Li
  5. Urmish Thakker
  6. Avinash Ravichandran
  7. Swayambhoo Jain
  8. Changran Hu