Online Speculative Decoding (2310.07177v4)

Published 11 Oct 2023 in cs.AI, cs.CL, and cs.LG

Abstract: Speculative decoding is a pivotal technique to accelerate the inference of LLMs by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. Adapting to query distribution mitigates the shifts between the training distribution of the draft model and the query distribution, enabling the draft model to more accurately predict the target model's outputs. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, bringing 1.42x to 2.17x latency reduction. Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.

Insights into "Online Speculative Decoding"

This paper presents a methodological advancement in the inference process of LLMs by introducing a technique termed "Online Speculative Decoding". The work addresses the challenge of optimizing the latency of LLMs, which is crucial given their increasing deployment in applications with stringent latency requirements such as search engines, chatbots, and virtual assistants.

Speculative decoding uses a smaller draft model to propose candidate output tokens, which the target LLM then verifies in a single parallel pass. Because several drafted tokens can be accepted per target forward pass, generation is accelerated without altering the target model's output distribution. The primary bottleneck of this method is the accuracy of the draft model's predictions, especially when there is a significant capability gap between the draft and the target model: the lower the acceptance rate, the fewer drafted tokens survive each verification pass.
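
To make the draft-then-verify loop concrete, below is a minimal sketch of one standard speculative-decoding step. It is not code from this paper: `speculative_step`, `draft_probs`, and `target_probs` are hypothetical names, with the two callables standing in for the draft and target models' next-token distributions, and the accept/reject rule follows the standard rejection-sampling formulation of speculative decoding.

```python
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, k=4, rng=None):
    """One speculative-decoding step: draft k tokens, then verify with the target.

    draft_probs / target_probs: hypothetical callables mapping a list of token
    ids to a next-token probability vector (stand-ins for the two models).
    """
    rng = rng or np.random.default_rng()

    # 1) The small draft model proposes k tokens autoregressively (cheap).
    ctx, proposed, q = list(prefix), [], []
    for _ in range(k):
        p = draft_probs(ctx)
        tok = int(rng.choice(len(p), p=p))
        proposed.append(tok)
        q.append(p)
        ctx.append(tok)

    # 2) The target model scores all drafted positions in one parallel pass.
    p_target = [target_probs(list(prefix) + proposed[:i]) for i in range(k)]

    # 3) Accept/reject so the output distribution matches the target exactly.
    out = []
    for i, tok in enumerate(proposed):
        if rng.random() < min(1.0, p_target[i][tok] / q[i][tok]):
            out.append(tok)  # drafted token accepted
        else:
            # On rejection, resample from the normalized residual max(p - q, 0).
            residual = np.maximum(np.asarray(p_target[i]) - np.asarray(q[i]), 0.0)
            s = residual.sum()
            residual = residual / s if s > 0 else np.asarray(p_target[i])
            out.append(int(rng.choice(len(residual), p=residual)))
            break  # stop at the first rejection
    # (When all k tokens are accepted, a bonus token from the target is usually
    #  sampled as well; omitted here for brevity.)
    return out
```

The more drafted tokens that survive each verification pass, the fewer expensive target passes are needed per generated token; this acceptance rate is exactly the quantity online speculative decoding aims to improve.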

The paper introduces an innovative method called online speculative decoding, which dynamically updates the draft model based on live query data, leveraging the surplus computational power typically found in LLM serving clusters. This approach involves online knowledge distillation to improve the draft model's predictive performance on the current query distribution. Such continuous adaptation allows for real-time alignment of the draft model with the prevailing query patterns, thus mitigating the effects of distribution shifts and enhancing the overall efficiency of speculative decoding.
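
A rough illustration of such a distillation update is sketched below under explicit assumptions: `distill_update`, `draft_model`, `input_ids`, `target_logits`, and `optimizer` are hypothetical names, the modules are assumed to be HuggingFace-style and return `.logits`, and the forward-KL loss and update cadence are illustrative choices rather than the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def distill_update(draft_model, input_ids, target_logits, optimizer, temperature=1.0):
    """One knowledge-distillation step of the draft model toward the target.

    input_ids:     [B, L] token ids from observed user queries/responses.
    target_logits: [B, L, V] logits recorded from the target model during
                   verification (treated as the teacher; no gradients needed).
    """
    draft_logits = draft_model(input_ids).logits  # [B, L, V], student
    loss = F.kl_div(
        F.log_softmax(draft_logits / temperature, dim=-1),        # student log-probs
        F.log_softmax(target_logits.detach() / temperature, dim=-1),  # teacher log-probs
        log_target=True,
        reduction="batchmean",
    ) * (temperature ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a serving cluster, updates of this kind can run in the background on spare accelerator cycles, consuming query/target-logit pairs that verification produces anyway, so the draft model tracks the live query distribution at little additional cost.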

The authors report substantial improvements in the token acceptance rate, with increases of 0.1 to 0.65, which in turn yields a latency reduction of 1.42x to 2.17x as measured across several popular LLMs using both synthetic and real query data. These results represent a significant performance improvement over static draft models constructed through offline methods.
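
The connection between acceptance rate and latency can be made explicit with the standard speculative-decoding analysis (Leviathan et al., 2023), which assumes each drafted token is accepted independently with probability α; with k drafted tokens per step, the expected number of tokens emitted per target forward pass is

$$
\mathbb{E}[\text{tokens per target pass}] = \frac{1 - \alpha^{k+1}}{1 - \alpha},
$$

so, for example, raising α from 0.6 to 0.9 at k = 4 increases this from roughly 2.3 to roughly 4.1 tokens per pass, which is why the acceptance-rate gains above translate into sizable end-to-end latency reductions.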

Implications and Future Directions

The implications of these findings are twofold. Practically, the methodology provides a cost-effective solution for improving LLM service latency, which has direct benefits for user satisfaction and system efficiency in real-world applications. Theoretically, this approach enriches the field of speculative inference by demonstrating the value of dynamic model adaptation to shifting data distributions in real-time.

Looking forward, this research opens pathways for further exploration into the synergy of online learning and inference optimization. For instance, future work might investigate the balance between computational resource usage and model performance enhancement, the scalability of the proposed method to larger LLMs or diverse application domains, and the integration of more sophisticated knowledge distillation techniques to further close the gap between draft and target models.

Overall, this paper contributes a significant innovation to the LLM inference landscape, highlighting the importance of adaptability and resource utilization in modern AI systems.

Authors (7)
  1. Xiaoxuan Liu
  2. Lanxiang Hu
  3. Peter Bailis
  4. Ion Stoica
  5. Zhijie Deng
  6. Alvin Cheung
  7. Hao Zhang