Decoding Speculative Decoding (2402.01528v3)

Published 2 Feb 2024 in cs.LG and cs.CL

Abstract: Speculative Decoding is a widely used technique to speed up inference for LLMs without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens. The speedup provided by speculative decoding heavily depends on the choice of the draft model. In this work, we perform a detailed study comprising over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding and delineate the factors that affect the performance gain it provides. Our experiments indicate that the performance of speculative decoding depends heavily on the latency of the draft model, and that a draft model's capability in language modeling does not correlate strongly with its performance in speculative decoding. Based on these insights, we explore a new design space for draft models and design hardware-efficient draft models for speculative decoding. Our newly designed draft model for LLaMA-65B can provide 111% higher throughput than existing draft models and generalizes further to the LLaMA-2 model family and supervised fine-tuned models.
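
To make the draft-then-verify loop described in the abstract concrete, here is a minimal sketch of greedy speculative decoding in PyTorch. This is not the paper's implementation: the function name `speculative_decode` is hypothetical, and `draft_model` / `target_model` are assumed to be callables mapping a (1, seq_len) token tensor to (1, seq_len, vocab) logits.

```python
import torch

def speculative_decode(target_model, draft_model, prefix, k=4, max_new_tokens=32):
    """Greedy speculative decoding sketch (illustrative, not the paper's code).

    The draft model proposes k tokens autoregressively; the target model then
    scores all k proposals in a single forward pass and keeps the longest
    prefix it agrees with, plus one token of its own.
    """
    tokens = prefix.clone()
    while tokens.shape[-1] - prefix.shape[-1] < max_new_tokens:
        # 1) Draft phase: cheaply propose k candidate tokens one at a time.
        draft = tokens.clone()
        for _ in range(k):
            logits = draft_model(draft)                       # (1, seq, vocab)
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)

        # 2) Verification phase: one target forward pass over all k proposals.
        target_logits = target_model(draft)
        # Target predictions for the k proposed positions plus one bonus token.
        target_preds = target_logits[:, tokens.shape[-1] - 1:].argmax(dim=-1)
        proposed = draft[:, tokens.shape[-1]:]

        # 3) Accept the longest prefix of proposals the target agrees with,
        #    then append the target's own token at the first disagreement
        #    (or its bonus token if every proposal was accepted).
        matches = (target_preds[:, :k] == proposed).int()[0]
        n_accept = int(matches.cumprod(dim=0).sum())
        tokens = torch.cat(
            [tokens, proposed[:, :n_accept], target_preds[:, n_accept:n_accept + 1]],
            dim=-1,
        )
    return tokens
```

The speedup comes from step 2 replacing up to k sequential target-model calls with a single forward pass, so the time spent in step 1 is pure overhead; this is why the paper finds the draft model's latency, rather than its language-modeling quality, to be the dominant factor in the end-to-end gain.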

Authors (3)
  1. Minghao Yan (8 papers)
  2. Saurabh Agarwal (19 papers)
  3. Shivaram Venkataraman (48 papers)
Citations (2)