Cascade Speculative Drafting for Even Faster LLM Inference (2312.11462v4)

Published 18 Dec 2023 in cs.LG and cs.CL

Abstract: Introduced to enhance the efficiency of LLM inference, speculative decoding operates by having a smaller model generate a draft. A larger target model then reviews the draft to check alignment with its own output, and each accepted draft token reduces the number of target model runs, ultimately improving efficiency. However, the drafting process in speculative decoding involves slow autoregressive generation and allocates equal time to every draft token regardless of its importance. These inefficiencies collectively contribute to the suboptimal performance of speculative decoding. To further improve LLM inference, we introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal Cascade optimizes time allocation in drafting for improved efficiency. Combining both cascades, CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments, while maintaining the same output distribution as the target model. Our code is publicly available at https://github.com/lfsszd/CS-Drafting.
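
As a rough illustration of the control flow the abstract describes, the Python sketch below simulates a cascade of drafters verified by a target model. It is not the authors' implementation (that lives at the GitHub link above): `ToyModel`, its `agreement` parameter, the greedy exact-match acceptance rule, and the draft-length schedule `[3, 2]` are all invented for illustration, and the recursive Vertical Cascade is only noted in a comment.

```python
"""
Toy sketch of the control flow behind cascade speculative drafting.
Not the authors' implementation (see https://github.com/lfsszd/CS-Drafting);
ToyModel, the exact-match acceptance rule, and the draft-length schedule
are illustrative assumptions only.
"""
import random

random.seed(0)


class ToyModel:
    """Stand-in for an LM over a 100-token toy vocabulary. `agreement` is the
    probability that its next-token guess matches the target model's choice."""

    def __init__(self, name: str, agreement: float):
        self.name = name
        self.agreement = agreement

    def next_token(self, context: list) -> int:
        # Placeholder for a real forward pass.
        correct = (len(context) * 7 + 3) % 50
        return correct if random.random() < self.agreement else random.randrange(50, 100)


def draft_block(drafters: list, lengths: list, context: list) -> list:
    """Horizontal cascade (simplified): early draft positions, which are the most
    likely to be accepted, come from the strongest drafter; later positions fall to
    progressively cheaper drafters. In the paper's Vertical Cascade, each neural
    drafter here would itself be accelerated by speculating against an even smaller
    drafter, bottoming out in a statistical LM with no neural autoregressive calls;
    that recursion is omitted from this toy version."""
    block = []
    for model, n in zip(drafters, lengths):
        for _ in range(n):
            block.append(model.next_token(context + block))
    return block


def verify(target: ToyModel, context: list, block: list) -> list:
    """Target reviews the draft left to right, keeping tokens it agrees with and
    replacing the first one it rejects. Greedy exact-match acceptance is used here;
    the real algorithm uses a rejection-sampling rule so the output distribution
    matches the target model's, and the target scores all draft positions in a
    single forward pass rather than this per-token loop."""
    accepted = []
    for tok in block:
        expected = target.next_token(context + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)  # correct the first mismatch, then stop
            break
    else:
        accepted.append(target.next_token(context + accepted))  # bonus token
    return accepted


if __name__ == "__main__":
    target = ToyModel("target", agreement=1.0)
    drafters = [ToyModel("mid-drafter", 0.9), ToyModel("tiny-drafter", 0.6)]
    context = [1, 2, 3]
    for _ in range(4):
        context += verify(target, context, draft_block(drafters, [3, 2], context))
    print(context)
```

Running the sketch appends a few accepted-or-corrected tokens per round; in the real algorithm the payoff is that most of those tokens cost only cheap drafter calls rather than full target-model forward passes.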

Authors (6)
  1. Ziyi Chen (37 papers)
  2. Xiaocong Yang (6 papers)
  3. Jiacheng Lin (22 papers)
  4. Chenkai Sun (11 papers)
  5. Jie Huang (155 papers)
  6. Kevin Chen-Chuan Chang (53 papers)
Citations (35)