Multi-Candidate Speculative Decoding (2401.06706v1)

Published 12 Jan 2024 in cs.CL

Abstract: LLMs have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments (sequences of tokens) with a fast draft model and then verifies them in parallel with the target model. However, the acceptance rate of candidate tokens is limited by several factors, such as the model, the dataset, and the decoding setup. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification that maintain the distribution of the target model. Our approach shows significant improvements in acceptance rates on multiple datasets and models, consistently outperforming standard speculative decoding.
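
To make the idea concrete, below is a minimal, illustrative Python sketch of the general workflow: a draft model proposes several candidate continuations, and each is checked with the standard speculative-sampling acceptance rule (accept a drafted token x with probability min(1, p_target(x) / p_draft(x))), keeping the longest accepted prefix across candidates. The names here (toy_model, draft_candidates, verify, VOCAB, GAMMA, K) are invented for illustration, the "models" are random toy distributions, and this simplification does not implement the paper's batched, distribution-preserving multi-candidate verification; it is only an assumption-laden sketch of the surrounding mechanics.

```python
# Illustrative sketch of multi-candidate speculative decoding (not the paper's algorithm).
import numpy as np

rng = np.random.default_rng(0)
VOCAB, GAMMA, K = 50, 4, 3   # vocab size, draft length, number of candidates


def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def toy_model(prefix, temperature):
    """Stand-in for an LM forward pass: next-token distribution given a prefix."""
    local = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    return softmax(local.normal(size=VOCAB) / temperature)


def draft_candidates(prefix):
    """Sample K candidate continuations of length GAMMA from the (fast) draft model."""
    candidates = []
    for _ in range(K):
        toks, probs, cur = [], [], list(prefix)
        for _ in range(GAMMA):
            q = toy_model(cur, temperature=1.5)   # draft distribution
            t = int(rng.choice(VOCAB, p=q))
            toks.append(t); probs.append(q[t]); cur.append(t)
        candidates.append((toks, probs))
    return candidates


def verify(prefix, candidates):
    """Check each candidate with the target model; keep the longest accepted prefix."""
    best = []
    for toks, q_probs in candidates:
        accepted, cur = [], list(prefix)
        for t, q_t in zip(toks, q_probs):
            p = toy_model(cur, temperature=1.0)   # target distribution
            # Standard speculative-sampling acceptance test.
            if rng.random() < min(1.0, p[t] / q_t):
                accepted.append(t); cur.append(t)
            else:
                break
        if len(accepted) > len(best):
            best = accepted
    return best


prefix = [1, 2, 3]
print("accepted tokens:", verify(prefix, draft_candidates(prefix)))
```

In a real implementation the K candidates would be verified in a single batched forward pass of the target model, and rejected positions would be resampled from a corrected residual distribution so the output still matches the target model exactly; those details are the core contribution of the paper and are omitted here.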

Authors (4)
  1. Sen Yang (191 papers)
  2. Shujian Huang (106 papers)
  3. Xinyu Dai (116 papers)
  4. Jiajun Chen (125 papers)
Citations (10)