Graph-Structured Speculative Decoding (2407.16207v1)

Published 23 Jul 2024 in cs.CL

Abstract: Speculative decoding has emerged as a promising technique to accelerate the inference of LLMs by employing a small LLM to draft a hypothesis sequence, which is then validated by the larger LLM. The effectiveness of this approach heavily relies on the balance between performance and efficiency of the draft model. In our research, we focus on enhancing the proportion of draft tokens that are accepted into the final output by generating multiple hypotheses instead of just one. This gives the LLM more options to choose from, letting it select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. This structure enables us to efficiently predict and merge recurring token sequences, vastly reducing the computational demands of the draft model. We term this approach Graph-structured Speculative Decoding (GSD). We apply GSD across a range of LLMs, including a 70-billion-parameter LLaMA-2 model, and observe a remarkable speedup of 1.73× to 1.96×, significantly surpassing standard speculative decoding.
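
The abstract outlines the core mechanism: the draft model proposes several hypotheses, their recurring token sequences are merged into a directed acyclic graph, and the target LLM then validates paths through that graph. The Python sketch below is a rough illustration of that idea only, not the paper's implementation: it merges shared prefixes of a few drafted hypotheses into a trie-like graph (a special case of the DAG merging described above) and greedily extracts the longest path that a toy verifier accepts. All names here (TokenNode, merge_hypotheses, longest_accepted_path, accept_fn) are invented for this sketch, and accept_fn is a stand-in for the target model's actual token-level verification.

```python
class TokenNode:
    """One node in the drafted-hypothesis graph: a token plus its successors."""
    def __init__(self, token):
        self.token = token
        self.successors = {}  # next token -> TokenNode

def merge_hypotheses(hypotheses):
    """Merge drafted hypotheses so that shared leading token sequences reuse
    the same nodes (a trie-like special case of the paper's DAG merging)."""
    root = TokenNode(token=None)
    for hyp in hypotheses:
        node = root
        for tok in hyp:
            node = node.successors.setdefault(tok, TokenNode(tok))
    return root

def longest_accepted_path(node, accept_fn, prefix=()):
    """Return the longest token path the verifier accepts.
    accept_fn is a toy stand-in for the target LLM's validation step."""
    best = list(prefix)
    for tok, child in node.successors.items():
        candidate = prefix + (tok,)
        if accept_fn(list(candidate)):
            extended = longest_accepted_path(child, accept_fn, candidate)
            if len(extended) > len(best):
                best = extended
    return best

if __name__ == "__main__":
    # Three drafted hypotheses sharing the prefix "the cat":
    drafts = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "cat", "sat", "down", "quietly"],
        ["the", "cat", "slept", "all", "day"],
    ]
    graph = merge_hypotheses(drafts)
    # Toy acceptance rule standing in for target-model verification.
    print(longest_accepted_path(graph, lambda seq: "slept" not in seq))
    # -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

In the method described by the abstract, merging goes beyond shared prefixes (which is what makes the structure a graph rather than a tree), and acceptance is decided by the target model's verification rather than a hand-written rule.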

Authors (8)
  1. Zhuocheng Gong (9 papers)
  2. Jiahao Liu (72 papers)
  3. Ziyue Wang (75 papers)
  4. Pengfei Wu (18 papers)
  5. Jingang Wang (71 papers)
  6. Xunliang Cai (63 papers)
  7. Dongyan Zhao (144 papers)
  8. Rui Yan (250 papers)
Citations (1)