
Speculative Streaming: Fast LLM Inference without Auxiliary Models (2402.11131v1)

Published 16 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Speculative decoding is a prominent technique to speed up the inference of a large target LLM based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.

Speculative Streaming: Enhancing LLM Inference Efficiency through Single-Model Speculative Decoding

Introduction

The computational demands of LLMs pose significant challenges for deployment, particularly in scenarios with stringent latency requirements. Traditional speculative decoding techniques, which use a two-model system comprising a draft and a target model to predict future tokens, offer a way to accelerate LLM inference but at the cost of increased complexity and resource requirements. This paper introduces Speculative Streaming, a novel single-model approach to speculative decoding that embeds the drafting and verification process within the target model itself, thereby removing the need for an auxiliary model and significantly reducing the parameter overhead.

Motivation

LLM inference is often memory-bound, which limits the effectiveness of conventional speculative decoding approaches that require separate draft and target models. These approaches not only add deployment complexity but are also often impractical on resource-constrained devices because of the additional memory the draft model consumes. Speculative Streaming is motivated by the goal of removing these limitations: it integrates speculative decoding capabilities directly into the target model, streamlining the decoding process and making it more resource-efficient.

Methodology

Speculative Streaming alters the fine-tuning objective of the LLM from next-token prediction to future n-gram prediction, using a mechanism the authors call multi-stream attention. This allows token verification and future-token speculation to happen within a single model pass. Notable adjustments and innovations include the following (a minimal sketch appears after the list):

  • Streams Design and Initialization: Utilizes multiple speculative streams that are initialized at a higher-level layer of the transformer model, each predicting future tokens based on different lookahead positions.
  • Parallel Speculation and Verification: By generating a speculative tree draft and verifying its tokens simultaneously, the method can significantly accelerate the decoding process.
  • Tree Draft Pruning: A novel layer is introduced to prune less probable paths from the speculative tree draft, thereby reducing computation overhead without sacrificing prediction quality.
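
To make the stream idea concrete, the following is a minimal, hypothetical PyTorch sketch of single-model drafting with speculative streams. The class name SpeculativeStreamingHead, the per-stream learned embeddings, the shared attention block, the stream initialization scheme, and the greedy accept_draft helper are illustrative assumptions rather than the authors' implementation; in particular, the paper's tree drafts and tree-pruning layer are omitted here.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: each of K speculative streams is a learned embedding
# appended to the activations of an upper transformer layer. A single forward
# pass then yields the next token plus K drafted future tokens.

class SpeculativeStreamingHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, num_streams: int = 4):
        super().__init__()
        self.num_streams = num_streams
        # One learnable "stream" embedding per lookahead position.
        self.stream_embeddings = nn.Parameter(torch.randn(num_streams, d_model) * 0.02)
        # A shared attention block lets streams read the prefix context.
        # (d_model must be divisible by num_heads in this toy setup.)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor):
        """hidden: (batch, seq_len, d_model) activations from an upper layer."""
        last = hidden[:, -1:, :]                              # (B, 1, D)
        # Initialise streams from the last position plus a per-stream offset.
        streams = last + self.stream_embeddings.unsqueeze(0)  # (B, K, D)
        # Streams attend over the prefix hidden states.
        streams, _ = self.attn(streams, hidden, hidden)
        next_logits = self.lm_head(last)                      # predicts token t+1
        draft_logits = self.lm_head(streams)                  # stream k predicts token t+1+k
        return next_logits, draft_logits


def accept_draft(target_tokens: torch.Tensor, draft_tokens: torch.Tensor) -> int:
    """Greedy verification: accept the longest draft prefix the target agrees with,
    as in standard speculative decoding."""
    accepted = 0
    for t, d in zip(target_tokens.tolist(), draft_tokens.tolist()):
        if t != d:
            break
        accepted += 1
    return accepted
```

In this sketch, one forward pass produces logits for the next token plus K drafted tokens; on the following pass, the model's own predictions verify the draft and the longest matching prefix is accepted, which is where the wall-clock speedup comes from without any auxiliary draft model.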

Experimental Results

The experiments demonstrate Speculative Streaming's effectiveness across a range of tasks, including Summarization, Structured Queries, and Meaning Representation. Compared with both standard two-model speculative decoding and recent Medusa-style architectures, Speculative Streaming achieves comparable or better decoding speed-ups (1.8 - 3.1X) without compromising generation quality. It also dramatically reduces parameter overhead, requiring roughly 10000X fewer additional parameters than its Medusa counterpart, which makes it well suited for deployment on resource-constrained devices.

Implications and Future Directions

Speculative Streaming represents a significant step forward in simplifying LLM inference deployment by integrating the speculation and verification process into a single, efficient model. This not only enhances the practicality of LLM deployment in latency-sensitive applications but also opens up new possibilities for further optimization and efficiency improvements in generative AI models. Future research could explore the integration of Speculative Streaming with other model optimization techniques, such as model quantization or pruning, to achieve even greater efficiency gains.

Conclusion

The introduction of Speculative Streaming offers a promising solution to the challenges of deploying large, autoregressive transformer models, particularly in latency-sensitive environments. By combining the processes of speculation and verification within a single model and significantly reducing parameter overhead, this method paves the way for more efficient and practical applications of LLMs across a wide range of domains.

Authors (6)
  1. Nikhil Bhendawade (5 papers)
  2. Irina Belousova (4 papers)
  3. Qichen Fu (11 papers)
  4. Henry Mason (7 papers)
  5. Mohammad Rastegari (57 papers)
  6. Mahyar Najibi (38 papers)
Citations (17)

Reddit

  1. [R] Faster LLM’s from Apple (20 points, 6 comments)
  2. [R] Faster LLM from Apple (18 points, 8 comments)
  3. [D] Faster LLMs from Apple (1 point, 1 comment)