
Recurrent Drafter for Fast Speculative Decoding in Large Language Models (2403.09919v5)

Published 14 Mar 2024 in cs.CL and cs.LG

Abstract: We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that achieves state-of-the-art speedup for LLM inference. The performance gains are driven by three key aspects: (1) leveraging a recurrent neural network (RNN) as the draft model, conditioned on the LLM's hidden states, (2) applying a dynamic tree attention algorithm over beam search results to eliminate duplicated prefixes in candidate sequences, and (3) training through knowledge distillation from the LLM. ReDrafter accelerates Vicuna inference on MT-Bench by up to 2.8x with a PyTorch implementation on Nvidia H100 GPUs. To demonstrate its practicality in real environments, we also validated its effectiveness for on-device applications by implementing the approach in MLX and benchmarking performance on Metal GPUs in Apple Silicon chips, achieving up to 2.3x speedup.

Recurrent Drafter for Fast Speculative Decoding in LLMs

Introduction

Recent advancements in LLMs have sparked interest in enhancing their efficiency, particularly during inference. Speculative decoding has emerged as a promising strategy to accelerate LLM inference by using smaller draft models to propose preliminary candidate tokens. This paper introduces the Recurrent Drafter (ReDrafter), an approach that builds on speculative decoding. Unlike existing approaches that require either additional draft models or complex dependency structures, ReDrafter employs a single, lightweight draft head with a recurrent dependency, enabling faster and more efficient speculative decoding.

Proposed Method

The core innovation of ReDrafter lies in its drafting strategy, which merges insights from recurrent neural network (RNN) language models with speculative decoding. The method reuses a single set of parameters for the draft head across draft steps, allowing it to predict multiple tokens in sequence while modeling the dependencies between them, thereby reducing the complexity traditionally associated with speculative decoding models.
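
To make the recurrence concrete, the sketch below shows one plausible form of such a draft head in PyTorch: a single GRU-style cell, reused at every draft step, that updates its state from the embedding of the previously drafted token. The cell type, sizes, and initialization here are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RecurrentDraftHead(nn.Module):
    """A single set of draft-head parameters, reused at every draft step."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        # Recurrent cell mixes the embedding of the previously drafted token
        # with the running draft state (assumption: GRU-style update).
        self.cell = nn.GRUCell(input_size=hidden_size, hidden_size=hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, prev_state: torch.Tensor, prev_token: torch.Tensor):
        # prev_state: (batch, hidden_size), seeded from the LLM's last hidden state
        # prev_token: (batch,), the token drafted (or accepted) at the previous step
        state = self.cell(self.token_emb(prev_token), prev_state)
        logits = self.lm_head(state)  # scores for the next draft token
        return logits, state
```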

Model Definition

ReDrafter adopts a single-model strategy, feeding the embeddings of previously drafted tokens back into the draft head as recurrent inputs. This approach not only simplifies the model but also enhances its predictive ability by conditioning each draft token on the sequence's context. In a notable departure from the Medusa framework, ReDrafter does not rely on a predefined, static tree attention structure; instead, it uses beam search to eliminate suboptimal candidate sequences early in the inference process.
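
Building on the sketch above, drafting then amounts to unrolling the head for a few steps, seeding its state from the LLM's hidden state at the current position. The greedy loop below is a simplified single-path illustration; ReDrafter instead runs beam search over this recurrence, as described in the next section, and the seeding shown here is an assumption.

```python
import torch

@torch.no_grad()
def draft_tokens(draft_head, last_hidden: torch.Tensor, last_token: torch.Tensor, num_steps: int = 4):
    """Greedy single-path drafting; ReDrafter uses beam search over the same recurrence."""
    state, token = last_hidden, last_token
    drafted = []
    for _ in range(num_steps):
        logits, state = draft_head(state, token)  # one shared head, applied recurrently
        token = logits.argmax(dim=-1)             # greedy pick for illustration only
        drafted.append(token)
    return torch.stack(drafted, dim=1)            # (batch, num_steps) candidate tokens
```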

Beam Search and Dynamic Tree Attention

Beam search plays a pivotal role in generating candidate tokens for verification. ReDrafter's strategy provides a direct and efficient way to identify promising candidate sequences, reducing the verification workload on the target model. Furthermore, the method introduces a dynamic tree attention mechanism that uses the beam search results to deduplicate shared prefixes among candidates, reducing computation and memory usage at runtime, a significant advancement over the static tree structures proposed in earlier models.
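
A simplified illustration of the prefix-sharing idea follows: beam search candidates are packed into a trie so that shared prefixes are stored only once, and a tree attention mask lets each packed token attend only to its ancestors. The function and data layout are illustrative assumptions, not the paper's implementation or kernels.

```python
import torch

def build_draft_tree(candidates: list[list[int]]):
    """Pack beam candidates into a deduplicated tree plus an ancestor-only attention mask."""
    nodes: list[int] = []                       # packed (deduplicated) draft tokens
    parents: list[int] = []                     # parent index of each node (-1 = root)
    children: dict[tuple[int, int], int] = {}   # (parent index, token) -> node index

    for seq in candidates:
        parent = -1
        for tok in seq:
            key = (parent, tok)
            if key not in children:             # shared prefixes are stored once
                children[key] = len(nodes)
                nodes.append(tok)
                parents.append(parent)
            parent = children[key]

    n = len(nodes)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:                          # each node attends to itself and its ancestors
            mask[i, j] = True
            j = parents[j]
    return torch.tensor(nodes), mask

# Example: three candidates sharing prefixes pack into 6 nodes instead of 9 positions.
tokens, mask = build_draft_tree([[5, 7, 9], [5, 7, 2], [5, 3, 1]])
```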

Experiments

The evaluation of ReDrafter focuses on training efficiency and inference performance, comparing the proposed method with existing speculative decoding approaches on established LLMs. The paper details an extensive comparison demonstrating ReDrafter's superior speed and lower parameter count without sacrificing prediction quality.
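
Since the abstract notes that the draft head is trained through knowledge distillation from the LLM, a minimal sketch of such an objective is shown below, assuming a KL divergence between the draft head's and the frozen target model's next-token distributions; the paper's exact loss, temperature, and multi-step handling may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(draft_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    # draft_logits, target_logits: (batch, vocab_size) for the same next-token position
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_target = F.softmax(target_logits.detach(), dim=-1)  # the target LLM stays frozen
    return F.kl_div(log_p_draft, p_target, reduction="batchmean")
```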

Training and Inference Performance

Assessments indicate that ReDrafter not only outperforms its speculative decoding counterparts in predictive accuracy but also does so with a substantially lower parameter count. Specifically, the model attains higher speed-up factors than existing methods, reaching up to 3.28x, illustrating its efficacy in reducing computational overhead during LLM inference.
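
As a rough, back-of-the-envelope way to relate acceptance behavior to wall-clock speed-up, one can divide the average number of tokens accepted per target-model step by the relative cost of that step. The numbers below are placeholders for illustration and are not measurements from the paper.

```python
# Back-of-the-envelope relation between acceptance and wall-clock speed-up.
# Placeholder numbers only; they are not measurements reported in the paper.
def estimated_speedup(avg_accepted_per_step: float, relative_step_overhead: float) -> float:
    # Baseline decoding yields 1 token per target forward pass at cost 1.
    # Speculative decoding yields several tokens per pass, at a cost inflated
    # by drafting and verification overhead.
    return avg_accepted_per_step / (1.0 + relative_step_overhead)

# e.g. ~3.5 accepted tokens per step with 25% per-step overhead -> ~2.8x
print(estimated_speedup(avg_accepted_per_step=3.5, relative_step_overhead=0.25))
```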

Discussion and Future Directions

The recurrent drafter represents a significant step forward in speculative decoding, marrying the simplicity of single-model designs with the efficiency of recurrent dependencies. Its ability to dynamically construct attention structures from beam search outcomes further distinguishes it from prior models, offering a more flexible and effective approach to speculative decoding.

While ReDrafter demonstrates considerable promise, the paper also acknowledges potential areas for future development, such as exploring more complex model structures and joint training mechanisms to further enhance performance.

Conclusion

The introduction of ReDrafter marks a noteworthy advancement in improving the efficiency of LLMs through speculative decoding. By combining the benefits of recurrent neural network draft models with beam search and dynamic tree attention techniques, this approach sets a new standard for speculative decoding, offering a pathway toward more efficient and effective use of LLMs in real-world applications.

References (22)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  2. Speculative streaming: Fast LLM inference without auxiliary models. arXiv preprint arXiv:2402.11131, 2024.
  3. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  4. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.
  5. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023a.
  6. Cascade speculative drafting for even faster LLM inference. arXiv preprint arXiv:2312.11462, 2023b.
  7. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  8. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2023.
  9. Break the sequential dependency of LLM inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.
  10. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  11. REST: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023.
  12. Truncation sampling as language model desmoothing. arXiv preprint arXiv:2210.15191, 2022.
  13. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.
  14. EAGLE: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024.
  15. Online speculative decoding. arXiv preprint arXiv:2310.07177, 2023.
  16. SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023.
  17. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234–239, 2012. doi: 10.1109/SLT.2012.6424228.
  18. ShareGPT, 2023. URL https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
  19. Accelerating LLM inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023.
  20. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  21. Inference with reference: Lossless acceleration of large language models. arXiv preprint arXiv:2304.04487, 2023.
  22. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
Authors (5)
  1. Aonan Zhang
  2. Chong Wang
  3. Yi Wang
  4. Xuanyu Zhang
  5. Yunfei Cheng