
Recurrent Drafter for Fast Speculative Decoding in Large Language Models (2403.09919v5)

Published 14 Mar 2024 in cs.CL and cs.LG

Abstract: We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that achieves state-of-the-art speedup for LLM inference. The performance gains are driven by three key aspects: (1) leveraging a recurrent neural network (RNN) as the draft model, conditioned on the LLM's hidden states, (2) applying a dynamic tree attention algorithm over beam search results to eliminate duplicated prefixes in candidate sequences, and (3) training through knowledge distillation from the LLM. ReDrafter accelerates Vicuna inference on MT-Bench by up to 2.8x with a PyTorch implementation on Nvidia H100 GPUs. To demonstrate its practicality in real environments, we also validated its effectiveness for on-device applications by implementing the approach in MLX and benchmarking performance on Metal GPUs in Apple Silicon chips, achieving up to 2.3x speedup.

Recurrent Drafter for Fast Speculative Decoding in LLMs

Introduction

Recent advancements in LLMs have sparked interest in enhancing their efficiency, particularly during inference. Speculative decoding has emerged as a promising strategy to accelerate LLM inference by using smaller draft models to propose preliminary candidate tokens. This paper introduces the Recurrent Drafter (ReDrafter), an approach that builds on speculative decoding. Unlike existing approaches that require either additional draft models or complex dependency structures, ReDrafter employs a single, lightweight draft head with a recurrent dependency, enabling faster and more efficient speculative decoding.

Proposed Method

The core innovation of ReDrafter lies in its drafting strategy, which merges insights from recurrent neural network (RNN) language models with speculative decoding. The method reuses a single set of parameters for the draft head across draft steps, allowing it to predict multiple tokens in sequence while modeling the dependencies between them, thereby reducing the complexity traditionally associated with speculative decoding models.
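
To make the recurrence concrete, the sketch below shows one plausible form of such a draft head in PyTorch: a single GRU-style cell, reused at every draft step, that updates its state from the embedding of the previously drafted token. The cell type, sizes, and initialization here are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RecurrentDraftHead(nn.Module):
    """A single set of draft-head parameters, reused at every draft step."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        # Recurrent cell mixes the embedding of the previously drafted token
        # with the running draft state (assumption: GRU-style update).
        self.cell = nn.GRUCell(input_size=hidden_size, hidden_size=hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, prev_state: torch.Tensor, prev_token: torch.Tensor):
        # prev_state: (batch, hidden_size), seeded from the LLM's last hidden state
        # prev_token: (batch,), the token drafted (or accepted) at the previous step
        state = self.cell(self.token_emb(prev_token), prev_state)
        logits = self.lm_head(state)  # scores for the next draft token
        return logits, state
```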

Model Definition

ReDrafter adopts a single-model strategy, feeding the embeddings of previously drafted tokens back into the draft head as recurrent inputs. This approach not only simplifies the model but also enhances its predictive ability by conditioning each draft token on the sequence's context. In a notable departure from the Medusa framework, ReDrafter does not rely on a predefined, static tree attention structure; instead, it uses beam search to eliminate suboptimal candidate sequences early in the inference process.
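
Building on the sketch above, drafting then amounts to unrolling the head for a few steps, seeding its state from the LLM's hidden state at the current position. The greedy loop below is a simplified single-path illustration; ReDrafter instead runs beam search over this recurrence, as described in the next section, and the seeding shown here is an assumption.

```python
import torch

@torch.no_grad()
def draft_tokens(draft_head, last_hidden: torch.Tensor, last_token: torch.Tensor, num_steps: int = 4):
    """Greedy single-path drafting; ReDrafter uses beam search over the same recurrence."""
    state, token = last_hidden, last_token
    drafted = []
    for _ in range(num_steps):
        logits, state = draft_head(state, token)  # one shared head, applied recurrently
        token = logits.argmax(dim=-1)             # greedy pick for illustration only
        drafted.append(token)
    return torch.stack(drafted, dim=1)            # (batch, num_steps) candidate tokens
```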

Beam Search and Dynamic Tree Attention

Beam search plays a pivotal role in generating candidate tokens for verification. ReDrafter's strategy provides a direct and efficient way to identify promising candidate sequences, reducing the verification workload on the target model. Furthermore, the method introduces a dynamic tree attention mechanism that uses the beam search results to deduplicate shared prefixes among candidates, reducing computation and memory usage at runtime, a significant advancement over the static tree structures proposed in earlier models.
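
A simplified illustration of the prefix-sharing idea follows: beam search candidates are packed into a trie so that shared prefixes are stored only once, and a tree attention mask lets each packed token attend only to its ancestors. The function and data layout are illustrative assumptions, not the paper's implementation or kernels.

```python
import torch

def build_draft_tree(candidates: list[list[int]]):
    """Pack beam candidates into a deduplicated tree plus an ancestor-only attention mask."""
    nodes: list[int] = []                       # packed (deduplicated) draft tokens
    parents: list[int] = []                     # parent index of each node (-1 = root)
    children: dict[tuple[int, int], int] = {}   # (parent index, token) -> node index

    for seq in candidates:
        parent = -1
        for tok in seq:
            key = (parent, tok)
            if key not in children:             # shared prefixes are stored once
                children[key] = len(nodes)
                nodes.append(tok)
                parents.append(parent)
            parent = children[key]

    n = len(nodes)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:                          # each node attends to itself and its ancestors
            mask[i, j] = True
            j = parents[j]
    return torch.tensor(nodes), mask

# Example: three candidates sharing prefixes pack into 6 nodes instead of 9 positions.
tokens, mask = build_draft_tree([[5, 7, 9], [5, 7, 2], [5, 3, 1]])
```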

Experiments

The evaluation of ReDrafter focuses on training efficiency and inference performance, comparing the proposed method with existing speculative decoding approaches on established LLMs. The paper details an extensive comparison demonstrating ReDrafter's superior speed and lower parameter count without sacrificing prediction quality.
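
Since the abstract notes that the draft head is trained through knowledge distillation from the LLM, a minimal sketch of such an objective is shown below, assuming a KL divergence between the draft head's and the frozen target model's next-token distributions; the paper's exact loss, temperature, and multi-step handling may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(draft_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    # draft_logits, target_logits: (batch, vocab_size) for the same next-token position
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_target = F.softmax(target_logits.detach(), dim=-1)  # the target LLM stays frozen
    return F.kl_div(log_p_draft, p_target, reduction="batchmean")
```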

Training and Inference Performance

Assessments indicate that ReDrafter not only outperforms its speculative decoding counterparts in predictive accuracy but also does so with a substantially lower parameter count. Specifically, the model attains higher speed-up factors than existing methods, reaching up to 3.28x, illustrating its efficacy in reducing computational overhead during LLM inference.
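
As a rough, back-of-the-envelope way to relate acceptance behavior to wall-clock speed-up, one can divide the average number of tokens accepted per target-model step by the relative cost of that step. The numbers below are placeholders for illustration and are not measurements from the paper.

```python
# Back-of-the-envelope relation between acceptance and wall-clock speed-up.
# Placeholder numbers only; they are not measurements reported in the paper.
def estimated_speedup(avg_accepted_per_step: float, relative_step_overhead: float) -> float:
    # Baseline decoding yields 1 token per target forward pass at cost 1.
    # Speculative decoding yields several tokens per pass, at a cost inflated
    # by drafting and verification overhead.
    return avg_accepted_per_step / (1.0 + relative_step_overhead)

# e.g. ~3.5 accepted tokens per step with 25% per-step overhead -> ~2.8x
print(estimated_speedup(avg_accepted_per_step=3.5, relative_step_overhead=0.25))
```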

Discussion and Future Directions

The recurrent drafter represents a significant step forward in speculative decoding, marrying the simplicity of single-model designs with the efficiency of recurrent dependencies. Its ability to dynamically construct attention structures from beam search outcomes further distinguishes it from prior models, offering a more flexible and effective approach to speculative decoding.

While ReDrafter demonstrates considerable promise, the paper also acknowledges potential areas for future development, such as exploring more complex model structures and joint training mechanisms to further enhance performance.

Conclusion

The introduction of ReDrafter marks a noteworthy advancement in improving the efficiency of LLMs through speculative decoding. By combining the benefits of recurrent neural network draft models with beam search and dynamic tree attention techniques, this approach sets a new standard for speculative decoding, offering a pathway toward more efficient and effective use of LLMs in real-world applications.

References (22)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  2. Speculative streaming: Fast LLM inference without auxiliary models. arXiv preprint arXiv:2402.11131, 2024.
  3. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  4. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.
  5. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023a.
  6. Cascade speculative drafting for even faster LLM inference. arXiv preprint arXiv:2312.11462, 2023b.
  7. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  8. AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2023.
  9. Break the sequential dependency of LLM inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.
  10. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  11. REST: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023.
  12. Truncation sampling as language model desmoothing. arXiv preprint arXiv:2210.15191, 2022.
  13. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.
  14. EAGLE: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077, 2024.
  15. Online speculative decoding. arXiv preprint arXiv:2310.07177, 2023.
  16. SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023.
  17. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234–239, 2012. doi: 10.1109/SLT.2012.6424228.
  18. ShareGPT, 2023. URL https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
  19. Accelerating LLM inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023.
  20. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  21. Inference with reference: Lossless acceleration of large language models. arXiv preprint arXiv:2304.04487, 2023.
  22. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
Authors (5)
  1. Aonan Zhang
  2. Chong Wang
  3. Yi Wang
  4. Xuanyu Zhang
  5. Yunfei Cheng