Recurrent Drafter for Fast Speculative Decoding in LLMs
Introduction
Recent advancements in LLMs have sparked interest in enhancing their efficiency, particularly during inference. Speculative decoding has emerged as a promising strategy to accelerate LLM inference: a smaller, cheaper drafter proposes preliminary candidate tokens that the target model then verifies in parallel. This paper introduces the Recurrent Drafter (ReDrafter), an approach that builds on the strengths of speculative decoding. Unlike existing methods that require either a separate draft model or added structural complexity, ReDrafter employs a single, lightweight draft head with a recurrent dependency between drafted tokens, enabling faster and more efficient speculative decoding.
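To make the draft-then-verify idea concrete, here is a minimal greedy-verification sketch of generic speculative decoding (not ReDrafter itself). The `draft_model` and `target_model` callables are hypothetical stand-ins that return next-token logits for every position, and batch size 1 is assumed in the acceptance count for clarity.

```python
# Minimal sketch of a generic draft-and-verify speculative decoding step.
# Assumptions: greedy decoding, batch size 1, and models that return logits
# of shape (batch, seq_len, vocab). Not the paper's exact interface.
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens, k=4):
    """Draft k tokens cheaply, then verify them with one target-model pass."""
    n = tokens.shape[1]
    draft = tokens
    for _ in range(k):                                   # cheap autoregressive drafting
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # One target pass scores positions n-1 .. n+k-1: its own greedy choices
    # for the k drafted slots plus one bonus token.
    target_choice = target_model(draft)[:, n - 1 :].argmax(-1)   # (batch, k+1)
    proposed = draft[:, n:]                                      # (batch, k)

    # Accept the longest prefix on which draft and target agree, then append
    # the target's token at the first disagreement (or the bonus token).
    agree = (proposed == target_choice[:, :k]).long().cumprod(dim=-1)
    n_accept = int(agree[0].sum())
    return torch.cat([tokens,
                      proposed[:, :n_accept],
                      target_choice[:, n_accept : n_accept + 1]], dim=-1)
```

Each step therefore costs one target-model forward pass but can advance the sequence by up to k+1 tokens, which is the source of the speed-up.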
Proposed Method
The core innovation of ReDrafter lies in its drafting strategy, which merges insights from recurrent neural network (RNN) language models with speculative decoding. The method uses a single set of parameters for the draft head, allowing it to predict multiple tokens in sequence while accounting for the dependencies between them, thereby reducing the complexity traditionally associated with speculative decoding systems.
Model Definition
ReDrafter adopts a single-model strategy, feeding the embeddings of previously drafted tokens back into the draft head as recurrent inputs. This not only simplifies the design but also improves predictive quality, since each drafted token is conditioned on the ones before it. In a notable departure from the Medusa framework, ReDrafter avoids constructing a predetermined, data-dependent attention structure, relying instead on beam search to eliminate suboptimal candidate sequences early in the inference process.
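A minimal sketch of such a recurrent draft head follows. The single `rnn` projection is reused at every draft step, and the embedding of the previously drafted token is fed back in together with the running state; the specific layer sizes, the SiLU activation, and the greedy rollout are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of a recurrent draft head in the spirit of ReDrafter: one parameter
# set shared across all draft positions, with the previous token's embedding
# as a recurrent input. Architectural details here are assumptions.
import torch
import torch.nn as nn

class RecurrentDraftHead(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.Linear(2 * hidden_size, hidden_size)  # shared across steps
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    @torch.no_grad()
    def draft(self, last_hidden, last_token, num_steps=4):
        """Greedily roll the head forward num_steps tokens.

        last_hidden: (batch, hidden) final hidden state from the target LLM.
        last_token:  (batch,) last verified token id.
        """
        state, token, drafted = last_hidden, last_token, []
        for _ in range(num_steps):
            # Recurrent dependency: condition on the previously drafted token.
            inp = torch.cat([state, self.embed(token)], dim=-1)
            state = self.act(self.rnn(inp))
            token = self.lm_head(state).argmax(-1)
            drafted.append(token)
        return torch.stack(drafted, dim=-1)          # (batch, num_steps)
```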
Beam Search and Dynamic Tree Attention
Beam search plays a pivotal role in generating candidate token sequences for verification. ReDrafter's strategy offers a direct and efficient way to identify promising candidates, reducing the verification workload on the target model. Furthermore, the model introduces a dynamic tree attention mechanism, an algorithmic enhancement that uses the beam search results to deduplicate shared prefixes among candidates, cutting computation and memory usage at run time; this is a significant advance over the static tree structures proposed in earlier models.
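The sketch below illustrates one way such a dynamic tree could be built from beam-search candidates: shared prefixes collapse into a trie so each unique drafted token is verified once, and an ancestor mask lets every node attend only to its own prefix. This follows the idea described above, but the exact construction is an assumption rather than the paper's implementation.

```python
# Illustrative construction of a "dynamic tree attention" structure from
# beam-search candidates. Shared prefixes are deduplicated into a trie and
# an ancestor mask is built for parallel verification. Details are assumed.
import torch

def build_dynamic_tree(candidates):
    """candidates: list of token-id lists produced by beam search."""
    nodes, parents = [], []      # flattened unique tokens and their parent indices
    index = {}                   # prefix tuple -> node index
    for seq in candidates:
        parent = -1              # -1 means "attends only to the verified context"
        for t in range(len(seq)):
            key = tuple(seq[: t + 1])
            if key not in index:         # deduplicate shared prefixes
                index[key] = len(nodes)
                nodes.append(seq[t])
                parents.append(parent)
            parent = index[key]

    # Ancestor mask: node i may attend to node j iff j lies on i's path to the root.
    n = len(nodes)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return torch.tensor(nodes), mask

# Example: two beams sharing the prefix [5, 7] collapse into 4 unique nodes.
tokens, mask = build_dynamic_tree([[5, 7, 2], [5, 7, 9]])
```

The flattened tokens and mask can then be handed to the target model for a single verification pass over all candidates, rather than one pass per beam.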
Experiments
The evaluation of ReDrafter focuses on its training efficiency and inference performance, measured against existing LLM baselines and speculative decoding approaches. The paper reports an extensive comparison between the proposed method and current strategies, demonstrating ReDrafter's higher speed and smaller parameter count without sacrificing prediction quality.
Training and Inference Performance
Assessments indicate that ReDrafter not only outperforms its speculative decoding counterparts in predictive accuracy but also achieves this with a substantially lower parameter count. Specifically, the model attains higher speed-up factors (up to 3.28 times) compared to existing methods, illustrating its efficacy in reducing computational overhead during LLM inference.
Discussion and Future Directions
The recurrent drafter represents a significant step forward in speculative decoding, marrying the simplicity of single-model designs with the efficiency of recurrent dependencies. Its ability to construct attention structures dynamically from beam search outcomes further distinguishes it from prior models, offering a more flexible and effective approach to speculative decoding.
While ReDrafter demonstrates considerable promise, the paper also acknowledges potential areas for future development, such as exploring more complex model structures and joint training mechanisms to further enhance performance.
Conclusion
The introduction of ReDrafter marks a noteworthy advancement in improving the efficiency of LLM inference through speculative decoding. By combining the recurrent-dependency idea of RNN language models with beam search and dynamic tree attention, the approach sets a new standard for speculative decoding and offers a pathway toward more efficient and effective use of LLMs in real-world applications.