SuffixDecoding: A Model-Free Approach to Speeding Up LLM Inference
The paper presents SuffixDecoding, a novel model-free approach to accelerating LLM inference through speculative decoding. Unlike traditional methods that rely on draft models or additional decoding heads, SuffixDecoding leverages efficient data structures built from previously generated outputs—specifically, suffix trees—to predict candidate sequences. By recognizing patterns in previously generated text, it constructs speculation trees, providing a principled, empirically grounded method for selecting which token sequences to propose for verification by the LLM.
Technical Approach
SuffixDecoding constructs and updates suffix trees over generated token sequences to model the likelihood of future continuations. The approach runs entirely on CPUs rather than GPUs, which is advantageous because CPU resources on LLM serving nodes are typically underutilized. The suffix trees store tokens from previously generated sequences, capturing shared prefixes in a compact structure. Each node in a suffix tree represents a token, and a traversal along these nodes represents a possible continuation during LLM inference.
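To make the data structure concrete, here is a minimal Python sketch of a per-token suffix trie built from generated sequences. The class names, the max_depth cap, and the per-node frequency counters are illustrative assumptions for this sketch, not the paper's implementation.

```python
class SuffixTreeNode:
    """One node per token; children map to observed continuations."""
    def __init__(self):
        self.children = {}   # token id -> SuffixTreeNode
        self.count = 0       # how often this continuation was observed

class SuffixTree:
    """Illustrative suffix tree over previously generated token sequences."""
    def __init__(self, max_depth=64):
        self.root = SuffixTreeNode()
        self.max_depth = max_depth   # depth cap to bound memory (assumed knob)

    def insert(self, tokens):
        # Add every suffix of the sequence, truncated to max_depth tokens,
        # so that any recent context can later be matched against the tree.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, SuffixTreeNode())
                node.count += 1

    def match(self, context):
        # Walk the tree along the given context; the reached node's subtree
        # describes all previously observed continuations of that context.
        node = self.root
        for tok in context:
            node = node.children.get(tok)
            if node is None:
                return None
        return node
```

Because insertion and matching are simple pointer walks over Python dictionaries, this structure can be maintained on the CPU without touching the GPU.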
Once a suffix tree is constructed, SuffixDecoding uses a greedy algorithm to expand a speculation tree, scoring candidate continuations with empirical frequency statistics and selecting the most promising ones, as sketched below. The resulting tree structure enables efficient speculation, with multiple candidate sequences verified in parallel by the LLM.
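Building on the sketch above, the snippet below illustrates one way a greedy, frequency-scored expansion could work. The budget parameter and the chained-frequency scoring rule are assumptions for illustration; the paper's exact scoring function may differ.

```python
import heapq

def build_speculation_tree(matched_node, budget=32):
    # Greedily expand the highest-scoring frontier candidates until the
    # speculation budget (number of proposed tokens) is exhausted.
    spec_tree = {(): None}   # path of tokens -> parent path
    frontier, order = [], 0  # max-heap via negated scores; order breaks ties
    for tok, child in matched_node.children.items():
        score = child.count / matched_node.count if matched_node.count else 0.0
        heapq.heappush(frontier, (-score, order, (tok,), child))
        order += 1
    while frontier and len(spec_tree) - 1 < budget:
        neg_score, _, path, node = heapq.heappop(frontier)
        spec_tree[path] = path[:-1]   # record the edge to its parent path
        for tok, child in node.children.items():
            # Score a deeper candidate by chaining empirical edge frequencies.
            child_score = -neg_score * (child.count / node.count)
            heapq.heappush(frontier, (-child_score, order, path + (tok,), child))
            order += 1
    return spec_tree
```

Given such a tree, the speculated paths can be batched into a single forward pass so the LLM verifies all candidates at once and accepts the longest path that matches its own outputs.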
Evaluation
Empirical evaluations demonstrate that SuffixDecoding achieves competitive performance with state-of-the-art model-based speculative decoding methods across multiple workloads, including open-domain chat, code generation, and text-to-SQL. The improvement is particularly notable in multi-agent pipeline applications: in a proprietary multi-LLM text-to-SQL application called AgenticSQL, SuffixDecoding achieves up to 2.9× higher throughput and up to 3× lower latency than existing speculative decoding methods.
Furthermore, on datasets such as Magicoder and WildChat, SuffixDecoding performs on par with tree-based speculative decoding techniques and in some cases surpasses them, without the overhead of a draft model. This is particularly significant given that SuffixDecoding achieves these results with only a few thousand examples in its reference corpus, underscoring its efficiency and practicality.
Implications and Future Work
The implications of SuffixDecoding are multifaceted. Practically, it offers a scalable, resource-efficient option for speeding up inference, which is particularly beneficial in environments where GPU resources are constrained or models are updated frequently. Theoretically, it paves the way for further exploration of model-free speculative inference techniques, potentially leading to algorithms that adapt even more dynamically to real-world workloads.
Future developments might focus on enhancing SuffixDecoding's pattern-matching and scoring mechanisms. While the method has been shown to adapt effectively to distributional shifts in its inputs, there remains room to optimize how speculative paths are prioritized and scored, perhaps by incorporating richer statistical models.
In conclusion, SuffixDecoding marks a significant advance in leveraging data-driven techniques for LLM inference, minimizing dependence on resource-intensive draft models while providing robust performance across a diverse array of tasks. Continued research in model-free approaches may yield further innovations, expanding the applicability and efficiency of LLMs in practical settings.