SuffixDecoding: A Model-Free Approach to Speeding Up LLM Inference
The paper presents SuffixDecoding, a novel model-free approach to accelerating LLM inference through speculative decoding. Unlike traditional methods that rely on draft models or additional decoding heads, SuffixDecoding leverages efficient data structures built from previously generated outputs—specifically, suffix trees—to predict candidate sequences. By recognizing patterns in previously generated text, it constructs speculation trees, providing a principled, empirically grounded method for selecting which token sequences to propose for verification by the LLM.
Technical Approach
SuffixDecoding constructs and updates suffix trees over generated token sequences to model the likelihood of future continuations. The approach runs entirely on CPUs rather than GPUs, which is advantageous because CPU resources on LLM serving nodes are typically underutilized. The suffix trees store tokens from previously generated sequences, capturing shared prefixes in a compact structure. Each node in a suffix tree represents a token, and a traversal along these nodes represents a possible continuation during LLM inference.
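To make the data structure concrete, here is a minimal Python sketch of a per-token suffix trie built from generated sequences. The class names, the max_depth cap, and the per-node frequency counters are illustrative assumptions for this sketch, not the paper's implementation.

```python
class SuffixTreeNode:
    """One node per token; children map to observed continuations."""
    def __init__(self):
        self.children = {}   # token id -> SuffixTreeNode
        self.count = 0       # how often this continuation was observed

class SuffixTree:
    """Illustrative suffix tree over previously generated token sequences."""
    def __init__(self, max_depth=64):
        self.root = SuffixTreeNode()
        self.max_depth = max_depth   # depth cap to bound memory (assumed knob)

    def insert(self, tokens):
        # Add every suffix of the sequence, truncated to max_depth tokens,
        # so that any recent context can later be matched against the tree.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, SuffixTreeNode())
                node.count += 1

    def match(self, context):
        # Walk the tree along the given context; the reached node's subtree
        # describes all previously observed continuations of that context.
        node = self.root
        for tok in context:
            node = node.children.get(tok)
            if node is None:
                return None
        return node
```

Because insertion and matching are simple pointer walks over Python dictionaries, this structure can be maintained on the CPU without touching the GPU.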
Once a suffix tree is constructed, SuffixDecoding uses a greedy algorithm to expand a speculation tree, scoring candidate continuations with empirical frequency statistics and selecting the most promising ones, as sketched below. The resulting tree structure enables efficient speculation, with multiple candidate sequences verified in parallel by the LLM.
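Building on the sketch above, the snippet below illustrates one way a greedy, frequency-scored expansion could work. The budget parameter and the chained-frequency scoring rule are assumptions for illustration; the paper's exact scoring function may differ.

```python
import heapq

def build_speculation_tree(matched_node, budget=32):
    # Greedily expand the highest-scoring frontier candidates until the
    # speculation budget (number of proposed tokens) is exhausted.
    spec_tree = {(): None}   # path of tokens -> parent path
    frontier, order = [], 0  # max-heap via negated scores; order breaks ties
    for tok, child in matched_node.children.items():
        score = child.count / matched_node.count if matched_node.count else 0.0
        heapq.heappush(frontier, (-score, order, (tok,), child))
        order += 1
    while frontier and len(spec_tree) - 1 < budget:
        neg_score, _, path, node = heapq.heappop(frontier)
        spec_tree[path] = path[:-1]   # record the edge to its parent path
        for tok, child in node.children.items():
            # Score a deeper candidate by chaining empirical edge frequencies.
            child_score = -neg_score * (child.count / node.count)
            heapq.heappush(frontier, (-child_score, order, path + (tok,), child))
            order += 1
    return spec_tree
```

Given such a tree, the speculated paths can be batched into a single forward pass so the LLM verifies all candidates at once and accepts the longest path that matches its own outputs.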
Evaluation
Empirical evaluations demonstrate that SuffixDecoding achieves competitive performance with state-of-the-art model-based speculative decoding methods across multiple workloads, including open-domain chat, code generation, and text-to-SQL. The improvement is particularly notable in multi-agent pipeline applications: in a proprietary multi-LLM text-to-SQL application called AgenticSQL, SuffixDecoding achieves up to 2.9× higher throughput and up to 3× lower latency than existing speculative decoding methods.
Furthermore, on datasets such as Magicoder and WildChat, SuffixDecoding performs on par with tree-based speculative decoding techniques and in some cases surpasses them, without the overhead of a draft model. This is particularly significant given that SuffixDecoding achieves these results with only a few thousand examples in its reference corpus, underscoring its efficiency and practicality.
Implications and Future Work
The implications of SuffixDecoding are multifaceted. Practically, it offers a scalable, resource-efficient option for speeding up inference, which is particularly beneficial in environments where GPU resources are constrained or models are updated frequently. Theoretically, it paves the way for further exploration of model-free speculative inference techniques, potentially leading to algorithms that adapt even more dynamically to real-world workloads.
Future developments might focus on enhancing SuffixDecoding's pattern-matching and scoring mechanisms. While the method has been shown to adapt effectively to distributional shifts in its inputs, there remains room to optimize how speculative paths are prioritized and scored, perhaps by incorporating richer statistical models.
In conclusion, SuffixDecoding marks a significant advance in leveraging data-driven techniques for LLM inference, minimizing dependence on resource-intensive draft models while providing robust performance across a diverse array of tasks. Continued research in model-free approaches may yield further innovations, expanding the applicability and efficiency of LLMs in practical settings.