Accelerating Large Language Model Decoding with Speculative Sampling

Published 2 Feb 2023 in cs.CL | (2302.01318v1)

Abstract: We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter LLM, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.


Authors (6)

Citations (278)


Summary

The paper presents speculative sampling, in which a faster draft model proposes several tokens that the target model then scores in a single call, so that each target call can yield multiple tokens and decoding latency drops.
It employs a modified rejection sampling scheme that preserves the target model's output distribution while still accepting draft tokens wherever the two models agree.
Benchmark evaluations on the 70B Chinchilla model reveal a 2 to 2.5 times speedup in distributed settings without quality loss.

Accelerating LLM Decoding with Speculative Sampling

The paper "Accelerating Large Language Model Decoding with Speculative Sampling" presents an approach to speeding up decoding in large transformer models, targeting the latency that arises with very large parameter counts, such as Chinchilla, a 70-billion parameter LLM. The proposed technique, speculative sampling (SpS), pairs the primary target model with a faster but less powerful draft model.

The core innovation of SpS lies in its ability to generate multiple tokens per call to the target model. A draft model produces a short sequence of candidate tokens, which the target model then scores in parallel. A critical component of the method is a modified rejection sampling scheme, which ensures that the resulting distribution matches that of the target model. Each draft token is accepted with a probability that depends on the ratio of its probability under the target model to its probability under the draft model. At the first rejection, the remaining draft tokens are discarded and a replacement token is drawn from a corrected distribution, so the accepted tokens always form a prefix of the draft.
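The accept-or-resample step can be sketched in a few lines. This is a minimal illustration for a single token position, assuming the target probabilities `p` and draft probabilities `q` are already available as NumPy arrays; the function names `accept_or_resample` and `sample_via_draft` are hypothetical and not taken from the paper. The draft token is accepted with probability min(1, p/q), and on rejection a token is drawn from the normalized positive part of p - q.

```python
import numpy as np

def accept_or_resample(p, q, draft_token, rng):
    """One step of a modified rejection scheme (illustrative sketch).

    p: target-model probabilities over the vocabulary (1-D, sums to 1)
    q: draft-model probabilities over the vocabulary (1-D, sums to 1)
    draft_token: index that was sampled from q
    Returns (token, accepted).
    """
    # Accept the draft token with probability min(1, p/q).
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the normalized positive part of (p - q).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

def sample_via_draft(p, q, rng):
    """Draw from q, then correct the draw so the result follows p."""
    draft_token = int(rng.choice(len(q), p=q))
    token, _ = accept_or_resample(p, q, draft_token, rng)
    return token

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = np.array([0.6, 0.3, 0.1])  # toy target distribution (assumed values)
    q = np.array([0.2, 0.5, 0.3])  # toy draft distribution (assumed values)
    draws = [sample_via_draft(p, q, rng) for _ in range(20000)]
    # Empirical frequencies should be close to p, not q.
    print(np.bincount(draws, minlength=3) / len(draws))
```

In a full decoding loop this step would run over the K draft positions in order, stopping at the first rejection. The sketch covers only one position to keep the distribution-preserving property easy to check.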

The authors focus on LLMs for which transformer sampling is typically bound by memory bandwidth. The paper notes that traditional auto-regressive sampling, which generates one token at a time, is inefficient in this regime because every step must read the full set of weights and incurs the communication overhead of model parallelism. Speculative sampling mitigates these costs because the draft and target models tend to agree on easy-to-predict tokens or token sequences, so several draft tokens can often be accepted from a single parallel call to the target model.
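A back-of-envelope calculation shows why one-token-at-a-time decoding is bandwidth-bound: every decoding step has to stream all of the weights from memory at least once. The sketch below computes that lower bound on per-token latency. The function name and the hardware figures (2 bytes per parameter, 16 devices, 1.2 TB/s per device) are illustrative assumptions, and the estimate ignores the KV cache, activations, and inter-device communication.

```python
def min_latency_per_token(n_params, bytes_per_param, n_devices, bandwidth_bytes_per_s):
    """Lower bound on seconds per token when each step reads every weight once."""
    total_bytes = n_params * bytes_per_param
    aggregate_bandwidth = n_devices * bandwidth_bytes_per_s
    return total_bytes / aggregate_bandwidth

if __name__ == "__main__":
    # Assumed setup: 70B parameters in 16-bit precision, sharded over 16 devices.
    seconds = min_latency_per_token(
        n_params=70e9,
        bytes_per_param=2,
        n_devices=16,
        bandwidth_bytes_per_s=1.2e12,
    )
    print(f"bandwidth-bound floor: {seconds * 1e3:.2f} ms per token")
```

Because the bound is per forward pass rather than per token, a method that emits several tokens per pass can beat it on a per-token basis.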

The practical advantages of speculative sampling are demonstrated through benchmarking with the Chinchilla model. In a distributed setting, SpS achieves a speedup of approximately 2 to 2.5 times over standard auto-regressive sampling, without compromising sample quality. Notably, this speedup occasionally surpasses the theoretical memory bandwidth ceiling for auto-regressive sampling. This is possible because that ceiling assumes one token per pass over the weights, whereas SpS can emit several tokens per pass.
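A simple cost model gives a feel for where a speedup of this size can come from. The sketch below is not from the paper: it assumes each draft token is accepted independently with a fixed probability `alpha`, that scoring the whole draft costs about one target call, and it uses made-up per-token latencies. Under those assumptions a loop with K draft tokens emits between 1 and K+1 tokens, with expectation 1 + alpha + ... + alpha^K.

```python
def expected_tokens_per_loop(alpha, k):
    """Expected tokens emitted per target call with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    Each loop emits the accepted prefix plus one extra token (a resampled
    token after a rejection, or a bonus token if every draft was accepted).
    """
    return sum(alpha ** i for i in range(k + 1))

def estimated_speedup(alpha, k, t_draft, t_target):
    """Speedup over one-token-per-call decoding under the toy cost model."""
    time_per_loop = k * t_draft + t_target      # k draft steps + one scoring call
    tokens_per_loop = expected_tokens_per_loop(alpha, k)
    return tokens_per_loop * t_target / time_per_loop

if __name__ == "__main__":
    # Illustrative values only: 80% acceptance, 4 draft tokens,
    # 1.8 ms per draft token, 14.1 ms per target call.
    print(expected_tokens_per_loop(0.8, 4))
    print(estimated_speedup(0.8, 4, t_draft=1.8, t_target=14.1))
```

The model also shows the trade-off in choosing K: longer drafts raise the expected tokens per loop, but each extra draft step adds latency and is less likely to be accepted.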

The theoretical argument for SpS is complemented by empirical validation on two tasks, XSum and HumanEval, using different decoding strategies, including greedy and nucleus sampling. These experiments show that speculative sampling substantially reduces latency while keeping benchmark scores on par with, or slightly above, those of traditional sampling. This suggests that SpS is robust to the choice of task and decoding strategy, which broadens its applicability.
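One way to combine SpS with a decoding strategy such as nucleus or greedy sampling is to apply the same transform to both the draft and the target probabilities before the acceptance test. The sketch below shows such transforms on a plain probability vector; the function names `top_p_filter` and `greedy_filter` are hypothetical, and the tie-breaking and cutoff conventions are assumptions rather than details from the paper.

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    order = np.argsort(probs)[::-1]               # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    # Index of the first token at which the cumulative mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()              # renormalize the kept mass

def greedy_filter(probs):
    """Greedy decoding as a distribution: all mass on the most likely token."""
    one_hot = np.zeros_like(probs)
    one_hot[int(np.argmax(probs))] = 1.0
    return one_hot

if __name__ == "__main__":
    p = np.array([0.5, 0.3, 0.15, 0.05])  # toy target distribution (assumed)
    q = np.array([0.4, 0.4, 0.1, 0.1])    # toy draft distribution (assumed)
    # Both models pass through the same transform before accept/reject.
    print(top_p_filter(p, 0.7), top_p_filter(q, 0.7))
    print(greedy_filter(p), greedy_filter(q))
```

With the greedy transform both distributions become one-hot, so a draft token is accepted exactly when the two models pick the same most likely token.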

The paper's main contribution is a scalable way to reduce the latency of serving LLMs, a major concern for applications that need fast responses. Because SpS does not alter the architecture or parameters of the target model, it can be adopted in existing deployments without retraining. It can also be combined with other optimization techniques such as quantization and multi-query attention, making it a versatile addition to the set of inference optimization strategies.

Looking forward, this work motivates further research into efficient draft models, for example through training methods or architectures that make the draft agree more often with a given target model. As LLMs continue to grow in both parameter count and range of applications, techniques like speculative sampling are likely to play an increasingly important role in keeping them operationally viable for real-world use.

Overall, the paper addresses the cost of large-model inference by making better use of compute that is already available during decoding, which enables more efficient deployment in latency-sensitive scenarios. The positive results encourage continued research into extensions and adaptations of the technique in other settings.
