Accelerating Large Language Model Decoding with Speculative Sampling (2302.01318v1)

Published 2 Feb 2023 in cs.CL

Abstract: We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter LLM, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.

Authors (6)
  1. Charlie Chen (10 papers)
  2. Sebastian Borgeaud (19 papers)
  3. Geoffrey Irving (31 papers)
  4. Jean-Baptiste Lespiau (17 papers)
  5. Laurent Sifre (21 papers)
  6. John Jumper (2 papers)
Citations (278)

Summary

Accelerating LLM Decoding with Speculative Sampling

The paper "Accelerating LLM Decoding with Speculative Sampling" presents a novel approach to improving the efficiency of large-scale transformer model decoding, specifically targeting the latency issues that arise when dealing with models of significant parameter magnitude, such as Chinchilla, a 70-billion parameter LLM. The proposed technique, dubbed speculative sampling (SpS), is designed to enhance the generation of sequences by leveraging a more computationally efficient draft model alongside the primary target model.

The core innovation of SpS lies in generating multiple tokens per call to the target transformer. A draft model first produces a short sequence of candidate tokens, which the more powerful target model then scores in a single parallel pass. A critical component of this method is a modified rejection sampling scheme that preserves the target model's distribution exactly: each drafted token is accepted with probability equal to the ratio of its target-model probability to its draft-model probability (capped at one), and when a token is rejected, a replacement is drawn from the normalized residual between the two distributions.
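The sketch below illustrates this draft-then-verify loop with the modified rejection step. It is a minimal illustration, not the paper's implementation: `draft_model` and `target_model` are assumed to expose a `probs(prefix)` method returning a normalized next-token distribution, and the target-model scores are written as a list comprehension for clarity, whereas a real system computes all K+1 distributions in one parallel forward pass.

```python
import numpy as np

def sample(p):
    # Draw a token index from a normalized probability vector.
    return int(np.random.choice(len(p), p=p))

def speculative_step(prefix, draft_model, target_model, K=4):
    # 1. Draft model proposes K tokens auto-regressively (cheap).
    draft_tokens, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(K):
        p = draft_model.probs(ctx)
        t = sample(p)
        draft_tokens.append(t)
        draft_probs.append(p)
        ctx.append(t)

    # 2. Target model scores all K+1 positions (in practice, one parallel call).
    #    target_probs[i] is the target distribution after prefix + draft_tokens[:i].
    target_probs = [target_model.probs(list(prefix) + draft_tokens[:i])
                    for i in range(K + 1)]

    # 3. Modified rejection sampling: accept token t with prob min(1, q(t)/p(t)).
    accepted = []
    for i, t in enumerate(draft_tokens):
        q, p = target_probs[i], draft_probs[i]
        if np.random.rand() < min(1.0, q[t] / p[t]):
            accepted.append(t)
        else:
            # Rejected: resample from the residual (q - p)+, normalized.
            # This correction keeps the output distribution exactly equal
            # to the target model's.
            residual = np.maximum(q - p, 0.0)
            residual /= residual.sum()
            accepted.append(sample(residual))
            return accepted
    # All K drafts accepted: draw one extra token from the final target distribution,
    # so each target-model call yields between 1 and K+1 tokens.
    accepted.append(sample(target_probs[K]))
    return accepted
```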

In deploying this approach, the authors focus on the regime where transformer sampling is constrained by memory bandwidth. Traditional auto-regressive sampling, which generates one token per forward pass, is inefficient here: every step must stream the full set of model weights through memory and incurs the communication overhead of model parallelism, regardless of how little computation a single token requires. Speculative sampling mitigates these costs by exploiting the fact that the draft and target models frequently agree on easy-to-predict tokens or token sequences, so several tokens can be accepted per target-model call.
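A back-of-envelope calculation makes the bandwidth argument concrete. The numbers below are illustrative assumptions (bfloat16 weights, an assumed aggregate HBM bandwidth), not figures reported in the paper; the point is only that the per-token latency floor is set by weight traffic, not arithmetic.

```python
# Memory-bandwidth floor on auto-regressive decoding latency (illustrative).
params = 70e9                 # Chinchilla-scale parameter count
bytes_per_param = 2           # bfloat16 weights
aggregate_bw = 16 * 2.0e12    # assumed: 16 accelerators at ~2 TB/s each

weight_bytes = params * bytes_per_param           # ~140 GB streamed per token
latency_floor_s = weight_bytes / aggregate_bw     # independent of per-token FLOPs
print(f"~{latency_floor_s * 1e3:.1f} ms per token lower bound")  # ~4.4 ms
```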

The practical advantages of speculative sampling are demonstrated through rigorous benchmarking with the Chinchilla model. In a distributed setting, SpS achieves a speedup of approximately 2-2.5x compared to conventional auto-regressive decoding, without compromising sample quality. Notably, this speedup occasionally surpasses the theoretical memory bandwidth ceiling for auto-regressive sampling, a testament to the efficiency gains realized through parallel scoring.

The theoretical underpinning of SpS is complemented by empirical validation on the XSum and HumanEval tasks, using both greedy and nucleus sampling. These experiments show that speculative sampling delivers a substantial reduction in latency while maintaining, and in some cases slightly improving, benchmark scores relative to standard auto-regressive decoding. This indicates that the SpS methodology is robust across decoding strategies, broadening its potential applicability.

The contribution of this paper is particularly impactful in that it provides a scalable solution for reducing the latency of serving LLMs, a major concern for applications requiring rapid response times. By not altering the architecture or parameters of the target model, SpS allows for backward compatibility and ease of adoption in existing deployments. Furthermore, this approach is integrable with other optimization techniques such as quantization and multi-query attention, making it a versatile addition to the repertoire of model optimization strategies.

Looking forward, the implications of this work suggest further exploration into the development of efficient draft models, potentially optimizing them through novel training paradigms or architectures that better capture the dynamics of token prediction in combination with large-scale models. As LLMs continue to grow both in parameter count and application domain complexity, techniques like speculative sampling will play an increasingly crucial role in ensuring they are not only powerful but also operationally viable for extensive real-world use.

This paper exemplifies a strategic approach to addressing the computational burdens of large model inference by innovatively harnessing existing computational mechanics, paving the way for more efficient deployment in latency-sensitive scenarios. The positive outcomes of SpS encourage continued research, potentially yielding further enhancements and adaptations of this technique within varying AI disciplines.
