Speculative Decoding for Accelerated Transformer Inference
The paper "Fast Inference from Transformers via Speculative Decoding" introduces an algorithm named speculative decoding aimed at enhancing the efficiency of inference from large autoregressive models, specifically Transformers, without necessitating any modifications to the model architecture or retraining. This method leverages the principle of speculative execution, a common technique in processor optimization, to sample multiple tokens in parallel, thus potentially generating several tokens concurrently.
Key Contributions
- Generalization of Speculative Execution to Stochastic Settings: The authors present speculative sampling—a novel method designed to facilitate speculative execution in stochastic environments. This approach guarantees that the output distribution remains identical to that of the target model.
- Speculative Decoding Algorithm: This algorithm accelerates inference by utilizing smaller, more efficient approximation models to generate speculative token suggestions. These suggestions are validated or rejected by running the larger target model in parallel.
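The following minimal sketch illustrates the acceptance rule at the heart of speculative sampling; the function name and the toy NumPy interface are illustrative, not the paper's code. A draft token x sampled from the approximation distribution q is kept with probability min(1, p(x)/q(x)); otherwise a replacement is drawn from the normalized residual max(0, p − q). This rule is what guarantees that the samples follow the target distribution p exactly.

```python
import numpy as np

def speculative_sample_step(p, q, rng=None):
    """One accept/reject step of speculative sampling (illustrative sketch).

    p: target-model next-token distribution (1-D array summing to 1)
    q: approximation-model next-token distribution over the same vocabulary
    Returns (token, was_draft_accepted).
    """
    if rng is None:
        rng = np.random.default_rng()
    x = rng.choice(len(q), p=q)                # token drafted by the approximation model
    if rng.random() < min(1.0, p[x] / q[x]):   # keep it with probability min(1, p(x)/q(x))
        return int(x), True
    residual = np.maximum(p - q, 0.0)          # otherwise resample from norm(max(0, p - q))
    return int(rng.choice(len(residual), p=residual / residual.sum())), False
```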
Methodology
The core methodology relies on the observation that language modeling tasks often contain easier subtasks that a smaller model can approximate well. The speculative decoding algorithm involves three primary steps:
- Approximation Model Generation: An efficient approximation model drafts several candidate token completions autoregressively.
- Parallel Target Model Validation: The target model evaluates all drafted positions in a single parallel run, accepting each token with a probability that reflects how well the approximation matches its own distribution.
- Adjusted Sampling: At the first rejected position, the target model samples a replacement token from an adjusted (residual) distribution, which keeps the overall output distribution exactly that of the target model.
This method guarantees that the number of serial runs of the target model never exceeds the number required by conventional autoregressive decoding, since every run yields at least one new token. The validation step runs in parallel across all drafted positions, trading spare compute capacity for a significant reduction in inference walltime.
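A rough sketch of how these three steps compose into a decoding loop is shown below. The target_probs and draft_probs callables are hypothetical stand-ins for the two models (each returning next-token distributions); a real implementation would score all drafted positions with one batched forward pass of the target model, which is what step 2 assumes.

```python
import numpy as np

def speculative_decode(target_probs, draft_probs, prefix, gamma=4, max_len=64, rng=None):
    """Sketch of a speculative decoding loop (model interfaces are hypothetical).

    target_probs(tokens) -> one next-token distribution per position of `tokens`,
                            produced by a single forward pass of the large target model
    draft_probs(tokens)  -> next-token distribution from the small approximation model
    """
    rng = rng if rng is not None else np.random.default_rng()
    seq = list(prefix)
    while len(seq) < max_len:
        # 1. The approximation model drafts gamma tokens autoregressively (cheap, serial).
        drafts, q_dists = [], []
        for _ in range(gamma):
            q = draft_probs(seq + drafts)
            q_dists.append(q)
            drafts.append(int(rng.choice(len(q), p=q)))

        # 2. One parallel target-model run scores all drafted positions at once.
        p_dists = target_probs(seq + drafts)[len(seq) - 1:]   # gamma + 1 distributions

        # 3. Accept drafts left to right; at the first rejection, resample and discard the rest.
        all_accepted = True
        for x, q, p in zip(drafts, q_dists, p_dists):
            if rng.random() < min(1.0, p[x] / q[x]):
                seq.append(x)                                  # draft accepted
            else:
                residual = np.maximum(p - q, 0.0)              # adjusted distribution max(0, p - q)
                seq.append(int(rng.choice(len(residual), p=residual / residual.sum())))
                all_accepted = False
                break
        if all_accepted:
            # Every draft accepted: the same target run yields one extra token for free.
            seq.append(int(rng.choice(len(p_dists[-1]), p=p_dists[-1])))
    return seq
```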
Empirical Results
The authors demonstrate the efficacy of speculative decoding on various models and tasks:
- T5-XXL Model: Using T5-small (77M parameters) as the approximation model for T5-XXL (11B), the method achieved 2.6X to 3.4X speedups on tasks including English-to-German translation and text summarization, with no change to the output distribution.
- Other Transformer Models: The approach was also validated on GPT-like and LaMDA models, showing consistent improvements in inference speed.
Analysis and Theoretical Implications
The paper provides a thorough analysis of the expected speedup and the increase in arithmetic operations:
- Acceptance Rate (α): The rate at which the target model accepts the drafted tokens is crucial in determining the effectiveness of the method. Higher α values mean fewer rejected tokens and therefore greater speedup.
- Cost Coefficient (c): The ratio of the walltime cost of a single run of the approximation model to that of the target model influences the overall walltime improvement. A lower c supports more efficient speculative execution.
The authors derive formulas for the expected speedup and the increase in arithmetic operations as functions of α, c, and the number of drafted tokens γ, and their empirical measurements closely match these predictions.
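Concretely, one run of the target model produces an expected $(1 - \alpha^{\gamma+1})/(1 - \alpha)$ tokens, and each iteration costs roughly $\gamma c + 1$ target-model time units, so the paper's expected walltime improvement factor is

$$
\mathbb{E}[\text{improvement}] \;=\; \frac{1 - \alpha^{\gamma+1}}{(1 - \alpha)\,(\gamma c + 1)}.
$$

For instance, plugging in α = 0.8, c = 0.05, and γ = 5 (illustrative values, not the paper's measured ones) gives an expected improvement of roughly 3X.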
Implications and Future Directions
Practically, speculative decoding offers a straightforward way to accelerate inference for LLMs using existing architectures and training procedures. The method's simplicity and the guaranteed preservation of output distributions make it suitable for production environments where computational resources are abundant, but memory bandwidth is the primary bottleneck.
Theoretically, the method opens several avenues for future research:
- Adaptive Model Architectures: Investigating custom-trained or dynamically adaptive approximation models could yield further improvements.
- Hierarchical Speculative Decoding: Exploring multi-tier speculative decoding with nested approximation models.
- Application to Other Domains: Extending speculative decoding to other domains such as image generation or reinforcement learning could validate its broader applicability.
Conclusion
Speculative decoding marks a significant advance in inference efficiency for large autoregressive models. By leveraging smaller models to draft token completions and validating them with the target model in parallel, the method achieves substantial speedups while preserving the output distribution exactly. This work sets the stage for further exploration and optimization of speculative execution techniques, with significant implications for the practical deployment of large-scale neural models.