Parallel Speculative Sampling: Accelerating Inference in LLMs
The paper "PaSS: Parallel Speculative Sampling" addresses a significant challenge in deploying LLMs: the inference time associated with generating textual outputs. LLMs, post-Transformer development, have achieved remarkable success across a spectrum of natural language processing tasks, owing largely to their substantial parameter sizes. However, this scale induces demanding memory and computational costs, primarily due to the auto-regressive nature of generation which necessitates serial model calls for each token produced. This work introduces an innovative approach termed Parallel Speculative Sampling (PaSS), aiming to alleviate these computational burdens by optimizing the generative process.
Key Contributions
- Speculative Sampling Framework: Traditional speculative sampling uses a small, secondary draft model to propose a short sequence of tokens, which the large target model then checks in a single parallel pass, accepting or rejecting drafts so that the output distribution matches the target model's (see the first sketch after this list). This accelerates inference, but it imposes the overhead of training, hosting, and keeping a second model aligned with the first.
- Parallel Speculative Sampling (PaSS): PaSS removes the second model entirely. A single model drafts multiple tokens concurrently by appending learned "look-ahead embeddings" to the input, so drafting happens inside the primary model's own forward pass (see the second sketch after this list). This avoids the memory overhead of a dual-model setup and integrates with pre-trained LLMs without architectural modifications.
- Minimal Parameter Expansion: The look-ahead embeddings add only O(d_emb) parameters per look-ahead token, a negligible footprint compared with the separate draft model required by classical speculative sampling.
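The accept/reject step common to speculative methods can be made concrete with a short sketch. This is a minimal, illustrative implementation of the standard rejection scheme, not code from the paper; names such as `verify_draft`, `draft_probs`, and `target_probs` are assumptions.

```python
# Minimal sketch of the accept/reject step used in speculative sampling.
# Accepting each drafted token with probability min(1, p/q), and resampling
# from the residual max(0, p - q) on rejection, preserves the target
# model's output distribution.
import torch

def verify_draft(draft_tokens, draft_probs, target_probs):
    """draft_tokens: list of k drafted token ids (from the draft model)
    draft_probs:  (k, V) draft-model distributions at each drafted position
    target_probs: (k+1, V) target-model distributions from one parallel pass
    Returns the accepted tokens (plus one resampled or bonus token)."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i, tok]   # target probability of the drafted token
        q = draft_probs[i, tok]    # draft probability of the same token
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(tok)   # token accepted, keep checking the next one
        else:
            # Rejection: resample from the renormalised residual max(0, p - q),
            # then stop consuming drafted tokens.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            return accepted
    # All drafts accepted: sample one extra token "for free" from the target model.
    accepted.append(torch.multinomial(target_probs[len(draft_tokens)], 1).item())
    return accepted
```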
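The look-ahead drafting idea can likewise be sketched. The block below is a hedged illustration under stated assumptions: a HuggingFace-style backbone that accepts `inputs_embeds` and returns logits, and greedy drafting. `LookAheadDrafter`, its argument names, and the initialization are illustrative choices, not the authors' implementation; the point it makes concrete is that the only new weights are the k look-ahead embedding rows.

```python
# Hedged sketch of PaSS-style drafting with look-ahead embeddings: the backbone
# LLM is unchanged, and the only new parameters are k embedding rows of size
# d_emb, i.e. O(d_emb) per look-ahead token.
import torch
import torch.nn as nn

class LookAheadDrafter(nn.Module):
    def __init__(self, model, d_emb, k=4):
        super().__init__()
        self.model = model                                   # pre-trained LLM backbone
        self.k = k
        # k learned look-ahead embeddings -> k * d_emb extra parameters in total
        self.look_ahead = nn.Parameter(torch.randn(k, d_emb) * 0.02)

    @torch.no_grad()
    def draft(self, prefix_embeds):
        """prefix_embeds: (1, t, d_emb) embeddings of the current prefix.
        Returns k+1 drafted token ids from a single forward pass."""
        la = self.look_ahead.unsqueeze(0)                    # (1, k, d_emb)
        out = self.model(inputs_embeds=torch.cat([prefix_embeds, la], dim=1))
        logits = out.logits if hasattr(out, "logits") else out
        # The last prefix position and the k look-ahead positions propose the
        # next k+1 tokens in parallel.
        return logits[:, -(self.k + 1):].argmax(dim=-1)      # greedy draft
```

The drafted tokens would then be verified by a second forward pass of the same model, using the same accept/reject rule as in the first sketch, so output quality is preserved.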
Numerical and Empirical Results
PaSS delivers speed-ups of up to 30% over standard auto-regressive generation. Empirical evaluations on text and code completion, using English Wikipedia and The Stack, show these gains without degrading output quality; on the HumanEval benchmark, for instance, performance remains comparable to auto-regressive decoding, indicating that PaSS preserves generative fidelity while improving computational efficiency.
Implications and Future Directions
The proposed PaSS framework holds considerable implications for both theoretical exploration and practical applications in NLP. By enabling faster generation with substantially reduced computational resources, it offers a viable pathway to scaling LLMs further, potentially supporting real-time applications and facilitating broader accessibility.
Looking ahead, the authors suggest refining the look-ahead embeddings so that parallel drafts capture linguistic patterns more accurately and more drafted tokens are accepted per pass. They also note that bridging non-auto-regressive generation and speculative sampling could open new avenues for efficient model deployment, particularly in resource-constrained environments.
Conclusion
"PaSS: Parallel Speculative Sampling" contributes a significant advancement in the methodology of efficient text generation, addressing both the computational bottleneck and memory constraints associated with large-scale LLMs. Through its novel approach, it paves the way for more sustainable and expansive utilization of generative models across diverse applications, underpinning the growing demand for accelerated and scalable AI solutions.