
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion (2408.05636v2)

Published 10 Aug 2024 in cs.CL and cs.LG

Abstract: Speculative decoding has emerged as a widely adopted method to accelerate LLM inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling parallel sequence verification, its efficiency remains inherently limited by the reliance on incremental token generation in existing draft models. To overcome this limitation, this paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences. This allows parallelization of both the drafting and verification steps, providing significant speed-ups to the inference process. Our proposed approach, Speculative Diffusion Decoding (SpecDiff), is validated on standard language generation benchmarks and empirically demonstrated to provide up to an 8.7x speed-up over standard generation processes and up to a 2.5x speed-up over existing speculative decoding approaches.

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

The paper entitled "Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion" by Christopher, Bartoldson, Kailkhura, and Fioretto investigates an innovative method to enhance the efficiency of LLM inference. This method, termed Speculative Diffusion Decoding (SpecDiff), leverages the strengths of discrete diffusion models to facilitate parallelization of draft and verification steps, achieving substantial speed-ups in language generation tasks.

Introduction and Motivation

As autoregressive LLMs (e.g., transformer-based models) have been scaled to ever larger parameter counts and compute budgets, their performance across natural language processing tasks has improved significantly. However, the computational cost of inference with these models, especially when serving millions of users, presents a considerable challenge. Existing methods to reduce these costs, such as sparsity, quantization, and distillation, tend to introduce trade-offs in model performance. Speculative decoding has emerged as an alternative that maintains output quality while improving efficiency.

Current speculative decoding methods involve a smaller, fast drafting model generating token sequences that are subsequently verified by the larger target model. However, these methods remain limited by their reliance on sequential token generation. The paper addresses this limitation by proposing SpecDiff, which integrates discrete diffusion models to enable parallelization of both drafting and verification steps.

Methodology

Speculative decoding relies on two models (one full decoding round is sketched after this list):

  • Drafting Model (M_q): A smaller, more efficient autoregressive model that generates approximations of the target model’s output distribution.
  • Target Model (M_p): The larger, more computationally intensive model whose output quality is the benchmark.
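
To make the division of labor concrete, here is a minimal skeleton of one speculative decoding round. The `draft_fn`, `target_model`, and `verify_fn` interfaces are illustrative placeholders (elaborated in the sketches below), not the paper's implementation:

```python
def speculative_round(ids, draft_fn, target_model, verify_fn, gamma):
    """One round of speculative decoding (illustrative skeleton).

    ids:          the sequence generated so far (a list of token ids).
    draft_fn:     proposes `gamma` tokens plus their logits (M_q's role).
    target_model: scores the prefix and all draft positions in one
                  parallel forward pass (M_p's role).
    verify_fn:    applies the acceptance criterion, returning the tokens
                  to keep from this round.
    """
    draft_tokens, draft_logits = draft_fn(ids, gamma)
    # A single target forward pass scores every draft position at once.
    target_logits = target_model(ids + draft_tokens)
    kept = verify_fn(target_logits, draft_logits, draft_tokens)
    return ids + kept
```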

SpecDiff's innovation is to use a discrete diffusion model as the drafting model. Discrete diffusion models generate entire sequences in parallel, allowing γ (the length of the token sequence generated by the drafter) to be increased without significant computational overhead. The core advantage of SpecDiff lies in decoupling draft generation from sequential dependencies, enabling much longer draft sequences to be generated efficiently.
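
To illustrate why drafting cost no longer scales with γ, the following is a minimal sketch of absorbing-state discrete diffusion sampling, the family that SEDD-Absorbing belongs to. The `denoiser` interface, the `MASK_ID` constant, and the confidence-based unmasking schedule are illustrative assumptions rather than the SEDD API; the key point is that every denoising step predicts all γ positions at once, so the cost scales with the number of steps T, not with γ:

```python
import torch

MASK_ID = 50257  # hypothetical id of the absorbing "mask" token


def diffusion_draft(denoiser, prefix_ids, gamma, T):
    """Draft `gamma` tokens in parallel with an absorbing-state discrete
    diffusion drafter run for T denoising steps (illustrative sketch).

    `denoiser(ids)` is assumed to return per-position logits of shape
    (len(ids), vocab_size); this interface is an assumption, not SEDD's.
    """
    # The draft region starts fully masked (the absorbing state).
    ids = torch.cat([prefix_ids, torch.full((gamma,), MASK_ID)])
    start = len(prefix_ids)

    for _ in range(T):
        logits = denoiser(ids)                  # one parallel pass, all positions
        probs = torch.softmax(logits[start:], dim=-1)
        conf, pred = probs.max(dim=-1)
        masked = ids[start:] == MASK_ID
        if not masked.any():
            break
        # Unmask the most confident still-masked positions this step
        # (a simple schedule for illustration; real samplers differ).
        conf = conf.masked_fill(~masked, -1.0)
        k = min(max(1, gamma // T), int(masked.sum()))
        fill = conf.topk(k).indices
        ids[start + fill] = pred[fill]

    # Fill any positions still masked with their final predictions,
    # and keep the drafter's logits for the verification step.
    logits = denoiser(ids)
    remaining = ids[start:] == MASK_ID
    ids[start:][remaining] = logits[start:].argmax(dim=-1)[remaining]
    return ids[start:], logits[start:]
```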

Each decoding round of SpecDiff generates draft tokens and their logits with the discrete diffusion drafter, then scores all drafted positions with the target model in a single parallel forward pass. Each drafted token is subject to an acceptance criterion: the relationship between the drafter's and the target's distributions at that position determines whether the token is accepted or rejected, as sketched below.
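
The acceptance criterion here is the standard speculative sampling accept/reject rule (Leviathan et al., 2023), which preserves the target model's output distribution. A minimal sketch, assuming the target's parallel pass supplies logits for the γ draft positions plus one extra position:

```python
import torch


def verify_draft(target_logits, draft_logits, draft_tokens):
    """Standard speculative-sampling acceptance rule (sketch).

    target_logits: (gamma + 1, vocab) logits from M_p's parallel pass
                   (one extra row for the position after the draft).
    draft_logits:  (gamma, vocab) logits from the drafter.
    Returns the accepted tokens plus one corrective or bonus token.
    """
    p = torch.softmax(target_logits, dim=-1)  # target distributions
    q = torch.softmax(draft_logits, dim=-1)   # drafter distributions
    out = []
    for i, x in enumerate(draft_tokens):
        # Accept token x with probability min(1, p(x) / q(x)).
        if torch.rand(()) < torch.clamp(p[i, x] / q[i, x], max=1.0):
            out.append(int(x))
        else:
            # First rejection: resample from the residual distribution
            # max(0, p - q), renormalized, and stop.
            residual = torch.clamp(p[i] - q[i], min=0.0)
            out.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return out
    # All gamma tokens accepted: take a bonus token from the target.
    out.append(int(torch.multinomial(p[-1], 1)))
    return out
```

Because rejected positions fall back to a sample from the residual distribution, the combined procedure matches sampling directly from the target model; the speed-up comes entirely from how many draft tokens survive verification per round.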

Experimental Results

The efficacy of SpecDiff was validated on standard language generation benchmarks: text summarization on the CNN/DM dataset and open-ended text generation with models fine-tuned on the OpenWebText dataset. The experimental setup used GPT-2 XL and GPT-Neo as target models, with the smaller SEDD-Absorbing Small serving as the drafter.

Key findings from the experiments include:

  • Speed-Up: SpecDiff demonstrated up to an 8.7x speed-up over standard generation processes and up to a 2.5x improvement over existing speculative decoding methods.
  • Token Acceptance Rate: Despite a lower per-token acceptance rate α, the larger value of γ facilitated by the diffusion drafter resulted in a higher number of accepted tokens per draft sequence (see the worked expression after this list).
  • Hyperparameter Sensitivity: SpecDiff's performance was robust to changes in γ, with sensitivity primarily to the number of diffusion steps T.
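
As a rough illustration of the α-versus-γ trade-off, the speculative decoding literature often models acceptance as independent per token with rate α, which gives an expected number of accepted tokens per draft of

```latex
\mathbb{E}[\#\text{accepted}] \;=\; \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```

Under this simplified model (an illustration, not the paper's analysis), the expectation saturates at 1/(1 − α) as γ grows: for α = 0.8 it is about 3.4 at γ = 4 but about 5.0 at γ = 20. An autoregressive drafter pays a serial cost linear in γ and is therefore typically run with small γ, well below saturation; a diffusion drafter's cost is roughly flat in γ, so SpecDiff can push γ far enough that even a lower α yields more accepted tokens per verification pass.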

Discussion

The key contribution of SpecDiff is its ability to leverage the parallel generation capabilities of discrete diffusion models to significantly accelerate the language generation process without compromising on output quality. The proposed approach not only expedites inference times but also sets a new benchmark for speed in language completion tasks.

From a theoretical perspective, the research illustrates a novel integration of diffusion-based and autoregressive methods, providing insights into how advancements in one domain can enhance performance in another. Practically, this innovation has the potential to reduce the computational burden associated with deploying LLMs at scale, making it more feasible to utilize advanced LLMs in real-time applications.

Future Work and Limitations

The authors acknowledge that the current implementation of SpecDiff is limited by the tokenization schemes of the models used. Expanding this approach to larger models and exploring the integration of hierarchical speculative decoding methods could provide further speed enhancements. Additionally, the evaluation highlighted that SpecDiff's benefits are more pronounced in longer generation tasks, suggesting a need for further optimization for shorter sequences.

Further research could also investigate the impact of hot-starting the drafter model with logits from rejected tokens, as this has shown promise in other diffusion model applications. Comparing SpecDiff with recent advancements in tree-based speculative decoding approaches could also yield valuable insights into optimizing parallelism in drafting processes.

Conclusion

The paper presents a novel approach that integrates discrete diffusion models into speculative decoding to enhance the efficiency of language generation tasks significantly. The empirical results affirm that SpecDiff achieves impressive speed-ups while maintaining high output quality, positioning this method as a viable solution to the computational challenges inherent in large-scale LLM inference.

References

For an in-depth examination of the methodologies, experiments, and discussions, readers are encouraged to refer to the original paper and the associated references within its bibliography.

Authors
  1. Jacob K. Christopher
  2. Brian R. Bartoldson
  3. Bhavya Kailkhura
  4. Ferdinando Fioretto