BASS: Batched Attention-optimized Speculative Sampling (2404.15778v2)

Published 24 Apr 2024 in cs.LG and cs.CL

Abstract: Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting LLMs. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.

Authors (9)
  1. Haifeng Qian
  2. Sujan Kumar Gonugondla
  3. Sungsoo Ha
  4. Mingyue Shang
  5. Sanjay Krishna Gouda
  6. Ramesh Nallapati
  7. Sudipta Sengupta
  8. Xiaofei Ma
  9. Anoop Deoras

Summary

  • The paper presents a batched method for speculative sampling that processes multiple sequences simultaneously, overcoming single-sequence limitations.
  • It leverages dynamic draft lengths and customized CUDA kernels to efficiently manage ragged tensors and optimize memory usage.
  • Experimental results on models such as OPT 13B and CodeGen-Mono 16B demonstrate up to a 2.15x speed-up over optimized regular decoding and more than a threefold improvement in peak GPU utilization.

Batched Attention-optimized Speculative Sampling: Innovations in Multi-Sequence Generation for LLMs

Introduction

The evolution of LLM inference methods remains a critical area of research, particularly as models scale to billions of parameters. Batched Attention-optimized Speculative Sampling (BASS) is a significant enhancement over existing methods: it addresses the restriction of speculative decoding to single-sequence batches by extending it to handle multiple sequences simultaneously. BASS demonstrates significant improvements in latency, GPU utilization, and generation quality under tight time budgets, optimizing speculative decoding across multiple dimensions.

Key Challenges and Methodology

Existing Limitations

Traditional speculative decoding techniques are largely limited to single-sequence processing, curbing the potential to exploit parallelism in modern GPU hardware. This limitation is particularly impactful in scenarios where multiple sequence outputs are required simultaneously, as is common in practical AI applications where latency and throughput are critical.

BASS Overview

BASS extends speculative decoding beyond these limitations by combining batched processing with a novel approach to computing attention across variable-length sequences within a batch. The core technique drafts speculative tokens for all sequences in parallel and dynamically adjusts the draft based on how many drafted tokens each sequence accepts, which significantly enhances throughput.
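
To make the draft-then-verify structure concrete, the following is a minimal, framework-free sketch of one batched speculative step. It is an illustration under simplifying assumptions rather than the paper's implementation: the toy draft_step and target_check functions stand in for the draft and target models, and a simplified greedy-match acceptance rule replaces the probabilistic acceptance test used in speculative sampling.

```python
import numpy as np

VOCAB = 4
rng = np.random.default_rng(0)

def draft_step(seqs, k):
    """Toy draft model: propose k draft tokens for every sequence in the batch."""
    return [list(rng.integers(0, VOCAB, size=k)) for _ in seqs]

def target_check(seqs, drafts):
    """Toy target model: return its own token at every draft position plus one
    extra 'bonus' position (a real system scores all positions in a single
    batched forward pass of the large model)."""
    return [list(rng.integers(0, VOCAB, size=len(d) + 1)) for d in drafts]

def speculative_step(seqs, draft_len):
    """One batched draft-then-verify step: each sequence keeps the longest
    prefix of its draft that the target model agrees with, then appends one
    token chosen by the target model."""
    drafts = draft_step(seqs, draft_len)
    checks = target_check(seqs, drafts)
    accepted = []
    for seq, draft, chk in zip(seqs, drafts, checks):
        n = 0
        while n < len(draft) and draft[n] == chk[n]:  # simplified acceptance
            n += 1
        seq.extend(draft[:n])   # accepted prefix
        seq.append(chk[n])      # target token at the first mismatch (or bonus)
        accepted.append(n)
    return accepted

batch = [[1, 2, 3] for _ in range(4)]          # a batch of four prompts
print(speculative_step(batch, draft_len=4))    # accepted draft tokens per sequence
print([len(s) for s in batch])                 # per-sequence lengths, generally unequal
```

Because each sequence generally accepts a different number of draft tokens, sequence lengths diverge after every step; this is the ragged-tensor situation that the customized attention kernels described next are designed to handle.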

Technical Innovations

  • CUDA Kernels and Ragged Tensors: Handling of ragged tensors via customized CUDA kernels, facilitating efficient memory management and parallel computation.
  • Dynamic Draft Length: Algorithmic innovation to dynamically adjust the draft token length during inference, enhancing flexibility and adaptability (an illustrative heuristic is sketched below).
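
The paper's exact draft-length controller is not reproduced here; as a hedged illustration of the idea, the sketch below adjusts the draft length each step from the number of recently accepted tokens, using a simple additive rule with assumed bounds min_len and max_len.

```python
def adjust_draft_len(draft_len, accepted, min_len=1, max_len=10):
    """Illustrative controller (not the paper's exact rule): lengthen the draft
    when recent drafts were fully accepted, shorten it when most drafted tokens
    were rejected by the target model."""
    if accepted >= draft_len:          # full acceptance: draft more next time
        return min(draft_len + 1, max_len)
    if accepted <= draft_len // 2:     # poor acceptance: waste less draft work
        return max(draft_len - 1, min_len)
    return draft_len

# Example trace, feeding back the average accepted count after each step.
draft_len = 4
for avg_accepted in [4, 5, 2, 1, 3]:
    draft_len = adjust_draft_len(draft_len, avg_accepted)
    print(draft_len)   # 5, 6, 5, 4, 4
```

Longer drafts amortize the target model's verification cost when acceptance is high but waste draft-model compute when acceptance is low, which is why the draft length is tuned online rather than fixed.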

Experimental Setup and Results

Models and Metrics

The system was evaluated on models such as OPT 13B and CodeGen-Mono 16B, using HumanEval pass@k for code generation and ROUGE for summarization tasks.
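
For reference, the sketch below implements the standard unbiased pass@k estimator from the HumanEval benchmark (Chen et al., 2021); note that the paper additionally reports batch-level Pass@First and Pass@All rates, which this snippet does not compute.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    one of k samples, drawn without replacement from n generations of which c
    are correct, passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 generations per problem, 3 of which pass the tests.
print(pass_at_k(n=8, c=3, k=1))   # 0.375
```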

Findings

With BASS, notable improvements were observed:

  • Latency and Throughput: Up to a 2.15x speed-up over optimized regular decoding; for a 7.8B-parameter model on a single A100 GPU with a batch size of 8, each sequence is generated at an average of 5.8 ms per token, for an overall throughput of about 1.1K tokens per second.
  • GPU Utilization: Peak GPU utilization during decoding reaches 15.8%, more than 3x that of regular decoding and roughly 10x that of single-sequence speculative decoding.
  • Quality within a Time Budget: Within a time budget in which regular decoding does not finish, BASS generates sequences with HumanEval Pass@First of 43% and Pass@All of 61%.

These improvements translate into better performance for applications such as coding assistants and conversational agents, which can return multiple candidate responses without the extended wait times typical of earlier methods.

Implications and Future Directions

Practical Implications

BASS allows real-world AI systems, especially those requiring real-time interaction and generation of multiple responses, to function more efficiently. This capability can transform user experiences across interfaces where rapid responses are essential.

Theoretical Contributions

This research advances the understanding of how speculative decoding can be combined with batched processing to overcome core bottlenecks in LLM inference, notably memory bandwidth and latency.

Future Research

Further work on reducing disparities in GPU utilization across the different phases of model inference could yield even faster and more efficient systems. Adapting BASS to a wider range of model architectures and sizes, and further optimizing the CUDA kernels, could also extend the benefits observed here.

In conclusion, Batched Attention-optimized Speculative Sampling sets a new benchmark for latency and GPU utilization in multi-sequence LLM generation, offering pathways to both theoretical and practical advances in LLM inference.