BASS: Batched Attention-optimized Speculative Sampling (2404.15778v2)

Published 24 Apr 2024 in cs.LG and cs.CL

Abstract: Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting LLMs. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.

Authors (9)
  1. Haifeng Qian
  2. Sujan Kumar Gonugondla
  3. Sungsoo Ha
  4. Mingyue Shang
  5. Sanjay Krishna Gouda
  6. Ramesh Nallapati
  7. Sudipta Sengupta
  8. Xiaofei Ma
  9. Anoop Deoras

Summary

  • The paper presents a batched method for speculative sampling that processes multiple sequences simultaneously, overcoming single-sequence limitations.
  • It leverages dynamic draft lengths and customized CUDA kernels to efficiently manage ragged tensors and optimize memory usage.
  • Experimental results on models such as OPT 13B and CodeGen-Mono 16B demonstrate up to a 2.15x speed-up over optimized regular decoding and more than a threefold improvement in peak GPU utilization.

Batched Attention-optimized Speculative Sampling: Innovations in Multi-Sequence Generation for LLMs

Introduction

The evolution of LLM inference methods remains a critical area of research, particularly as models scale to billions of parameters. Batched Attention-optimized Speculative Sampling (BASS) is a significant enhancement over existing methods: it addresses the restriction of speculative decoding to single-sequence batches by extending it to handle multiple sequences simultaneously. BASS demonstrates significant improvements in latency, GPU utilization, and generation quality under tight time budgets, optimizing speculative decoding across multiple dimensions.

Key Challenges and Methodology

Existing Limitations

Traditional speculative decoding techniques are largely limited to single-sequence processing, curbing the potential to exploit parallelism in modern GPU hardware. This limitation is particularly impactful in scenarios where multiple sequence outputs are required simultaneously, as is common in practical AI applications where latency and throughput are critical.

BASS Overview

BASS extends speculative decoding beyond these limitations by combining batched processing with a novel approach to computing attention across variable-length sequences within a batch. The core technique drafts speculative tokens for all sequences in parallel and dynamically adjusts the draft based on how many drafted tokens each sequence accepts, which significantly enhances throughput.
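
To make the draft-then-verify structure concrete, the following is a minimal, framework-free sketch of one batched speculative step. It is an illustration under simplifying assumptions rather than the paper's implementation: the toy draft_step and target_check functions stand in for the draft and target models, and a simplified greedy-match acceptance rule replaces the probabilistic acceptance test used in speculative sampling.

```python
import numpy as np

VOCAB = 4
rng = np.random.default_rng(0)

def draft_step(seqs, k):
    """Toy draft model: propose k draft tokens for every sequence in the batch."""
    return [list(rng.integers(0, VOCAB, size=k)) for _ in seqs]

def target_check(seqs, drafts):
    """Toy target model: return its own token at every draft position plus one
    extra 'bonus' position (a real system scores all positions in a single
    batched forward pass of the large model)."""
    return [list(rng.integers(0, VOCAB, size=len(d) + 1)) for d in drafts]

def speculative_step(seqs, draft_len):
    """One batched draft-then-verify step: each sequence keeps the longest
    prefix of its draft that the target model agrees with, then appends one
    token chosen by the target model."""
    drafts = draft_step(seqs, draft_len)
    checks = target_check(seqs, drafts)
    accepted = []
    for seq, draft, chk in zip(seqs, drafts, checks):
        n = 0
        while n < len(draft) and draft[n] == chk[n]:  # simplified acceptance
            n += 1
        seq.extend(draft[:n])   # accepted prefix
        seq.append(chk[n])      # target token at the first mismatch (or bonus)
        accepted.append(n)
    return accepted

batch = [[1, 2, 3] for _ in range(4)]          # a batch of four prompts
print(speculative_step(batch, draft_len=4))    # accepted draft tokens per sequence
print([len(s) for s in batch])                 # per-sequence lengths, generally unequal
```

Because each sequence generally accepts a different number of draft tokens, sequence lengths diverge after every step; this is the ragged-tensor situation that the customized attention kernels described next are designed to handle.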

Technical Innovations

  • CUDA Kernels and Ragged Tensors: Handling of ragged tensors via customized CUDA kernels, facilitating efficient memory management and parallel computation.
  • Dynamic Draft Length: Algorithmic innovation to dynamically adjust the draft token length during inference, enhancing flexibility and adaptability (an illustrative heuristic is sketched below).
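
The paper's exact draft-length controller is not reproduced here; as a hedged illustration of the idea, the sketch below adjusts the draft length each step from the number of recently accepted tokens, using a simple additive rule with assumed bounds min_len and max_len.

```python
def adjust_draft_len(draft_len, accepted, min_len=1, max_len=10):
    """Illustrative controller (not the paper's exact rule): lengthen the draft
    when recent drafts were fully accepted, shorten it when most drafted tokens
    were rejected by the target model."""
    if accepted >= draft_len:          # full acceptance: draft more next time
        return min(draft_len + 1, max_len)
    if accepted <= draft_len // 2:     # poor acceptance: waste less draft work
        return max(draft_len - 1, min_len)
    return draft_len

# Example trace, feeding back the average accepted count after each step.
draft_len = 4
for avg_accepted in [4, 5, 2, 1, 3]:
    draft_len = adjust_draft_len(draft_len, avg_accepted)
    print(draft_len)   # 5, 6, 5, 4, 4
```

Longer drafts amortize the target model's verification cost when acceptance is high but waste draft-model compute when acceptance is low, which is why the draft length is tuned online rather than fixed.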

Experimental Setup and Results

Models and Metrics

The system was evaluated on models such as OPT 13B and CodeGen-Mono 16B, using HumanEval pass@k for code generation and ROUGE for summarization tasks.
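
For reference, the sketch below implements the standard unbiased pass@k estimator from the HumanEval benchmark (Chen et al., 2021); note that the paper additionally reports batch-level Pass@First and Pass@All rates, which this snippet does not compute.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    one of k samples, drawn without replacement from n generations of which c
    are correct, passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 generations per problem, 3 of which pass the tests.
print(pass_at_k(n=8, c=3, k=1))   # 0.375
```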

Findings

With BASS, notable improvements were observed:

  • Latency and Throughput: Up to a 2.15x speed-up over optimized regular decoding; for a 7.8B-parameter model on a single A100 GPU with a batch size of 8, each sequence is generated at an average of 5.8 ms per token, for an overall throughput of about 1.1K tokens per second.
  • GPU Utilization: Peak GPU utilization during decoding reaches 15.8%, more than 3x that of regular decoding and roughly 10x that of single-sequence speculative decoding.
  • Quality within a Time Budget: Within a time budget in which regular decoding does not finish, BASS generates sequences with HumanEval Pass@First of 43% and Pass@All of 61%.

These improvements translate into better performance for applications such as coding assistants and conversational agents, which can return multiple candidate responses without the extended wait times typical of earlier methods.

Implications and Future Directions

Practical Implications

BASS allows real-world AI systems, especially those requiring real-time interaction and generation of multiple responses, to function more efficiently. This capability can transform user experiences across interfaces where rapid responses are essential.

Theoretical Contributions

This research advances the understanding of how speculative decoding can be combined with batched processing to overcome core bottlenecks in LLM inference, notably memory bandwidth and latency.

Future Research

Further work on reducing disparities in GPU utilization across the different phases of model inference could yield even faster and more efficient systems. Adapting BASS to a wider range of model architectures and sizes, and further optimizing the CUDA kernels, could also extend the benefits observed here.

In conclusion, Batched Attention-optimized Speculative Sampling sets a new benchmark for latency and GPU utilization in multi-sequence LLM generation, offering pathways to both theoretical and practical advances in LLM inference.