BASS: Batched Attention-optimized Speculative Sampling (2404.15778v2)
Abstract: Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting LLMs. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, for an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system generates sequences with a HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest utilization of regular decoding and around 10X that of single-sequence speculative decoding.
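For context, the sketch below illustrates the general draft-then-verify loop that batched speculative decoding builds on: a cheap draft model proposes a few tokens per sequence, the target model scores all proposals in a single batched forward pass, and each sequence accepts a different-length prefix of its proposal. This is a minimal toy illustration under stated assumptions, not the BASS implementation; the arithmetic "models", the helper names (`draft_step`, `target_logits`, `speculative_decode`), and the greedy acceptance rule are all stand-ins.

```python
# Toy sketch of batched draft-then-verify speculative decoding (greedy acceptance).
# The draft/target "models" are arithmetic stand-ins, not real LLMs, and the
# helper names (draft_step, target_logits, speculative_decode) are hypothetical.
import numpy as np

VOCAB, DRAFT_LEN = 50, 4
rng = np.random.default_rng(0)

def draft_step(tokens):
    # Cheap draft model: propose the next token for every sequence in the batch.
    return (tokens[:, -1] * 7 + 3) % VOCAB                    # (batch,)

def target_logits(tokens):
    # Expensive target model: score all positions in one batched forward pass.
    batch, seq = tokens.shape
    logits = rng.standard_normal((batch, seq, VOCAB))
    # Bias toward the draft's rule so some proposals get accepted.
    logits[np.arange(batch)[:, None], np.arange(seq), (tokens * 7 + 3) % VOCAB] += 3.0
    return logits                                             # (batch, seq, VOCAB)

def speculative_decode(prompts, new_tokens=16):
    tokens = np.array(prompts)                                # (batch, prompt_len)
    batch = tokens.shape[0]
    done = np.zeros(batch, dtype=int)
    while done.min() < new_tokens:
        ctx_len = tokens.shape[1]
        # 1) Draft: autoregressively propose DRAFT_LEN tokens per sequence.
        drafted = tokens
        for _ in range(DRAFT_LEN):
            drafted = np.concatenate([drafted, draft_step(drafted)[:, None]], axis=1)
        # 2) Verify: one target pass predicts the token following every position.
        preds = target_logits(drafted).argmax(-1)
        check = preds[:, ctx_len - 1:ctx_len + DRAFT_LEN - 1]  # aligned with proposals
        proposal = drafted[:, ctx_len:]
        # 3) Accept the longest matching prefix of each proposal, then append one
        #    target token (a correction, or a bonus token if everything matched).
        match = check == proposal
        n_accept = np.where(match.all(1), DRAFT_LEN, match.argmin(1))
        extra = np.where(n_accept < DRAFT_LEN,
                         check[np.arange(batch), np.minimum(n_accept, DRAFT_LEN - 1)],
                         preds[:, -1])
        rows = [np.concatenate([tokens[i], proposal[i, :n_accept[i]], [extra[i]]])
                for i in range(batch)]
        done += n_accept + 1
        # Acceptance lengths differ per sequence, so the batch becomes ragged;
        # this toy crudely left-pads, whereas real systems need masking/kernels.
        width = max(len(r) for r in rows)
        tokens = np.stack([np.pad(r, (width - len(r), 0)) for r in rows])
    return tokens

print(speculative_decode([[1, 2, 3], [4, 5, 6]]).shape)
```

Note the ragged acceptance lengths across the batch at step 3: handling that raggedness efficiently during batched attention is precisely the challenge the paper targets.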
Authors: Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras