BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
Abstract: The advanced capabilities of LLMs have inspired the development of various interactive web services and applications, such as ChatGPT, that offer query inference services to users. Unlike traditional DNN models, LLM inference requires a different number of forward-computation iterations for each query, which poses an efficiency challenge for existing run-to-completion batch-wise inference. Some methods therefore refine batch-wise inference to the iteration level by duplicating all nonlinear layers of the LLM. However, this approach not only increases resource usage but also introduces idle computations into the batch due to the prefilling of newly added queries. We therefore propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch, achieving near-zero idle computation without incurring additional resource consumption. To do so, BATON 1) shapes the vectors involved in the inference of the newly inserted query and of the processing batch to align their dimensions, and generates a new attention mask based on this vector shaping to guarantee inference correctness, which enables query insertion without consuming additional resources; and 2) embeds the prefilled Keys and Values of the new query into the KV_Cache of the processing batch by leveraging the prefilling/decoding separation mechanism, eliminating the idle computations that the new query's prefilling would otherwise impose on the batch. Experimental results show that, compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
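The two mechanisms above can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch illustration, not BATON's actual implementation; names such as `insert_query` and the tensor layout are assumptions. It shows (1) left-padding a new query's token ids and attention mask to match the in-flight batch, and (2) splicing the query's separately prefilled Keys/Values into the batch KV cache so decoding continues without idle prefill computation.

```python
# A minimal sketch (assumed names; not BATON's real code) of dynamic re-batching:
# align a new query with the processing batch via left-padding plus an updated
# attention mask, then embed its prefilled K/V into the batch KV cache.
import torch
import torch.nn.functional as F

def insert_query(batch_ids, batch_mask, batch_kv, new_ids, new_kv, pad_id=0):
    """Insert one prefilled query into a decoding batch.

    batch_ids : (B, T)  token ids of the in-flight batch
    batch_mask: (B, T)  1 = real token, 0 = padding (ignored by attention)
    batch_kv  : list of (K, V) per layer, each (B, H, T, D)
    new_ids   : (t,)    token ids of the new query, t <= T assumed
    new_kv    : list of (k, v) per layer, each (1, H, t, D), produced by a
                separate prefill pass (prefill/decode separation)
    """
    B, T = batch_ids.shape
    t = new_ids.shape[0]
    pad = T - t

    # 1) Vector shaping: left-pad the new query's ids to the batch length T,
    #    and extend the attention mask so padded positions are never attended.
    row_ids = torch.cat([torch.full((pad,), pad_id, dtype=new_ids.dtype), new_ids])
    row_mask = torch.cat([torch.zeros(pad, dtype=batch_mask.dtype),
                          torch.ones(t, dtype=batch_mask.dtype)])
    batch_ids = torch.cat([batch_ids, row_ids[None]], dim=0)
    batch_mask = torch.cat([batch_mask, row_mask[None]], dim=0)

    # 2) KV embedding: left-pad the prefilled K/V along the sequence axis and
    #    append them as a new batch row, so the batch keeps decoding with no
    #    idle recomputation of the newcomer's prompt.
    for i, ((K, V), (k, v)) in enumerate(zip(batch_kv, new_kv)):
        k = F.pad(k, (0, 0, pad, 0))  # pad the seq dim on the left
        v = F.pad(v, (0, 0, pad, 0))
        batch_kv[i] = (torch.cat([K, k], dim=0), torch.cat([V, v], dim=0))

    return batch_ids, batch_mask, batch_kv
```

Because padded positions are zeroed in the attention mask, they contribute nothing to the attention scores, which is what allows a freshly prefilled query to decode correctly alongside queries that are already many iterations ahead.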
- GPT understands, too. AI Open (2023).
- DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7283–7292.
- Towards Robust QA Evaluation via Open LLMs. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2811–2816.
- SUMMEDITS: measuring LLM ability at factual reasoning through the lens of summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 9662–9676.
- OpenAI. 2024. ChatGPT. Retrieved 2024 from https://openai.com
- Microsoft. 2024. Copilot. Retrieved 2024 from https://copilot.microsoft.com
- Google. 2024. Gemini. Retrieved 2024 from https://gemini.google.com
- Baidu. 2024. ERNIE. Retrieved 2024 from https://yiyan.baidu.com
- Alibaba. 2024. Qwen. Retrieved 2024 from https://qianwen.aliyun.com
- Moonshot. 2024. Kimi. Retrieved 2024 from https://kimi.moonshot.cn
- TensorFlow-Serving: Flexible, high-performance ML serving. arXiv preprint arXiv:1712.06139 (2017).
- Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443–462.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation. 117–134.
- Splitwise: Efficient generative LLM inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132.
- NVIDIA. 2024a. FasterTransformer. Retrieved 2024 from https://github.com/NVIDIA/FasterTransformer
- NVIDIA. 2024b. Triton Inference Server. Retrieved 2024 from https://developer.nvidia.com/nvidia-triton-inference-server
- Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538.
- Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920 (2023).
- Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210.
- OpenAI. 2024. GPT-2. Retrieved 2024 from https://huggingface.co/openai-community/gpt2
- Meta. 2024. Llama. Retrieved 2024 from https://www.llama2.ai
- Attention is all you need. Advances in neural information processing systems 30 (2017).
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving. In Forty-first International Conference on Machine Learning.
- Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369 (2023).
- S³: Increasing GPU Utilization during Generative Inference for Higher Throughput. Advances in Neural Information Processing Systems 36 (2023), 18015–18027.
- Hugging Face. 2024a. Llama-2-7b-chat-hf. Retrieved 2024 from https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
- Hugging Face. 2024b. Transformers. Retrieved 2024 from https://github.com/huggingface/transformers
- Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
- Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems. 30016–30030.
- Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
- Google. 2024. PaLM2. Retrieved 2024 from https://ai.google/discover/palm2/
- Response length perception and sequence scheduling: An llm-empowered llm inference pipeline. Advances in Neural Information Processing Systems 36 (2024).
- Full stack optimization of transformer inference: a survey. arXiv preprint arXiv:2302.14017 (2023).
- A survey of techniques for optimizing transformer inference. Journal of Systems Architecture (2023), 102990.
- Accelerating transformer networks through recomposing softmax layers. In 2022 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 92–103.
- A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models. IEEE Computer Architecture Letters (2023).
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
- Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
- HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics. 1–9.
- Colossal-ai: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing. 766–775.
- Understanding INT4 quantization for language models: latency speedup, composability, and failure cases. In International Conference on Machine Learning. PMLR, 37524–37539.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022).
- Batch Prompting: Efficient Inference with Large Language Model APIs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. 792–810.
- High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023).
- TurboTransformers: an efficient GPU serving system for transformer models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 389–402.