AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains (2407.21255v3)
Abstract: Inference on large language models (LLMs) is constrained by GPU memory capacity. A sudden burst of inference requests to a cloud-hosted LLM can exhaust GPU memory, forcing multiple prompts to contend for limited resources. Modern LLM serving engines cope with limited GPU memory through admission control, which makes them unresponsive during request bursts. We propose that preemptive scheduling of prompts in time slices is essential for ensuring responsive LLM inference, especially under high load and limited GPU memory. However, preempting prompt inference incurs a high paging overhead, which reduces inference throughput. We present Aqua, a GPU memory management framework that significantly reduces the overhead of paging inference state, achieving both responsive and high-throughput inference even under bursty request patterns. We evaluate Aqua by hosting several state-of-the-art large generative ML models of different modalities on servers with eight Nvidia H100 80 GB GPUs. Aqua improves the responsiveness of LLM inference by 20X compared to the state of the art and improves LLM inference throughput over a single long prompt by 4X.
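As a rough illustration of the paging the abstract refers to (a minimal sketch, not Aqua's actual mechanism), the PyTorch snippet below contrasts the two places a preempted prompt's inference state (its KV-cache block) could be evicted to: a peer GPU reached over the scale-up interconnect, or pinned host memory reached over PCIe. The function name, tensor shape, and device indices are illustrative assumptions.

```python
# Minimal sketch (not Aqua's implementation): contrasts the two destinations a
# preempted prompt's KV-cache block could be paged out to. Function name,
# shapes, and device indices are illustrative assumptions.
import torch

def page_out_kv_block(kv_block: torch.Tensor, use_peer_gpu: bool) -> torch.Tensor:
    """Evict a KV-cache block from the busy GPU when its prompt is preempted."""
    if use_peer_gpu and torch.cuda.device_count() > 1:
        # Peer GPU memory: the copy travels over the scale-up fabric (e.g. NVLink).
        return kv_block.to("cuda:1", non_blocking=True)
    # Host memory: the copy travels over PCIe into a pinned staging buffer.
    host_buf = torch.empty(kv_block.shape, dtype=kv_block.dtype,
                           device="cpu", pin_memory=True)
    host_buf.copy_(kv_block, non_blocking=True)
    return host_buf

if __name__ == "__main__":
    # A stand-in for one prompt's KV-cache block (arbitrary size).
    block = torch.randn(16, 1024, 128, device="cuda:0", dtype=torch.float16)
    evicted = page_out_kv_block(block, use_peer_gpu=True)
    resumed = evicted.to("cuda:0", non_blocking=True)  # page back in on resume
    torch.cuda.synchronize()
```

Timing both paths on an NVLink-connected multi-GPU server would show why offloading to a peer GPU over the scale-up fabric can be far cheaper than paging over PCIe, which is the preemption overhead the abstract says Aqua reduces.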