
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization (2403.01136v1)

Published 2 Mar 2024 in cs.LG, cs.AI, and cs.DC

Abstract: Recent breakthroughs in large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are largely served using uniform high-caliber GPUs nowadays, utilizing a heterogeneous cluster with a mix of available high- and low-capacity GPUs can potentially substantially reduce the serving cost. There is a lack of designs to support efficient LLM serving using a heterogeneous cluster, while current solutions focus on model partition and uniform compression among homogeneous devices. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters. We carefully decide on mixed-precision model quantization together with phase-aware model partition and micro-batch sizing in distributed LLM serving with an efficient algorithm, to greatly enhance inference throughput while fulfilling user-specified model quality targets. Extensive experiments on production inference workloads in 11 different clusters demonstrate that LLM-PQ achieves up to 2.88x (2.26x on average) throughput improvement in inference, showing great advantages over state-of-the-art works.

Optimizing LLM Deployment on Heterogeneous GPU Clusters with LLM-PQ

In the cutting-edge sphere of generative AI, the computational and memory demands of serving large language models (LLMs) effectively are formidable. This paper introduces LLM-PQ, a system designed to improve the efficiency of LLM serving on heterogeneous GPU clusters by combining adaptive model quantization with phase-aware model partitioning.

Phase-Aware Model Partitioning and Adaptive Quantization

The paper addresses a critical bottleneck in the deployment of LLMs – their immense size and corresponding resource demands. The authors propose a novel approach, LLM-PQ, which stands for Phase-aware Partition and Adaptive Quantization, tailored to optimize LLM serving on heterogeneous GPU clusters.

The key insight is two-fold:

  1. Phase-Aware Partitioning: Autoregressive LLMs such as GPT-3 go through two distinct phases during inference – prompt processing (prefill) and token generation (decoding) – which stress compute and memory differently. By taking both phases into account when partitioning the model across GPUs, LLM-PQ achieves a more balanced workload distribution and better resource utilization.
  2. Adaptive Quantization: Diverging from uniform, one-size-fits-all quantization, LLM-PQ adapts the quantization precision to the memory capacity and computational power of each GPU in a heterogeneous cluster. This avoids leaving memory idle on high-capacity GPUs while reducing the risk of out-of-memory errors on constrained devices; a toy sketch of how the two decisions interact follows this list.
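
To make the interplay concrete, here is a minimal, hypothetical sketch (not the paper's algorithm): layers are first split across GPUs in proportion to compute capability, and each shard is then quantized to the highest precision that fits its GPU's memory. The GPU specs, model shape, and cost formulas below are illustrative assumptions; LLM-PQ itself searches partition, bitwidths, and micro-batch sizes jointly with a cost-model-driven optimizer under user-specified quality targets.

```python
# Toy illustration only: the cost numbers, GPU specs, and the two-step
# "partition then quantize" heuristic are assumptions for exposition,
# not LLM-PQ's actual optimizer.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GPU:
    name: str
    mem_gb: float     # usable device memory
    tflops: float     # rough compute capability

def layer_weight_gb(hidden: int, bits: int) -> float:
    """Approximate weight memory of one transformer block (~12 * hidden^2 params)."""
    return 12 * hidden * hidden * bits / 8 / 1e9

def partition(num_layers: int, gpus: List[GPU]) -> List[int]:
    """Split layers across GPUs in proportion to compute capability -- a crude
    stand-in for balancing the compute-bound prompt-processing phase."""
    total = sum(g.tflops for g in gpus)
    counts = [round(num_layers * g.tflops / total) for g in gpus]
    counts[-1] += num_layers - sum(counts)  # absorb rounding drift
    return counts

def choose_bits(count: int, gpu: GPU, hidden: int,
                bitwidths: Tuple[int, ...] = (16, 8, 4)) -> int:
    """Pick the highest precision whose weights fit in ~90% of the GPU's
    memory -- a crude stand-in for memory-aware adaptive quantization."""
    for bits in bitwidths:
        if count * layer_weight_gb(hidden, bits) <= 0.9 * gpu.mem_gb:
            return bits
    return min(bitwidths)  # nothing fits: the real optimizer would repartition

if __name__ == "__main__":
    # Hypothetical heterogeneous pair and an OPT-66B-like model shape.
    cluster = [GPU("A100-40G", 40, 312), GPU("V100-32G", 32, 125)]
    hidden, num_layers = 9216, 64
    for gpu, count in zip(cluster, partition(num_layers, cluster)):
        bits = choose_bits(count, gpu, hidden)
        mem = count * layer_weight_gb(hidden, bits)
        print(f"{gpu.name}: {count} layers @ {bits}-bit (~{mem:.1f} GB weights)")
```

Note how the two decisions interact even in this toy: the GPU with more compute receives more layers and may therefore need a lower bitwidth than a smaller GPU holding fewer layers, which is exactly why LLM-PQ optimizes partition and precision together rather than sequentially.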

Experimental Validation

LLM-PQ's efficacy is demonstrated through extensive experiments on production inference workloads across 11 different heterogeneous clusters. The results show up to 2.88x (2.26x on average) throughput improvement in inference relative to state-of-the-art methods, underscoring the value of adaptive quantization and phase-aware partitioning for efficient LLM serving.

Theoretical Contributions and Practical Implications

The authors design a cost model that predicts memory requirements and inference latency under mixed-precision quantization schemes. They also introduce a variance-based indicator of each layer's sensitivity to different quantization levels, a noteworthy contribution that guides bitwidth selection.
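
Below is a minimal sketch of one plausible form such an indicator could take, assuming a layer's sensitivity is proxied by the variance of its weight quantization error; the exact indicator and the bitwidth-selection procedure in LLM-PQ may differ, and the layer shapes, probe bitwidth, and memory budget here are made up for illustration.

```python
# Hypothetical variance-style sensitivity indicator and a greedy bitwidth
# selector under a memory budget; illustrative only, not LLM-PQ's method.

import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization followed by de-quantization."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def sensitivity(w: np.ndarray, probe_bits: int = 4) -> float:
    """Variance of the quantization error at a probe bitwidth."""
    return float(np.var(w - fake_quantize(w, probe_bits)))

def select_bitwidths(layers, mem_budget_gb: float, low: int = 4, high: int = 8):
    """Start every layer at `low` bits, then promote the most sensitive layers
    to `high` bits while total weight memory stays within the budget."""
    bits = {name: low for name, _ in layers}
    mem_gb = sum(w.size * low / 8 for _, w in layers) / 1e9
    for name, w in sorted(layers, key=lambda lw: sensitivity(lw[1]), reverse=True):
        extra_gb = w.size * (high - low) / 8 / 1e9
        if mem_gb + extra_gb <= mem_budget_gb:
            bits[name] = high
            mem_gb += extra_gb
    return bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Made-up layers whose weight spread grows with depth, so deeper layers
    # incur larger quantization error and win the higher bitwidth first.
    layers = [(f"layer{i}", rng.normal(0.0, 0.02 * (i + 1), size=(1024, 1024)))
              for i in range(6)]
    print(select_bitwidths(layers, mem_budget_gb=0.006))
```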

Practically, LLM-PQ has profound implications for cloud-based AI services and machine learning clusters, where heterogeneity of computing resources is common. By optimizing the deployment of LLMs across diverse GPU setups, LLM-PQ paves the way for cost-efficient, high-performance AI applications.

Future Directions

Looking ahead, the integration of LLM-PQ with tensor-parallelism techniques and exploration of its applicability to online serving tasks represent exciting avenues for research. Additionally, the adaptation of LLM-PQ to accommodate emerging quantization schemes could further refine its effectiveness and broaden its applicability.

Conclusion

LLM-PQ stands as a significant advancement in the domain of LLM serving, addressing the challenge of efficiently deploying these colossal models in heterogeneous computing environments. Through intelligent layer partitioning and adaptively adjusting quantization precision, LLM-PQ unlocks new possibilities for leveraging the full potential of mixed-capability GPU clusters, marking a pivotal step forward in the scalable and cost-effective execution of large-scale LLMs.

Authors (5)
  1. Juntao Zhao (6 papers)
  2. Borui Wan (6 papers)
  3. Yanghua Peng (18 papers)
  4. Haibin Lin (35 papers)
  5. Chuan Wu (68 papers)
Citations (6)