Vidur: A Large-Scale Simulation Framework For LLM Inference (2405.05465v2)

Published 8 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Optimizing the deployment of LLMs is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easily-extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration which would require 42K GPU hours - costing ~218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.

An Overview of Vidur: A Large-Scale Simulation Framework for LLM Inference

The paper "Vidur: A Large-Scale Simulation Framework for LLM Inference," presents a detailed simulation framework designed to address the complexities and cost inefficiencies associated with optimizing the deployment of LLMs. The authors outline both the necessity of such a framework and the specific implementation nuances that set Vidur apart from existing simulation tools.

Context and Problem Statement

LLMs such as GPT-4 and LLaMA are foundational to modern NLP applications, yet their inference remains computationally expensive. Optimizing their deployment typically requires experimental runs across a vast configuration space spanning parallelization strategies, batching techniques, and scheduling policies. These experiments are resource-intensive and cost-prohibitive, motivating a more efficient approach to performance modeling.

The Vidur Framework

Vidur is introduced as a simulation framework for high-fidelity, large-scale LLM inference performance modeling. The framework models LLM operators using experimental profiling combined with predictive modeling, and uses these models to estimate key metrics such as latency and throughput. Vidur's fidelity is evidenced by its ability to predict inference latency within a 9% error margin.
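
To make the modeling approach concrete, the sketch below shows how an end-to-end latency estimate could be assembled from per-operator runtime predictions. It is a minimal illustration with assumed names and toy costs, not Vidur's actual API; the per-operator predictor is a stand-in for a learned model (see the random-forest sketch further below).

```python
# Illustrative sketch only -- names, operators, and costs are assumptions, not Vidur's API.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    decode_tokens: int

def predict_operator_time(op: str, num_tokens: int) -> float:
    """Stand-in for a learned per-operator runtime model (milliseconds)."""
    # Toy linear costs; a real predictor would be fit from profiling data.
    cost_per_token_ms = {"attention": 0.02, "mlp": 0.03, "all_reduce": 0.01}
    return cost_per_token_ms[op] * num_tokens

def estimate_request_latency_ms(req: Request) -> float:
    """Sum predicted operator times over the prefill pass and each decode step."""
    ops = ("attention", "mlp", "all_reduce")
    prefill = sum(predict_operator_time(op, req.prompt_tokens) for op in ops)
    per_decode_step = sum(predict_operator_time(op, 1) for op in ops)
    return prefill + per_decode_step * req.decode_tokens

print(estimate_request_latency_ms(Request(prompt_tokens=512, decode_tokens=128)))
```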

Vidur also includes Vidur-Search, a configuration search tool that identifies cost-effective deployment configurations while meeting application performance constraints. For instance, Vidur-Search finds an optimal configuration for LLaMA2-70B in about one hour on a CPU machine, whereas deployment-based exploration would require roughly 42K GPU hours at a cost of about $218K, highlighting the framework's practical value.
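
As a rough illustration of what such a search involves, the sketch below enumerates a small configuration space, filters out configurations that violate a latency constraint, and keeps the cheapest remaining one. The knobs, the simulate() stub, and the prices are hypothetical; a real search would invoke the simulator in place of the stub.

```python
# Illustrative configuration search over simulated results. The knobs, metrics,
# and prices below are hypothetical placeholders, not Vidur-Search's interface.
from itertools import product

TP_DEGREES = [1, 2, 4, 8]      # tensor-parallel degree
BATCH_CAPS = [32, 64, 128]     # maximum batch size
GPU_HOURLY_PRICE = 2.5         # hypothetical $/GPU-hour

def simulate(tp: int, batch_cap: int) -> dict:
    """Stand-in for a simulator invocation that returns predicted metrics."""
    # Toy formulas; a real search would run the simulator here.
    return {"p99_latency_s": 2.0 / tp + 0.005 * batch_cap,
            "throughput_qps": 10.0 * tp * batch_cap / 64}

def cheapest_config(slo_p99_s: float):
    best = None
    for tp, cap in product(TP_DEGREES, BATCH_CAPS):
        metrics = simulate(tp, cap)
        if metrics["p99_latency_s"] > slo_p99_s:
            continue  # violates the latency SLO
        cost_per_query = tp * GPU_HOURLY_PRICE / (3600 * metrics["throughput_qps"])
        if best is None or cost_per_query < best[0]:
            best = (cost_per_query, tp, cap)
    return best

print(cheapest_config(slo_p99_s=0.5))
```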

Key Contributions and Components

1. Architectural Insights

Vidur leverages the architectural similarities among LLMs to streamline the profiling process. By decomposing LLMs into token-level, sequence-level, and communication operators, Vidur minimizes the profiling workload and enhances the accuracy of runtime predictions for unprofiled input sizes.
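
As a rough illustration of this decomposition, the snippet below groups operators by what drives their runtime and picks the corresponding profiling feature. The categories follow the paper's description, but the concrete operator names and the helper function are assumptions for illustration.

```python
# Illustrative grouping of transformer operators by what drives their runtime.
# The categories follow the paper's description; the operator names are assumptions.
OPERATOR_CLASSES = {
    # Runtime depends on the total number of tokens in the batch, regardless of
    # how they are split across requests (e.g. linear projections, MLP layers).
    "token_level": ["qkv_proj", "attn_out_proj", "mlp_up", "mlp_down"],
    # Runtime depends on per-request context length (e.g. attention over the KV cache).
    "sequence_level": ["prefill_attention", "decode_attention"],
    # Runtime depends on message size and the parallelism configuration.
    "communication": ["all_reduce", "all_gather", "send_recv"],
}

def profiling_feature(op: str, num_tokens: int, context_len: int, msg_bytes: int):
    """Choose which input dimension the runtime predictor for `op` should key on."""
    if op in OPERATOR_CLASSES["token_level"]:
        return ("tokens", num_tokens)
    if op in OPERATOR_CLASSES["sequence_level"]:
        return ("context_length", context_len)
    return ("message_bytes", msg_bytes)
```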

2. Profiling and Runtime Estimation

The framework employs a sophisticated profiling mechanism that categorizes operations based on their computational dependencies, such as the context length for sequence-level operations. To predict runtime for unprofiled input sizes, Vidur trains random forest regression models, balancing data frugality and prediction accuracy.
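
A minimal sketch of that idea using scikit-learn is shown below; the features and the profiled data points are toy values, not the paper's actual profiling schema.

```python
# Fit a runtime predictor for one operator from a handful of profiled points,
# then query it at an unprofiled input size. Features and data are toy values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Profiled (num_tokens, context_length) -> measured runtime in milliseconds.
X = np.array([[128, 512], [256, 512], [512, 1024], [1024, 1024], [2048, 2048]])
y = np.array([0.9, 1.7, 3.6, 7.1, 15.0])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Estimate the runtime of an input size that was never profiled.
print(model.predict(np.array([[768, 1536]])))
```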

3. Hierarchical Scheduling

Vidur's hierarchical scheduler supports multiple batching strategies and memory management capabilities. It integrates various scheduling policies, including vLLM, Orca+, and Sarathi-Serve, providing flexibility in simulating diverse deployment scenarios.
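
To illustrate what a pluggable batching policy might look like, the sketch below defines a minimal policy interface and two toy policies loosely inspired by the systems named above. The interface, class names, and admission logic are assumptions, not the schedulers' actual implementations.

```python
# Minimal pluggable-scheduler sketch: batch formation is delegated to a policy.
# Interfaces and logic are illustrative assumptions, not the real schedulers.
from abc import ABC, abstractmethod

class BatchingPolicy(ABC):
    @abstractmethod
    def form_batch(self, waiting: list, running: list, token_budget: int) -> list:
        ...

class GreedyAdmissionPolicy(BatchingPolicy):
    """Admit requests greedily until the per-iteration token budget is exhausted."""
    def form_batch(self, waiting, running, token_budget):
        batch, used = [], 0
        for req in running + waiting:
            if used + req["tokens"] <= token_budget:
                batch.append(req)
                used += req["tokens"]
        return batch

class ChunkedPrefillPolicy(BatchingPolicy):
    """Piggyback chunks of pending prefills onto ongoing decodes (Sarathi-Serve-like idea)."""
    def form_batch(self, waiting, running, token_budget):
        batch = list(running)
        used = sum(r["tokens"] for r in running)
        for req in waiting:
            chunk = min(req["tokens"], token_budget - used)
            if chunk <= 0:
                break
            batch.append({**req, "tokens": chunk})
            used += chunk
        return batch

def schedule_step(policy: BatchingPolicy, waiting, running, token_budget=512):
    """One simulated scheduling iteration: the policy decides what runs next."""
    return policy.form_batch(waiting, running, token_budget)
```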

Evaluation and Fidelity

The evaluation demonstrates Vidur's high fidelity across multiple models and workloads. Static and dynamic workload simulations show that Vidur can predict end-to-end performance metrics with less than 5% median error, even at high system capacities. This accuracy is crucial for production environments where slight misconfigurations can lead to significant cost inefficiencies.

Practical Implications and Future Work

Vidur's ability to deliver accurate performance metrics quickly and cost-effectively has profound implications for the deployment of LLMs. By significantly reducing the barriers to exploring optimal configurations, Vidur enables LLM inference providers to efficiently scale and adapt to new models and workloads.

Looking ahead, Vidur's extensibility suggests potential enhancements, such as supporting more parallelism strategies and integrating energy consumption metrics. These developments could further refine its utility in diverse computational environments.

Conclusion

Vidur represents a substantial advancement in the simulation and optimization of LLM inference. Its precise modeling capabilities, coupled with the practical benefits of Vidur-Search, provide a powerful tool for navigating the complex landscape of LLM deployments. This work not only addresses current inefficiencies but also lays the groundwork for future improvements in LLM-inference optimization.

Authors (8)
  1. Amey Agrawal
  2. Nitin Kedia
  3. Jayashree Mohan
  4. Ashish Panwar
  5. Nipun Kwatra
  6. Bhargav Gulavani
  7. Ramachandran Ramjee
  8. Alexey Tumanov