vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training (2312.12391v2)
Abstract: As LLMs become widespread across application domains, a critical challenge facing the AI community is how to train these large AI models cost-effectively. Existing LLM training plans typically employ heuristic parallelization strategies derived from empirical observations rather than a thorough examination of the LLM parallelization search space. This limitation leaves significant performance on the table, wasting millions of dollars in training cost. This paper presents vTrain, our profiling-driven simulator that gives AI practitioners a fast yet accurate framework for determining an efficient and cost-effective LLM training system configuration. We demonstrate vTrain's practicality through several case studies: evaluating optimal training parallelization strategies that balance training time against its associated cost, designing efficient multi-tenant GPU cluster schedulers for multiple LLM training jobs, and determining a compute-optimal LLM model architecture under a fixed compute budget.
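The abstract does not detail vTrain's internals, but the parallelization search space it refers to can be made concrete with a minimal, hypothetical sketch: enumerate 3D-parallel configurations (data, tensor, and pipeline degrees) for a fixed GPU count and rank them by an estimated time and dollar cost. All constants and the `predicted_iter_time` placeholder below are illustrative assumptions, not vTrain's profiling-driven cost model.

```python
# Illustrative sketch only: vTrain's actual simulator is profiling-driven and not
# reproduced here. This shows the kind of 3D-parallelism search space a training
# planner must explore to trade off training time against dollar cost.
from itertools import product

NUM_GPUS = 1024             # assumed cluster size
GPU_HOUR_PRICE = 4.10       # assumed $/GPU-hour (hypothetical cloud rate)
TOTAL_ITERATIONS = 300_000  # assumed number of training iterations

def predicted_iter_time(dp, tp, pp):
    """Placeholder for a per-iteration latency estimate in seconds.
    A real simulator would derive this from profiled GPU kernel and
    communication times; here we return a dummy value for illustration."""
    return 1.0 + 0.01 * tp + 0.02 * pp + 512.0 / dp

candidates = []
for dp, tp, pp in product([2**i for i in range(11)], repeat=3):
    if dp * tp * pp != NUM_GPUS:
        continue  # keep only configurations that occupy the whole cluster
    iter_time = predicted_iter_time(dp, tp, pp)
    train_hours = iter_time * TOTAL_ITERATIONS / 3600
    cost = train_hours * NUM_GPUS * GPU_HOUR_PRICE
    candidates.append((cost, train_hours, (dp, tp, pp)))

# Rank configurations by estimated dollar cost (or by time, depending on the goal).
for cost, hours, cfg in sorted(candidates)[:5]:
    print(f"(dp, tp, pp)={cfg}  est. time={hours:,.0f} h  est. cost=${cost:,.0f}")
```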
Authors: Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, Minsoo Rhu