
Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms (2404.12674v3)

Published 19 Apr 2024 in cs.DC, cs.LG, and cs.PF

Abstract: Characterizing and predicting the training performance of modern ML workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter- and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of DLRM models with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based NLP models with a geomean error of 3.00%. Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).
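The abstract's central idea can be illustrated with a minimal sketch: predict per-rank compute time and exposed (non-overlapped) communication time, treat collectives as inter-rank synchronization points so the slowest rank sets the per-iteration time, and then rank candidate embedding-table sharding plans without running them on hardware. All class/function names and numbers below are hypothetical placeholders, not the authors' actual pipeline or API; in the paper, the per-rank estimates would come from data-distribution-aware kernel models and collective data-movement predictions, while here they are taken as given inputs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RankEstimate:
    compute_ms: float       # predicted GPU compute time (dense ops + embedding lookups)
    exposed_comm_ms: float  # predicted collective time not hidden behind compute


def predict_iteration_ms(per_rank: List[RankEstimate]) -> float:
    # Collectives act as synchronization points across ranks, so the
    # slowest rank determines the per-iteration training time.
    return max(r.compute_ms + r.exposed_comm_ms for r in per_rank)


def pick_fastest_sharding(candidates: Dict[str, List[RankEstimate]]) -> str:
    # Rank candidate embedding-table sharding plans purely by predicted
    # per-iteration time, i.e. without running the workload on hardware.
    return min(candidates, key=lambda name: predict_iteration_ms(candidates[name]))


# Toy usage with made-up numbers for a 2-GPU platform.
plans = {
    "size_greedy":   [RankEstimate(7.1, 1.9), RankEstimate(9.4, 2.3)],
    "lookup_greedy": [RankEstimate(8.0, 1.8), RankEstimate(8.2, 1.9)],
}
print(pick_fastest_sharding(plans))  # -> lookup_greedy (better load balance across ranks)
```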

Authors (6)
  1. Zhongyi Lin
  2. Ning Sun
  3. Pallab Bhattacharya
  4. Xizhou Feng
  5. Louis Feng
  6. John D. Owens