MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems (2110.11466v2)
Abstract: Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need for fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard for benchmarking machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present results from the first submission round, with submissions from a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems, such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10\times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.
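As context for the roofline analysis mentioned at the end of the abstract, the standard roofline model (a textbook formulation, not taken from this paper) bounds attainable throughput $P$ by the lesser of the peak compute rate $P_{\mathrm{peak}}$ and the product of memory bandwidth $B$ and arithmetic intensity $I$ (FLOPs per byte moved):

$$ P \le \min\left(P_{\mathrm{peak}},\; B \cdot I\right) $$

The extended roofline models the authors plan to parameterize would presumably add further ceilings (e.g., for I/O and network bandwidth) on top of this basic compute/memory bound.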
- Steven Farrell
- Murali Emani
- Jacob Balma
- Lukas Drescher
- Aleksandr Drozd
- Andreas Fink
- Geoffrey Fox
- David Kanter
- Thorsten Kurth
- Peter Mattson
- Dawei Mu
- Amit Ruhela
- Kento Sato
- Koichi Shirahata
- Tsuguchika Tabaru
- Aristeidis Tsaris
- Jan Balewski
- Ben Cumming
- Takumi Danjo
- Jens Domke