MLPerf Training Benchmark (1910.01500v3)

Published 2 Oct 2019 in cs.LG, cs.PF, and stat.ML

Abstract: Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is difficult. We therefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively evaluates MLPerf's efficacy at driving performance and scalability improvements across two rounds of results from multiple vendors.

Citations (286)

Summary

  • The paper establishes a standardized benchmarking process for ML training using a 'time to train' metric that balances speed and accuracy.
  • The methodology includes a detailed submission framework with closed and open divisions to ensure fair system comparisons and reproducibility.
  • Key results reveal significant performance improvements and scalability, demonstrating the benchmark's role in driving innovation across diverse ML tasks.

An Overview of the MLPerf Training Benchmark

The paper presents the MLPerf Training Benchmark, an initiative designed to establish standardized performance evaluations for ML systems. The benchmark targets the unique challenges posed by ML training and aims to provide a fair, effective way to compare the diverse and rapidly evolving landscape of ML software and hardware solutions.

Key Challenges and Solutions

MLPerf confronts several challenges that distinguish ML training benchmarks from those in other computational domains. Notably, optimizations that raise training throughput can paradoxically prolong the time to solution, training is intrinsically stochastic so the time to solution exhibits high variance, and the hardware and software ecosystem is so diverse that like-for-like comparison is difficult.

To overcome these challenges, MLPerf defines comprehensive benchmarking goals: enabling fair system comparisons while fostering innovation, promoting reproducibility, and serving both commercial and research communities. The paper explains how MLPerf builds on prior ML benchmarking efforts such as DAWNBench by offering a broader benchmark suite, an end-to-end training metric, and the backing of a consortium akin to industry-standard bodies like SPEC.
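
To make the variance challenge concrete, the sketch below shows one way a stochastic time to solution can be summarized across repeated runs. The choice of dropping the fastest and slowest run before averaging, and the run times themselves, are illustrative assumptions rather than a verbatim restatement of the MLPerf rules.

    # Illustrative aggregation of repeated timed runs (hypothetical data).
    # Because time to solution is stochastic, a single run is a noisy
    # measurement; trimming the extremes before averaging is one robust summary.

    def aggregate_run_times(run_times_minutes):
        """Drop the fastest and slowest run, then average the rest."""
        if len(run_times_minutes) < 3:
            raise ValueError("need at least three runs to drop the extremes")
        trimmed = sorted(run_times_minutes)[1:-1]
        return sum(trimmed) / len(trimmed)

    # Five hypothetical runs of the same benchmark on the same system (minutes).
    times = [112.4, 118.9, 115.1, 140.2, 113.7]
    print(f"reported time to train: {aggregate_run_times(times):.1f} min")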

Benchmark Suite and Metrics

The MLPerf Training benchmark suite consists of diverse ML tasks, selected for their commercial and research relevance across areas such as vision, language, recommendation, and reinforcement learning. The benchmarks include well-known models and data sets such as ResNet-50 for image classification and GNMT for translation.

The core performance metric adopted by MLPerf is the "time to train": the end-to-end wall-clock time needed to train a model to a specified quality target. By tying speed to a required quality level, the metric rewards fast training only when it still produces a sufficiently accurate model.
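
As a rough illustration of how such a metric can be instrumented, the sketch below times a training loop until a quality target is first reached. The train_one_epoch and evaluate callables, the epoch budget, and the toy usage are hypothetical stand-ins, not MLPerf reference code.

    import time

    # Minimal sketch of a "time to train" measurement: wall-clock time from the
    # start of training until evaluation first meets the quality target.

    def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=90):
        start = time.time()
        for _ in range(max_epochs):
            train_one_epoch()                 # one full pass over the training data
            if evaluate() >= target_quality:  # e.g. validation top-1 accuracy
                return time.time() - start
        raise RuntimeError("quality target not reached within the epoch budget")

    # Toy usage with stand-in callables; a real submission would plug in the
    # benchmark model's actual training and evaluation routines.
    qualities = [0.60, 0.70, 0.76]
    elapsed = time_to_train(lambda: None, lambda: qualities.pop(0), target_quality=0.75)
    print(f"time to train: {elapsed:.3f} s")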

Submission and Evaluation Process

The MLPerf submission process is structured to ensure fairness and reproducibility. It includes detailed system descriptions, training log files, and compliance checks against reference implementations. The benchmark is split into a closed division, which constrains models and training rules to enable direct, standardized comparisons, and an open division, which relaxes those constraints to encourage innovative solutions.

Submissions are categorized based on their availability status, distinguishing between available, preview, and research systems. This classification allows MLPerf to accommodate cutting-edge research systems and pre-release prototypes while maintaining a clear distinction between mature products and nascent technologies.
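
As a hedged illustration of the kind of automated check such a process might involve, the sketch below scans a training log to confirm that the run reports reaching the quality target and to recover its implied time to train. The JSON line format, field names, and values are invented for this example and are not the actual MLPerf logging schema or compliance tooling.

    import json

    # Hypothetical log check: each line is assumed to be a JSON record with an
    # "event" name, a "timestamp" in seconds, and (for evaluations) a "quality".

    def check_log(lines, target_quality):
        start = end = None
        for line in lines:
            record = json.loads(line)
            if record["event"] == "run_start":
                start = record["timestamp"]
            elif record["event"] == "eval_result" and record["quality"] >= target_quality:
                end = record["timestamp"]
                break
        if start is None or end is None:
            raise ValueError("log never reports reaching the quality target")
        return end - start  # time to train implied by the log, in seconds

    log = [
        '{"event": "run_start", "timestamp": 0}',
        '{"event": "eval_result", "timestamp": 3600, "quality": 0.742}',
        '{"event": "eval_result", "timestamp": 7200, "quality": 0.761}',
    ]
    print(check_log(log, target_quality=0.759))  # prints 7200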

Results and Impact

The paper reports significant performance gains between the MLPerf v0.5 and v0.6 rounds, demonstrating the benchmark's role in driving improvements. For instance, the best 16-chip results sped up by roughly 1.3x on average, even though several target quality thresholds were raised between the rounds.

The benchmark also revealed substantial scaling improvements, with the number of chips that systems could productively employ increasing between rounds. These advances underscore MLPerf's efficacy in pushing for better implementations and software stacks, and hint at hardware innovations to come.

Theoretical and Practical Implications

MLPerf Training provides essential insights into ML benchmarking and infrastructure. It highlights the importance of realistic data sets and the influence of hyperparameter choices on performance. Additionally, the benchmark underscores the need for precise definitions of model architectures and training procedures to facilitate meaningful comparisons.

The MLPerf initiative is a vital step towards creating a standardized benchmark that can evolve with the advancements in ML, thus providing lasting value to the community. Its structured approach and collaborative development process have attracted significant industry support, indicating its potential to become an industry-standard benchmark.

Overall, the MLPerf Training Benchmark represents a significant effort to address the complexities of ML benchmarking in a comprehensive and systematic manner, offering a robust platform for evaluating and improving ML training systems.
