- The paper introduces Milabench as a comprehensive suite for benchmarking AI accelerators against real deep learning workloads.
- It employs a rigorous literature review and community surveys to select 42 benchmarks spanning 19 model architectures.
- The evaluation uncovers vendor-specific performance nuances, highlighting gaps between theoretical FLOPs and practical application outcomes.
Insights into Milabench: An Evaluation Platform for AI Accelerators
The paper details the development and capabilities of Milabench, a benchmarking suite tailored to high-performance computing systems running AI workloads, particularly deep learning models. Developed by Mila, a leading academic research institute for deep learning, Milabench addresses the difficulty of evaluating AI workloads across diverse hardware configurations. Unlike traditional HPC benchmarks, which are often limited in scope, Milabench aims to provide a comprehensive, representative, and unbiased testing platform for different AI accelerators.
Design and Methodology
Milabench's design is rooted in an extensive literature review of 867 academic papers by Mila researchers, complemented by surveys of the Mila community. This process yielded 26 primary benchmarks used for procurement assessment and 16 additional benchmarks reserved for deeper analysis. The paper emphasizes three core pillars in the suite's design: simplicity, representativeness, and impartiality. The suite is user-friendly, with a modular interface that supports rapid experimentation; it aims to reflect real-world researcher demands by covering a wide range of topics, avoiding vendor-specific biases, and enabling evaluation across hardware platforms.
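As a purely illustrative sketch of this kind of selection step (not the authors' actual pipeline), one can imagine tallying which model architectures appear most often in the surveyed papers and keeping the most frequent ones as benchmark candidates; the paper titles, architecture labels, and counts below are hypothetical.

```python
from collections import Counter

# Hypothetical metadata extracted from a literature review:
# each entry lists the model architectures a paper relies on.
papers = [
    {"title": "Paper A", "architectures": ["transformer", "resnet"]},
    {"title": "Paper B", "architectures": ["transformer"]},
    {"title": "Paper C", "architectures": ["gnn", "transformer"]},
    {"title": "Paper D", "architectures": ["resnet", "unet"]},
]

# Count how often each architecture is used across the corpus.
usage = Counter(arch for paper in papers for arch in paper["architectures"])

# Keep the most common architectures as candidates for the benchmark suite.
top_candidates = [arch for arch, _count in usage.most_common(3)]
print(usage)           # Counter({'transformer': 3, 'resnet': 2, ...})
print(top_candidates)  # ['transformer', 'resnet', 'gnn'] (ties follow insertion order)
```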
Benchmark Suite Composition
A notable feature of Milabench is its broad coverage of AI domains. It spans natural language processing (NLP), computer vision (CV), reinforcement learning (RL), and graph neural networks (GNNs), using model architectures that range from basic CNNs to large Transformer models. In total, 19 distinct model architectures are tested, underscoring the breadth of the suite. Each benchmark is designed to stay faithful to typical research pipelines, ensuring realistic workload representation.
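To make the coverage claim concrete, a benchmark registry of this kind could be represented as below. The entries are a hypothetical subset for illustration only, not the official Milabench configuration.

```python
from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str          # benchmark identifier
    domain: str        # e.g. NLP, CV, RL, GNN
    architecture: str  # underlying model family

# Hypothetical subset of a registry spanning several AI domains.
SUITE = [
    Benchmark("bert-finetune", "NLP", "transformer"),
    Benchmark("resnet50-train", "CV", "cnn"),
    Benchmark("ppo-control", "RL", "actor-critic"),
    Benchmark("graph-classify", "GNN", "message-passing"),
]

# Sanity check: how many distinct domains and architectures are covered?
domains = {b.domain for b in SUITE}
architectures = {b.architecture for b in SUITE}
print(f"{len(SUITE)} benchmarks, {len(domains)} domains, {len(architectures)} architectures")
```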
The results produced by Milabench span accelerators from NVIDIA, AMD, and Intel. The suite's open-source nature makes it easy to access and extend, encouraging further academic and industry collaboration.
Performance Evaluation and Findings
The evaluation conducted with Milabench highlights several vendor-specific strengths and weaknesses. For instance, NVIDIA's H100 GPUs showed substantial gains in lower-precision computation, particularly with TF32, and outperformed AMD's MI300X and Intel's Gaudi2 on real-world workloads, even though synthetic benchmarks suggested different relative strengths for the AMD and Intel parts.
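One common way to compare accelerators across a heterogeneous suite of this kind is to normalize each benchmark's throughput against a reference device and aggregate with a geometric mean. The sketch below uses invented throughput numbers, and the exact aggregation and weighting Milabench applies may differ.

```python
import math

# Hypothetical throughputs (samples/s) per benchmark on two devices.
reference = {"bert": 100.0, "resnet50": 400.0, "gnn": 50.0}   # baseline device
candidate = {"bert": 180.0, "resnet50": 520.0, "gnn": 45.0}   # device under test

def relative_score(candidate, reference):
    """Geometric mean of per-benchmark speedups over the reference device."""
    ratios = [candidate[name] / reference[name] for name in reference]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(f"relative score: {relative_score(candidate, reference):.2f}x")
# A value above 1.0 means faster than the reference on average.
```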
Milabench results provide critical insights into the current state of AI accelerator performance and reveal a gap between theoretical FLOP capabilities and performance on real workloads. This disparity is attributed to software stack maturity, with newer platforms still struggling to match the well-optimized CUDA ecosystem.
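This gap can be quantified as FLOP utilization: achieved FLOP/s derived from measured training throughput, divided by the device's theoretical peak for the chosen precision. The numbers below are placeholders purely for illustration, not figures from the paper.

```python
# Hypothetical measurements for one training workload.
flops_per_sample = 2.5e12     # FLOPs needed to process one training sample
measured_throughput = 180.0   # samples/s observed on the accelerator
peak_flops = 9.89e14          # vendor-quoted peak FLOP/s for the chosen precision

achieved_flops = flops_per_sample * measured_throughput
utilization = achieved_flops / peak_flops

print(f"achieved: {achieved_flops:.2e} FLOP/s "
      f"({utilization:.1%} of theoretical peak)")
# Low utilization often reflects software-stack maturity (kernels, compilers,
# memory-bandwidth limits) rather than a deficiency of the silicon itself.
```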
Implications and Future Directions
Milabench's comprehensive approach to benchmarking reinforces the importance of robust evaluation tools for AI hardware selection and development. Its findings underscore the need for continued software optimization by vendors, with a focus on broad support for and integration with popular ML libraries rather than solely proprietary solutions.
For future updates, the researchers plan to further automate the literature review, refine the categorization of model architectures, and consider broader evaluation metrics such as energy efficiency, which is increasingly important for HPC systems focused on sustainability.
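If energy efficiency were added as a metric, one simple formulation is throughput per unit energy (samples per joule). The sketch below uses invented numbers and assumes average board power is available from the vendor's monitoring tools.

```python
# Hypothetical measurements for one benchmark run.
samples_processed = 1_000_000   # total training samples in the run
wall_time_s = 3_600.0           # run duration in seconds
avg_power_w = 550.0             # average board power during the run (watts)

throughput = samples_processed / wall_time_s   # samples/s
energy_j = avg_power_w * wall_time_s           # total energy in joules
samples_per_joule = samples_processed / energy_j

print(f"{throughput:.1f} samples/s, {samples_per_joule:.3f} samples/J")
```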
In conclusion, Milabench is a benchmark suite that evaluates AI accelerators with high fidelity while also guiding hardware procurement strategies and encouraging improvements in vendor software support. Its continued evolution should help keep hardware evaluation aligned with the rapidly progressing needs of AI research.