Evaluation of the #15 Pre-Exascale Supercomputer
The paper presents a comprehensive analysis of the #15 system, a pre-exascale supercomputer hosted at the Barcelona Supercomputing Center (BSC). Part of the EuroHPC Joint Undertaking initiative, the machine is designed to serve a broad range of scientific workloads. It delivers a peak performance of 314 petaflops through a hybrid architecture that combines Intel Sapphire Rapids CPUs, NVIDIA Hopper GPUs, and both DDR5 and high-bandwidth memory (HBM). The system is divided into four partitions, each tailored to a specific class of computational task.
The evaluation comprises several tiers of benchmarks that assess the system's capabilities. These range from low-level architectural measurements, such as micro-benchmarks for floating-point throughput and memory bandwidth, to established High-Performance Computing (HPC) benchmarks such as High-Performance Linpack (HPL) and High-Performance Conjugate Gradients (HPCG), and finally to production applications in fluid dynamics, computational mechanics, and climate modeling.
System Architecture and Benchmarks
Architecture Overview:
The #15 architecture combines current-generation computational technologies and standards. The General Purpose Partition (GPP) uses Intel Sapphire Rapids CPUs with DDR5 memory and is well suited to memory-intensive workloads. The Accelerated Partition (ACC) pairs Intel CPUs with NVIDIA Hopper GPUs, interconnected via NVLink and PCIe Gen5, making it the natural target for GPU-accelerated tasks. The system is also designed to accommodate next-generation partitions, giving it the flexibility and modularity to absorb future technological advances.
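As a point of reference, headline peak figures of this kind are typically derived from node count, cores per node, clock frequency, and per-cycle FLOP throughput (plus the GPU contribution for accelerated partitions). The sketch below only illustrates that arithmetic; every value in it is a placeholder, not the actual #15 configuration.

```c
/* Illustration of how a CPU-partition peak figure is derived.
 * All inputs are placeholder values, not #15's real configuration. */
#include <stdio.h>

int main(void) {
    double nodes          = 1000.0; /* assumed node count */
    double cores_per_node = 112.0;  /* assumed: 2 sockets x 56 cores */
    double clock_ghz      = 2.0;    /* assumed sustained AVX-512 clock */
    double flops_per_cyc  = 32.0;   /* 2 FMA ports x 8 doubles x 2 flops */

    /* GHz * flops/cycle gives Gflop/s per core; aggregate and convert. */
    double peak_tflops = nodes * cores_per_node * clock_ghz * flops_per_cyc / 1e3;
    printf("CPU partition peak: %.1f Tflop/s (%.3f Pflop/s)\n",
           peak_tflops, peak_tflops / 1e3);
    return 0;
}
```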
Micro-Benchmarking:
The micro-benchmarks show that the system approaches its theoretical performance limits under favorable conditions. Single-core FMA kernels sustain close to the theoretical peak, with small deviations attributable to architectural effects such as AVX-512 frequency scaling. The paper also notes that although measured memory bandwidth falls short of the theoretical maximum, HBM nodes hold a substantial bandwidth advantage over DDR nodes, even though the Sapphire Rapids CPUs cannot yet fully exploit the available HBM bandwidth.
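For context, a minimal single-core FMA throughput kernel of the kind used in such micro-benchmarks might look as follows. This is an illustrative sketch, not the paper's benchmark code; it assumes an AVX-512 core with two FMA ports and must be compiled with AVX-512 enabled (e.g. gcc -O2 -mavx512f).

```c
/* Illustrative single-core FMA throughput kernel (not the paper's code).
 * Assumed core: two 512-bit FMA ports, so 8 doubles/vector * 2 flops/FMA
 * * 2 ports = 32 flops/cycle per-core ceiling. */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const long iters = 100000000L;   /* 100M iterations, 8 FMAs each */
    /* Eight independent accumulators hide the FMA latency chain. */
    __m512d a0 = _mm512_set1_pd(1.0), a1 = a0, a2 = a0, a3 = a0;
    __m512d a4 = a0, a5 = a0, a6 = a0, a7 = a0;
    const __m512d x = _mm512_set1_pd(0.9999999);
    const __m512d y = _mm512_set1_pd(1e-7);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        a0 = _mm512_fmadd_pd(x, a0, y);
        a1 = _mm512_fmadd_pd(x, a1, y);
        a2 = _mm512_fmadd_pd(x, a2, y);
        a3 = _mm512_fmadd_pd(x, a3, y);
        a4 = _mm512_fmadd_pd(x, a4, y);
        a5 = _mm512_fmadd_pd(x, a5, y);
        a6 = _mm512_fmadd_pd(x, a6, y);
        a7 = _mm512_fmadd_pd(x, a7, y);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = (double)iters * 8.0 * 8.0 * 2.0; /* FMAs * lanes * flops */

    /* Sum the accumulators so the compiler cannot discard the loop. */
    double out[8], sink = 0.0;
    __m512d sum = _mm512_add_pd(_mm512_add_pd(a0, a1), _mm512_add_pd(a2, a3));
    sum = _mm512_add_pd(sum, _mm512_add_pd(_mm512_add_pd(a4, a5),
                                           _mm512_add_pd(a6, a7)));
    _mm512_storeu_pd(out, sum);
    for (int i = 0; i < 8; i++) sink += out[i];

    printf("%.2f GFLOP/s (sink=%g)\n", flops / sec / 1e9, sink);
    return 0;
}
```

With eight independent accumulator chains the loop can keep both FMA ports busy despite the instruction latency, so the measured rate should approach the per-core ceiling, reduced in practice by effects such as AVX-512 frequency scaling.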
HPC Benchmarking:
The HPL results place the GPP and ACC partitions 22nd and 8th on the Top500 list, respectively, with RMax reaching 89.31% and 69.96% of RPeak. These figures confirm the system's strength on compute-bound workloads, but the gap between RMax and RPeak, particularly on the accelerated partition, points to remaining headroom in scaling and execution efficiency.
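The efficiency figure quoted here is simply the ratio of sustained to theoretical performance. A trivial sketch of that relation, using round hypothetical Pflop/s values rather than the partitions' published RMax/RPeak numbers:

```c
/* HPL efficiency = RMax / RPeak. The values below are hypothetical
 * round numbers for illustration, not the #15 partitions' figures. */
#include <stdio.h>

static double hpl_efficiency(double rmax_pflops, double rpeak_pflops) {
    return 100.0 * rmax_pflops / rpeak_pflops;
}

int main(void) {
    printf("Partition A: %.2f%% of peak\n", hpl_efficiency(40.0, 45.0));
    printf("Partition B: %.2f%% of peak\n", hpl_efficiency(175.0, 250.0));
    return 0;
}
```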
The analysis also covers real-world applications: Alya, OpenFOAM, and the Integrated Forecasting System (IFS). Each exercises the machine differently and exhibits its own scalability and efficiency profile. Alya scales efficiently up to mid-range node counts before communication overheads cause diminishing returns. OpenFOAM scales robustly up to approximately 1,000 nodes but loses computational efficiency beyond that point. IFS shows inefficiencies in its hybrid MPI/OpenMP configuration, largely attributable to load imbalance.
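These scalability statements reduce to standard strong-scaling arithmetic: parallel efficiency at N nodes is the speedup relative to a reference run divided by the increase in resources. A small sketch with invented timings (not measurements from the paper):

```c
/* Strong-scaling efficiency of the kind used to characterise Alya,
 * OpenFOAM and IFS: efficiency(N) = (T_ref / T_N) * (N_ref / N).
 * The node counts and wall-clock times below are invented examples. */
#include <stdio.h>

int main(void) {
    int nodes[]   = {16, 64, 256, 1024};          /* example node counts */
    double time[] = {1000.0, 260.0, 75.0, 28.0};  /* assumed seconds */
    int n = sizeof(nodes) / sizeof(nodes[0]);

    for (int i = 0; i < n; i++) {
        double speedup = time[0] / time[i];
        double eff = speedup * nodes[0] / nodes[i];
        printf("%5d nodes: speedup %6.1fx, parallel efficiency %5.1f%%\n",
               nodes[i], speedup, 100.0 * eff);
    }
    return 0;
}
```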
Energy Efficiency and Memory Configurations
A significant part of the evaluation concerns energy consumption and efficiency, measured through the Energy Aware Runtime (EAR) framework, which instruments the energy behavior of individual jobs. The paper finds that energy consumption scales predictably with the number of nodes used, and that the Energy-Delay Product (EDP) offers a more nuanced criterion for choosing an efficient configuration than energy or runtime alone.
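The EDP is simply the product of consumed energy and runtime, so it penalizes configurations that save energy only by running much longer. A hedged sketch with assumed runtimes and an assumed average power per node (not EAR measurements):

```c
/* Energy-Delay Product sketch: EDP = energy * runtime, lower is better.
 * Node counts, runtimes and the ~700 W/node average power are assumptions
 * for illustration, not values taken from the paper's EAR data. */
#include <stdio.h>

int main(void) {
    int nodes[]      = {32, 128, 512};
    double time_s[]  = {900.0, 260.0, 95.0};  /* assumed wall-clock seconds */
    double node_w    = 700.0;                 /* assumed average W per node */
    int n = sizeof(nodes) / sizeof(nodes[0]);

    for (int i = 0; i < n; i++) {
        double power_w  = nodes[i] * node_w;
        double energy_j = power_w * time_s[i];   /* joules */
        double edp      = energy_j * time_s[i];  /* joule-seconds */
        printf("%4d nodes: energy %7.2f MJ, EDP %10.1f MJ*s\n",
               nodes[i], energy_j / 1e6, edp / 1e6);
    }
    return 0;
}
```

With these assumed numbers the total energy grows as nodes are added while the EDP still falls, which is exactly the trade-off the metric is meant to expose.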
The comparison of DDR5 and HBM configurations shows substantial gains for memory-intensive workloads on HBM nodes. The associated energy costs, however, mean that the memory configuration should be chosen deliberately, based on the requirements of each workload.
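Comparisons of this kind typically rest on a STREAM-triad-style kernel run unchanged on both node types. A minimal OpenMP sketch, with the array size and kernel chosen here as assumptions (compile with -fopenmp, then compare the reported GB/s on a DDR node and an HBM node):

```c
/* STREAM-triad-style bandwidth probe for comparing DDR5 and HBM nodes.
 * Array size and thread placement are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 27)   /* 128 Mi doubles per array, ~1 GiB each */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
    const double scalar = 3.0;

    /* First-touch initialisation so pages land near the threads using them. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];   /* triad: 2 reads + 1 write per element */
    double t1 = omp_get_wtime();

    double bytes = 3.0 * N * sizeof(double);
    printf("triad bandwidth: %.1f GB/s\n", bytes / (t1 - t0) / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```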
Conclusions and Implications
The evaluation of #15 highlights both its strengths and its remaining room for refinement. The system delivers strong performance on demanding workloads, but application scalability and energy efficiency at large node counts still leave room for optimization. Researchers using #15 can draw on these findings to tune their workloads and make better use of the machine. The paper further suggests that continued advances in computational architecture and runtime optimization could significantly extend the capabilities of systems like #15 on the path to exascale computing.