- The paper presents validated in-core performance models for three state-of-the-art CPUs, comparing instruction throughput, latency, and write-allocate evasion.
- It employs 416 benchmark tests together with the OSACA and LLVM-MCA tools to quantify architectural strengths and expose performance gaps.
- Results indicate Grace excels in scalar workloads, Sapphire Rapids optimizes vector throughput, and Genoa offers balanced performance for diverse HPC tasks.
An Analysis of the Microarchitectural Comparison and In-core Modeling of the Grace, Sapphire Rapids, and Genoa CPUs
The paper “Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa” by Jan Laukemann, Georg Hager, and Gerhard Wellein presents a detailed analysis and performance modeling of three contemporary CPU architectures: Nvidia’s Grace, Intel’s Sapphire Rapids, and AMD’s Genoa. Through rigorous measurement and modeling, the authors characterize the capabilities of these CPUs, contributing to a broader understanding of their microarchitectural strengths and weaknesses.
Microarchitectural Analysis
Instruction Stream Throughput Prediction
The core of the paper lies in developing and validating in-core performance models for the CPUs in question. Using the Open Source Architecture Code Analyzer (OSACA) and comparing its predictions against those of LLVM-MCA, the authors provide a detailed performance model for each CPU. These models predict the throughput and latency of instruction streams and thereby help identify potential in-core performance bottlenecks.
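To make the workflow concrete, consider a hypothetical STREAM-triad-style kernel of the kind such analyzers are typically run on (an illustrative sketch, not a kernel taken from the paper). One compiles it to assembly and points OSACA or llvm-mca at the innermost loop:

```c
/* triad.c -- a triad-style streaming kernel, a classic target for
 * in-core analysis. A possible workflow (exact flags depend on the
 * tool versions and target):
 *   cc -O3 -S triad.c            # produce triad.s
 *   osaca --arch <uarch> triad.s # OSACA port-model analysis
 *   llvm-mca -mcpu=<cpu> triad.s # LLVM-MCA analysis
 */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double s, long n)
{
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i]; /* two loads, one FMA, one store per iteration */
}
```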
The paper reports detailed throughput and latency measurements for a range of double-precision instructions across the three architectures. For instance, the Golden Cove core (Sapphire Rapids) achieves the highest throughput for vector instructions thanks to its 512-bit (AVX-512) registers, while Neoverse V2 (Grace) is strongest on scalar instructions owing to its high instruction-level parallelism (ILP). Zen 4 (Genoa), with intermediate vector width and ILP, delivers balanced performance across different workloads.
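Conceptually, throughput prediction in such in-core models distributes each instruction's micro-ops over the core's execution ports and takes the busiest port as the per-iteration bottleneck. The sketch below illustrates only this principle, with invented port counts and cycle figures rather than the paper's actual model:

```c
#include <stdio.h>

#define NPORTS 4 /* hypothetical core with four execution ports */

int main(void)
{
    /* Cycles each port is kept busy per loop iteration after the
     * analyzer has assigned micro-ops to ports (invented numbers). */
    double port_cycles[NPORTS] = {2.0, 2.0, 1.5, 3.0};

    /* The loop cannot retire iterations faster than its most heavily
     * loaded port allows, so the maximum is the throughput prediction. */
    double bottleneck = 0.0;
    for (int p = 0; p < NPORTS; ++p)
        if (port_cycles[p] > bottleneck)
            bottleneck = port_cycles[p];

    printf("predicted cycles per iteration: %.2f\n", bottleneck);
    return 0;
}
```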
Testbed and Methodology
The experiments were conducted on dedicated testbeds comprising two-socket servers for each CPU. Performance baselines such as sustained CPU clock frequency and memory bandwidth were established first. Notably, the Grace CPU sustained a steady clock frequency of 3.4 GHz across all workloads, i.e., it showed no frequency throttling under load.
The authors validated their models against 416 benchmark tests drawn from various kernel categories and optimization scenarios. The average relative prediction error (RPE) was significantly lower with OSACA than with LLVM-MCA for the Neoverse V2 and Golden Cove architectures, indicating more accurate runtime predictions.
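For reference, a relative prediction error of this kind is commonly computed as the prediction's absolute deviation from the measurement, normalized to the measurement; the paper's exact formula is not restated here, so this definition is an assumption:

```c
#include <math.h>
#include <stdio.h>

/* Relative prediction error: |predicted - measured| / measured
 * (assumed definition; see the note above). */
static double rpe(double predicted, double measured)
{
    return fabs(predicted - measured) / measured;
}

int main(void)
{
    /* Purely illustrative cycle counts, not results from the paper. */
    double measured = 4.0, osaca_pred = 4.2, mca_pred = 5.1;
    printf("OSACA RPE:    %.1f%%\n", 100.0 * rpe(osaca_pred, measured));
    printf("LLVM-MCA RPE: %.1f%%\n", 100.0 * rpe(mca_pred, measured));
    return 0;
}
```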
Write-Allocate Evasion
A notable feature examined in the paper is write-allocate (WA) evasion. A write-allocate occurs in write-back caches when a store misses: the target cache line must first be read from memory, generating extra read traffic the program never asked for. This traffic can be mitigated through special store instructions or automatic hardware detection. The paper shows that while Sapphire Rapids and Genoa provide mechanisms to reduce WA traffic, only Nvidia’s Grace CPU avoids WA transfers entirely and automatically in certain test scenarios.
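On x86, one explicit mechanism for WA evasion is the non-temporal (streaming) store, which writes to memory without first fetching the destination cache line. The sketch below illustrates the general technique with AVX intrinsics; it is our illustration of the mechanism, not code from the paper:

```c
#include <immintrin.h>

/* Copy kernel using non-temporal stores: destination cache lines are
 * never read from memory first, so no write-allocate traffic occurs.
 * Assumes n is a multiple of 4 and dst is 32-byte aligned. */
void copy_nt(double *dst, const double *src, long n)
{
    for (long i = 0; i < n; i += 4) {
        __m256d v = _mm256_loadu_pd(&src[i]);
        _mm256_stream_pd(&dst[i], v); /* streaming store bypasses the cache */
    }
    _mm_sfence(); /* order the weakly ordered streaming stores */
}
```

Grace needs no such special instructions in the scenarios mentioned above: its hardware detects the streaming store pattern and drops the write-allocate on its own.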
Numerical Results and Performance
The authors also compare theoretical double-precision peak performance (Tflop/s) with measured sustained performance. Genoa leads on paper with 8.52 Tflop/s but achieves 5.1 Tflop/s in practice, illustrating a sizable gap between peak and attainable performance. The Grace CPU, by contrast, comes much closer to its theoretical peak, indicating better utilization of its hardware capabilities.
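The theoretical peak follows from cores × clock × flops per core per cycle. The worked example below uses the 3.4 GHz Grace clock reported earlier together with publicly documented core and SIMD-pipe counts; these figures are our assumptions for illustration, not numbers quoted from the paper:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed figures for a two-socket Grace node (see the note above). */
    int    cores       = 144;  /* 2 x 72 cores */
    double clock_ghz   = 3.4;  /* sustained clock reported in the paper */
    double flops_cycle = 16.0; /* 4 x 128-bit FMA pipes x 2 DP lanes x 2 flops */

    double peak_tflops = cores * clock_ghz * flops_cycle / 1000.0;
    printf("theoretical DP peak: %.2f Tflop/s\n", peak_tflops); /* ~7.83 */
    return 0;
}
```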
Implications and Future Directions
The findings show that the Neoverse V2 architecture (Grace) offers high ILP and efficiency, making it well suited for the scalar-heavy, non-vectorized workloads prevalent in data centers and AI applications. Golden Cove’s wide registers benefit vectorized code, an advantage for classic high-performance computing tasks. Zen 4’s balanced design provides versatility across varied workloads.
This research has significant practical implications for optimizing high-performance computing (HPC) applications by exploiting each architecture's strengths. The extended OSACA tool presented in the paper enables more accurate in-core performance modeling, which can feed into broader node-level performance models such as the Execution-Cache-Memory (ECM) model.
Future research could integrate these in-core models into node-level performance models to predict the behavior of complex, real-world applications on large HPC clusters. Such integration is crucial for orchestrating computational resources effectively and improving both performance and efficiency in data centers and HPC environments.
In summary, this paper provides a comprehensive comparison and in-core performance model of three state-of-the-art CPUs, revealing their architectural strengths and practical performance implications. The methodology and insights offered pave the way for further advancements in performance modeling and optimization in the field of high-performance computing.