Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Published 5 May 2026 in cs.DC and cs.AR | (2605.04178v1)

Abstract: Rapidly evolving GPU architectures featuring complex memory hierarchies, matrix units, and varied precision formats continue to widen the gap between theoretical peaks and achievable performance. We design and develop analytical performance models for NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A) grounded in systematic microbenchmark characterization. For Blackwell, the model captures Tensor Memory (TMEM), asynchronous bulk copy (TMA), and 5th-generation tensor cores; for CDNA3, the model captures Infinity Cache hierarchy, VGPR constraints, and occupancy. Validation yields 1.31% MAE on B200 (21 kernels) and 0.09% on MI300A (27 kernels), while naive roofline baselines exceed 95% error on the same kernels. We further validate the models using Rodinia 3.1 and SPEChpc 2021 Tiny. The models are updated with HBM bandwidth, capacity, and cache parameters and applied to H200 (Hopper) and MI250X (CDNA2), indicating that no major restructuring of the models is needed. All models and benchmarks will be released as open-source upon acceptance.

Summary

  • The paper presents microbenchmark-based models achieving 1.31% MAE for Blackwell and 0.09% MAE for MI300A compared to over 95% error by roofline models.
  • The methodology leverages stage-centric and wavefront-centric approaches to accurately model complex GPU pipelines, memory hierarchies, and synchronization overhead.
  • The models are portable across architectures through parameter tuning, enabling rapid adaptation and optimization for emerging GPU designs.

Microbenchmark-Driven Analytical Performance Modeling for NVIDIA Blackwell and AMD CDNA3 GPUs

Introduction

The paper "Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures" (2605.04178) presents a methodology and a set of validated analytical models for predicting kernel execution time on recent NVIDIA (Blackwell B200) and AMD (CDNA3 MI300A) GPU architectures. The work addresses the growing architectural complexity of state-of-the-art GPUs (multi-level memory hierarchies, dedicated matrix execution units, and multiple precision formats), which leaves traditional approaches such as roofline analysis unable to predict practical kernel performance accurately. Because model parameters are grounded in empirical microbenchmark results rather than vendor datasheet peaks, the models reach mean absolute errors (MAE) of 1.31% on the Blackwell B200 and 0.09% on the MI300A, compared to over 95% for roofline baselines.

Prior GPU performance models, especially the roofline model and its cache-aware extensions, fail to capture critical pipeline serialization, memory hierarchy non-uniformity, and explicit/implicit overlap of compute and data movement prevalent in modern GPU architectures. Earlier approaches such as the MWP/CWP framework and simulation-based platforms (e.g., Accel-Sim) either lack parameter transparency for new devices or are cost-prohibitive at large scale. Analytical and microarchitecture characterization efforts have not previously closed the cross-architecture gap or quantified error across major vendors. In contrast, these new models are explicitly tied to microbenchmarks, supporting rapid adaptation to new architectures without re-deriving the model structure. Cross-vendor validation and parameter-only portability represent distinctive advances over prior work.

Model Formulation

Blackwell B200: Stage-Centric Model

The Blackwell model accounts for:

  • Multi-stage pipeline involving Tensor Memory Accelerator (TMA), TMEM (256 KB/SM), fifth-generation tensor cores, and explicit synchronization.
  • TMEM's pivotal role as a hardware-managed accumulator and buffer for tensor operations. The performance-critical path is modeled by chaining stage latencies and overlapped data-movement, as opposed to selecting a roofline-imposed minimum.

Key model features:

  • Analytical execution time as

$$T_{\mathrm{exec}} = \max(T_{\mathrm{compute}}, T_{\mathrm{memory}}) + T_{\mathrm{overhead}}$$

  • All pipeline stages—TMA bulk-copy, decompression, TMEM read/write, tensor core computation, synchronization—are parameterized using microbenchmarks.
  • The model supports both single-SM and 2-SM cooperative execution, which is critical given the Blackwell execution pairing for certain workloads.
  • Launch and synchronization overheads, critical for short kernels, are explicitly separated from compute and I/O paths.
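The execution-time formula above can be illustrated with a minimal Python sketch. The `StageTimes` class, its field names, and all numbers are hypothetical placeholders rather than the paper's code or measurements; the sketch only shows how an overhead term kept outside the overlapped compute/memory paths comes to dominate short kernels.

```python
from dataclasses import dataclass


@dataclass
class StageTimes:
    """Per-kernel times in microseconds (hypothetical fields, not from the paper)."""
    compute_us: float   # time on the compute path (e.g. tensor-core work)
    memory_us: float    # time on the data-movement path (e.g. bulk copies)
    overhead_us: float  # launch + synchronization, assumed not overlapped


def exec_time_us(t: StageTimes) -> float:
    """T_exec = max(T_compute, T_memory) + T_overhead."""
    return max(t.compute_us, t.memory_us) + t.overhead_us


def overhead_fraction(t: StageTimes) -> float:
    """Share of the predicted time spent in launch/synchronization overhead."""
    return t.overhead_us / exec_time_us(t)


# Illustrative values only; they are not measurements.
long_kernel = StageTimes(compute_us=800.0, memory_us=500.0, overhead_us=6.0)
short_kernel = StageTimes(compute_us=3.0, memory_us=2.0, overhead_us=6.0)

print(exec_time_us(long_kernel))   # 806.0
print(exec_time_us(short_kernel))  # 9.0
print(overhead_fraction(short_kernel) > overhead_fraction(long_kernel))  # True
```

With these made-up inputs the overhead is under 1% of the long kernel but two thirds of the short one, which is why the term is modeled separately.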

AMD CDNA3 MI300A: Wavefront-Centric Model

For MI300A, the modeling paradigm is fundamentally altered due to:

  • Occupancy-driven, implicitly pipelined execution and a cache hierarchy built around a large (256 MB) Infinity Cache, enabling intermediate regimes between fully cache- and HBM-bounded behavior.
  • MFMA (matrix fused multiply-accumulate) compute bound by VGPR residency and explicit dependency on wavefront occupancy.

Key features:

  • Memory time integrates the hit rates at each cache level, as measured by microbenchmarks, and accounts for dynamic working set sizes.
  • Effective overlap is determined by the minimum of active wavefronts, memory, or compute pipeline resources.
  • Tile size selection and kernel fusion are directly incorporated, and cases such as multi-kernel interference and multi-GPU scaling are handled by empirically fitted terms.
  • The model's hierarchical nature enables segmental analysis and cross-comparison with different architectures.
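One plausible reading of the hit-rate-weighted memory time is sketched below in Python. The summary does not give the paper's exact formula, so the function `memory_time_s`, the assumption that each level serves a fixed fraction of the traffic at its own sustained bandwidth, and every number are illustrative assumptions.

```python
def memory_time_s(bytes_moved, levels):
    """Estimate memory time for traffic split across a cache hierarchy.

    levels: list of (fraction_served, bandwidth_bytes_per_s), ordered from the
    nearest cache to HBM. Fractions are shares of total traffic and must sum
    to 1. This is an assumed formulation, not the paper's.
    """
    total_fraction = sum(frac for frac, _ in levels)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("traffic fractions must sum to 1")
    return sum(bytes_moved * frac / bw for frac, bw in levels)


# Hypothetical hierarchy: near cache, large last-level cache, HBM.
cache_friendly = [(0.5, 4.0e12), (0.3, 2.0e12), (0.2, 1.0e12)]
hbm_bound = [(0.0, 4.0e12), (0.0, 2.0e12), (1.0, 1.0e12)]

t_mixed = memory_time_s(1.0e9, cache_friendly)
t_hbm = memory_time_s(1.0e9, hbm_bound)
print(t_mixed < t_hbm)  # True
```

Shifting the fractions toward the outer levels moves the estimate continuously between cache-bound and HBM-bound behavior, which is the intermediate regime the text describes.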

Portability

The same formal model applies to NVIDIA Hopper H200 and AMD MI250X, with only parameter file changes—there is no need to re-derive execution-time formulas unless a structural paradigm shift arises in the microarchitecture.
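Parameter-only portability can be pictured as one prediction function fed by interchangeable parameter sets. The Python sketch below is a simplified stand-in under stated assumptions: `DeviceParams`, `predict_s`, the device names, and all values are hypothetical, and the formula is a reduced form of the execution-time expression, not the paper's full model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceParams:
    """Microbenchmark-derived parameters for one device (hypothetical schema)."""
    name: str
    sustained_flops: float    # measured FLOP/s, not the datasheet peak
    sustained_bw: float       # measured bytes/s
    launch_overhead_s: float  # measured launch + sync cost in seconds


def predict_s(flops: float, bytes_moved: float, p: DeviceParams) -> float:
    """Same formula for every device; only the parameter set changes."""
    t_compute = flops / p.sustained_flops
    t_memory = bytes_moved / p.sustained_bw
    return max(t_compute, t_memory) + p.launch_overhead_s


# Placeholder parameter sets; the values are invented for illustration.
DEVICE_A = DeviceParams("device-a", 1.0e15, 4.0e12, 5.0e-6)
DEVICE_B = DeviceParams("device-b", 5.0e14, 2.0e12, 8.0e-6)

for params in (DEVICE_A, DEVICE_B):
    print(params.name, predict_s(2.0e12, 4.0e9, params))
```

Porting to another device in this sketch means adding another `DeviceParams` instance; `predict_s` itself is untouched.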

Model Validation and Results

Microbenchmark and HPC Benchmark Evaluation

Validation used custom microbenchmarks capturing all hardware-exposed bandwidths, latencies, and occupancies, along with HPC proxy applications from Rodinia 3.1 and SPEChpc 2021 Tiny. Key findings include:

  • B200 Model: Achieves 1.31% MAE (21 kernels), with class-wise error for vector operations (8%) and FP16/FP8 tensor GEMM (5.4% with default bandwidth limits).
  • MI300A Model: Calibrated model achieves 0.09% MAE (27 kernels); uncalibrated remains under 8%. For regular workloads, MAE is consistently below 1%. For irregular access or short-duration kernels, error rises but remains far below roofline-baseline errors.
  • Naive roofline model: Yields error consistently above 95% for all tested architectures and workloads, severely underestimating kernel times due to failure to model pipeline serialization, realistic memory-bandwidth utilization, and occupancy.
  • Cross-platform: Porting parameters only (no structural changes) achieves 4.7% (MI250X) to 9.6% (H200) MAE; an MAE above 30-40% indicates that the target platform needs recharacterization.
  • SPEChpc first-principles vs profiler-derived inputs: Profiler-based characterization yields low MAE; algorithm-derived modeling struggles (errors up to 99%) due to divergence between source-level and actual compiled GPU kernel workload, often by three orders of magnitude.
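The error figures above can be reproduced in spirit with a short Python sketch. The summary does not define MAE precisely, so the sketch assumes it is the mean absolute percentage error relative to measured kernel time; the function names, the 30% threshold (the lower end of the 30-40% band), and all timings are illustrative assumptions.

```python
def mae_percent(predicted, measured):
    """Mean absolute error as a percentage of measured time (assumed definition)."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    rel_errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(rel_errors) / len(rel_errors)


def needs_recharacterization(mae, threshold_percent=30.0):
    """Flag a ported parameter set whose error exceeds the assumed threshold."""
    return mae > threshold_percent


# Invented kernel times in microseconds, for illustration only.
measured = [100.0, 200.0, 50.0]
model_pred = [101.0, 198.0, 50.0]
roofline_pred = [4.0, 9.0, 2.0]  # a baseline that badly underestimates time

print(f"{mae_percent(model_pred, measured):.2f}")     # 0.67
print(f"{mae_percent(roofline_pred, measured):.2f}")  # 95.83
print(needs_recharacterization(mae_percent(roofline_pred, measured)))  # True
```

A baseline that predicts only a few percent of the measured time lands near 100% error under this definition, which matches the direction of the roofline underestimation described above.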

Numerical and Structural Highlights

  • The Blackwell model's structured TMEM management enables high-fidelity modeling for matrix workloads (e.g., LLM inference/GEMM), which is unattainable with prior approaches.
  • For MI300A, model fidelity is strongly tied to an accurate characterization of Infinity Cache hit rates and VGPR occupancy. VGPR-driven tile selection, occupancy, and hierarchy-sensitive bandwidth all dynamically affect performance in ways not capturable by single-axis roofline or linear models.
  • The explicit parameterization ensures straightforward extensibility for new hardware; model formulas remain unaltered unless fundamental execution resources (e.g., new accumulation paths) are added in new GPU generations.

Discussion and Implications

Practical Utility

  • These models enable precise procurement-time performance prediction and autotuning at a granularity previously available only through simulation-based analysis.
  • Model error is directly traceable to workload characterization error (FLOP/byte misestimation) rather than shortcomings in architectural modeling.
  • Experts can use these models to evaluate design tradeoffs in co-design, plan for kernel fusion, and devise optimal tile sizes or occupancy tuning strategies, with immediate adaptation to new hardware generations.

Limitations

  • Applicability is limited in the presence of highly irregular memory accesses, aggressive kernel fusion with unpredictable control flow, or ultra-short kernels dominated by synchronization and launch overhead.
  • For maximal fidelity in non-native platforms (e.g., running MI300A segment files on H200), platform-specific recharacterization is required.

Theoretical Implications and Future Directions

  • The bottleneck in analytical performance modeling is increasingly shifting from model sophistication to workload characterization accuracy, especially in directive-offload (OpenACC/OpenMP) environments where kernel structure is compiler-dependent.
  • Future architectures with additional implicit scheduling, new memory subsystems, or further hardware heterogeneity may necessitate hybrid analytical/ML models for residual error correction.
  • Extension to power/energy modeling and additional accelerator families (e.g., NVIDIA Rubin, AMD CDNA4, Intel Gaudi3) is immediate due to the parameter-driven workflow.

Conclusion

The paper presents validated, microbenchmark-driven analytical models for execution time prediction on NVIDIA Blackwell and AMD CDNA3 GPUs, achieving order-of-magnitude improvements in accuracy over roofline baselines. The models are structurally modular, cross-vendor portable, and directly parameterized by empirical measurement. Their results identify TMEM and pipelined execution as the critical factors on NVIDIA Blackwell, and cache/occupancy coordination as the critical factor on AMD MI300A. These findings have direct implications for optimization, auto-tuning, and fair benchmarking of contemporary HPC and AI workloads. The approach offers a reference point for architecture-specific analytical modeling and provides guidance for co-design and rapid adoption on emerging accelerator generations.
