Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks (2507.10789v1)

Published 14 Jul 2025 in cs.DC

Abstract: The rapid pace of scientific research creates a growing need for compute power, which is partly being met by GPUs. This paper presents a microarchitectural analysis of the modern NVIDIA Blackwell architecture, studying GPU performance features through carefully designed microbenchmarks. We unveil key subsystems, including the memory hierarchy, SM execution pipelines, and the SM sub-core units, among them the 5th-generation tensor cores supporting FP4 and FP6 precisions. To understand the key features of the NVIDIA GPU, we study latency, throughput, cache behavior, and scheduling details, revealing subtle tuning metrics in the design of Blackwell. To develop a comprehensive analysis, we compare the Blackwell architecture with the previous Hopper architecture using the GeForce RTX 5080 and H100 PCIe, respectively. We evaluate and compare results, presenting both generational improvements and performance regressions. Additionally, we investigate the role of power efficiency and energy consumption under varied workloads. Our findings provide actionable insights for application developers, compiler writers, and performance engineers seeking to optimize workloads on Blackwell-based platforms, and contribute new data to the growing research on GPU architectures.

Summary

  • The paper presents a detailed microarchitectural evaluation using custom PTX and CUDA benchmarks to highlight Blackwell's unified INT32/FP32 execution units and new low-precision tensor core support.
  • The study contrasts Blackwell with datacenter-oriented Hopper, revealing trade-offs in memory hierarchy design, execution unit configuration, and power efficiency for varied workloads.
  • The paper offers actionable insights for optimizing kernel performance and compiler strategies by dissecting instruction pipelines, memory access patterns, and warp scheduling.

Microarchitectural Analysis of NVIDIA Blackwell: Insights from Microbenchmarking

This paper presents a comprehensive microarchitectural evaluation of NVIDIA's Blackwell GPU architecture, with a focus on the consumer-grade GB203 chip (GeForce RTX 5080), and a comparative analysis against the datacenter-oriented Hopper (GH100, H100 PCIe). The paper employs a suite of custom microbenchmarks, written in PTX and CUDA, to dissect the execution pipelines, memory hierarchy, and tensor core subsystems, providing actionable data for developers, compiler writers, and performance engineers.

Architectural Overview and Methodology

The analysis is grounded in a direct comparison between Blackwell and Hopper, two architectures that, while sharing a similar CUDA programming model, diverge in execution unit configuration, memory hierarchy, and target workloads. Blackwell is positioned as a power-efficient, consumer-focused GPU, while Hopper is optimized for large-scale AI training and scientific computing.

Key architectural differences include:

  • Execution Units: Blackwell introduces unified INT32/FP32 execution units, reducing idle cycles in mixed workloads, while Hopper maintains separate pipelines.
  • Tensor Cores: Blackwell's 5th-generation tensor cores add native support for FP4 and FP6, extending beyond Hopper's FP8 capabilities.
  • Memory Hierarchy: Blackwell features a smaller L1/shared memory per SM but compensates with a larger, unified L2 cache. Hopper uses partitioned L2 and HBM2e memory for higher bandwidth and lower latency.

The microbenchmarks are designed to defeat compiler optimization: kernels are written in PTX and loaded at runtime, and the generated SASS is inspected to confirm that the intended instructions are actually issued. A minimal sketch of this clock-based timing style follows.
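
The sketch below illustrates the general style of such a latency probe, not the authors' exact harness; the kernel name, constants, and iteration count are illustrative. A single thread runs a serially dependent FMA chain bracketed by reads of the per-SM cycle counter, so cycles per operation reflect raw pipeline latency.

```cuda
// Illustrative latency probe (not the authors' exact harness): a single
// thread runs a serially dependent FMA chain bracketed by reads of the
// per-SM cycle counter %clock.
#include <cstdio>

__global__ void ffma_latency(float *out, int iters) {
    float x = 1.0f;
    unsigned start, stop;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start));
    for (int i = 0; i < iters; ++i)
        x = fmaf(x, 0.9999f, 0.0001f);    // each FMA depends on the previous
    asm volatile("mov.u32 %0, %%clock;" : "=r"(stop));
    *out = x;                             // keep the chain live
    printf("FFMA: %.2f cycles/op\n", (float)(stop - start) / iters);
}

int main() {
    float *d;
    cudaMalloc(&d, sizeof(float));
    ffma_latency<<<1, 1>>>(d, 4096);      // one thread: pure dependent latency
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```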

Compute Pipeline and Execution Behavior

INT32/FP32 and FP64 Units

  • Unified INT32/FP32 Cores: Blackwell's unified execution units demonstrate lower latency in mixed workloads than Hopper, whose separate pipelines sit partially idle when instruction types are imbalanced; Hopper retains a slight advantage in pure INT32 or pure FP32 workloads (see the interleaved-chain sketch after this list).
  • FP64 Execution: Blackwell's reduction to two FP64 units per SM (down from Hopper's 64) yields much higher latency for double-precision workloads, a clear trade-off favoring low-precision compute and a critical consideration for HPC applications requiring sustained FP64 throughput.
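
As a companion probe (our illustration, not the paper's kernel), interleaving an independent integer chain with the float chain shows how unified versus split pipelines behave: on split pipelines the two chains can overlap, while unified units make them contend for issue slots.

```cuda
// Illustrative mixed-issue probe (not the paper's kernel): one integer and
// one float dependency chain interleaved in a single warp. Compare cycles
// per iteration against the pure-FP32 chain from the methodology sketch.
#include <cstdio>

__global__ void mixed_issue(int *iout, float *fout, int iters) {
    int   a = threadIdx.x;
    float b = 1.0f;
    unsigned start, stop;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start));
    for (int i = 0; i < iters; ++i) {
        a = a * 3 + 1;                    // INT32 chain
        b = fmaf(b, 0.9999f, 0.0001f);    // FP32 chain
    }
    asm volatile("mov.u32 %0, %%clock;" : "=r"(stop));
    *iout = a; *fout = b;                 // keep both chains live
    if (threadIdx.x == 0)
        printf("mixed: %.2f cycles/iter\n", (float)(stop - start) / iters);
}

int main() {
    int *di; float *df;
    cudaMalloc(&di, sizeof(int));
    cudaMalloc(&df, sizeof(float));
    mixed_issue<<<1, 32>>>(di, df, 4096); // one warp
    cudaDeviceSynchronize();
    return 0;
}
```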

Warp Scheduling

  • Latency Hiding: Hopper exhibits superior latency hiding for short dependency chains, attributed to deeper instruction buffering and more aggressive warp scheduling. Blackwell, in contrast, provides smoother throughput scaling as instruction-level parallelism (ILP) increases, favoring regular, high-ILP kernels (see the ILP probe below).
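
One way to visualize this scaling (again an illustrative sketch, not the paper's code) is to vary the number of independent accumulator chains per thread and watch where cycles per iteration stop improving:

```cuda
// Illustrative ILP probe (not the paper's code): ILP independent FMA
// accumulators per thread. Cycles per iteration should fall as ILP rises,
// flattening once the FP32 pipeline saturates; the knee differs by
// architecture.
#include <cstdio>

template <int ILP>
__global__ void ilp_probe(float *out, int iters) {
    float acc[ILP];
    for (int j = 0; j < ILP; ++j) acc[j] = 1.0f + j;
    unsigned start, stop;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start));
    for (int i = 0; i < iters; ++i) {
        #pragma unroll
        for (int j = 0; j < ILP; ++j)     // ILP independent chains
            acc[j] = fmaf(acc[j], 0.9999f, 0.0001f);
    }
    asm volatile("mov.u32 %0, %%clock;" : "=r"(stop));
    float s = 0.f;
    for (int j = 0; j < ILP; ++j) s += acc[j];
    *out = s;                             // keep all chains live
    if (threadIdx.x == 0)
        printf("ILP=%d: %.2f cycles/iter\n", ILP,
               (float)(stop - start) / iters);
}

int main() {
    float *d;
    cudaMalloc(&d, sizeof(float));
    ilp_probe<1><<<1, 32>>>(d, 4096);
    ilp_probe<8><<<1, 32>>>(d, 4096);
    cudaDeviceSynchronize();
    return 0;
}
```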

Tensor Core Microarchitecture and Low-Precision Compute

Instruction Set and Datatype Support

  • FP4/FP6 Support: Blackwell's 5th-gen tensor cores natively support FP4 and FP6, exposed through new PTX instructions (e.g., mma.sync.aligned.kind::f8f6f4). Hopper lacks these formats, supporting up to FP8.
  • Instruction Mapping: The translation of PTX to SASS instructions (OMMA, QMMA, HMMA) is validated, though Blackwell's software stack is still maturing toward full low-precision support; the sketch below shows the general inline-PTX MMA pattern.
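
For orientation, the sketch below shows the warp-wide inline-PTX MMA pattern using the long-established FP16 m16n8k16 shape (sm_80 and later); the FP4/FP6 paths discussed in the paper follow the same pattern with the kind::f8f6f4 instruction variants and packed sub-byte fragments, whose exact operand layouts are specified in the PTX ISA manual.

```cuda
// Orientation sketch: the warp-wide inline-PTX MMA pattern, shown with the
// long-established FP16 m16n8k16 shape (compile with -arch=sm_80 or later).
// Blackwell's FP4/FP6 variants follow the same pattern with kind::f8f6f4
// instructions and packed sub-byte fragments.
__global__ void mma_probe(float *d) {
    unsigned a[4] = {0, 0, 0, 0};         // A fragment: 8 packed halves/thread
    unsigned b[2] = {0, 0};               // B fragment: 4 packed halves/thread
    float    c[4] = {0.f, 0.f, 0.f, 0.f}; // accumulator fragment
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));
    d[threadIdx.x] = c[0];                // keep the result live
}

int main() {
    float *d;
    cudaMalloc(&d, 32 * sizeof(float));
    mma_probe<<<1, 32>>>(d);              // mma.sync is a warp-wide operation
    cudaDeviceSynchronize();
    return 0;
}
```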

Performance and Power Trade-offs

  • Throughput and Latency: Blackwell achieves higher throughput and lower latency for low-precision (FP4/FP6/FP8) tensor core operations, especially at high ILP and low warp counts. Hopper requires more concurrent warps to saturate execution units, reflecting a design optimized for bulk concurrency.
  • Power Efficiency: Blackwell demonstrates significantly lower power consumption for FP4 (16.75W) and FP6 (39–46W) workloads compared to Hopper's FP8 (55W), highlighting the architectural efficiency gains for inference and quantized workloads.

Memory Hierarchy: Latency, Bandwidth, and Scaling

Shared Memory and L1 Cache

  • Capacity and Latency: Hopper's larger shared memory (up to 256 KB/SM) and L1 cache provide better scalability under high warp pressure and strided access patterns. Blackwell, with 128 KB/SM, offers lower latency at low warp counts but is more sensitive to bank conflicts and partition saturation.
  • Warp Scaling: Both architectures show increased latency as warp counts grow, but Hopper's design degrades more gracefully under heavy load. The sketch below probes the bank-conflict sensitivity noted above.
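
A minimal bank-conflict probe, assuming the standard 32-bank, 4-byte-word shared memory layout (kernel name and constants are illustrative): stride 1 spreads a warp's lanes across all banks, while stride 32 lands every lane on the same bank, and the cycle delta exposes the serialization cost.

```cuda
// Illustrative bank-conflict probe (assumes the standard 32-bank, 4-byte
// word layout): with stride 1 a warp's lanes hit 32 distinct banks; with
// stride 32 every lane hits the same bank and the accesses serialize.
#include <cstdio>

__global__ void smem_stride(int stride, int iters) {
    __shared__ float buf[32 * 32];
    for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
        buf[i] = (float)i;
    __syncthreads();
    int idx = (threadIdx.x * stride) % (32 * 32);
    float v = 0.f;
    unsigned start, stop;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start));
    for (int i = 0; i < iters; ++i) {
        v += buf[idx];
        idx = (idx + 1) % (32 * 32);      // shift all lanes; bank pattern holds
    }
    asm volatile("mov.u32 %0, %%clock;" : "=r"(stop));
    if (v < 0.f) buf[0] = v;              // keep the loads live
    if (threadIdx.x == 0)
        printf("stride %d: %.2f cycles/load\n", stride,
               (float)(stop - start) / iters);
}

int main() {
    smem_stride<<<1, 32>>>(1, 1024);      // conflict-free
    smem_stride<<<1, 32>>>(32, 1024);     // 32-way conflicts
    cudaDeviceSynchronize();
    return 0;
}
```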

L2 Cache and Global Memory

  • Partitioning vs. Unification: Hopper's partitioned L2 cache delivers lower latency at low concurrency, while Blackwell's unified L2 (65 MB) maintains performance under extreme load, favoring bandwidth-bound applications.
  • Bandwidth: Hopper achieves higher peak read bandwidth (15.8 TB/s) and lower global memory latency, owing to HBM2e. Blackwell's GDDR7 trails on raw bandwidth and latency, but the architecture is tuned for energy efficiency and mixed workloads (see the pointer-chase sketch below).
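
Latencies of this kind are conventionally measured with a pointer chase, where each load's address depends on the previous load so the latency cannot be hidden; a minimal sketch (the fixed-stride ring initialization is illustrative; in practice a random permutation is needed to defeat prefetching):

```cuda
// Classic pointer-chase latency sketch. Each load's address comes from the
// previous load, so memory latency cannot be hidden; the footprint selects
// the cache level being measured. The fixed-stride ring is illustrative; a
// random permutation is needed to defeat prefetching.
#include <cstdio>

__global__ void chase(const unsigned *next, int hops,
                      unsigned *out, float *cycles) {
    unsigned p = 0;
    unsigned long long start = clock64();
    for (int i = 0; i < hops; ++i)
        p = next[p];                      // serially dependent loads
    unsigned long long stop = clock64();
    *out = p;                             // keep the chain live
    *cycles = (float)(stop - start) / hops;
}

int main() {
    const int n = 1 << 26;                // 256 MiB footprint: beyond either L2
    unsigned *h = new unsigned[n];
    for (int i = 0; i < n; ++i) h[i] = (i + 97) % n;  // full-cycle ring walk
    unsigned *d, *o; float *c;
    cudaMalloc(&d, n * sizeof(unsigned));
    cudaMalloc(&o, sizeof(unsigned));
    cudaMalloc(&c, sizeof(float));
    cudaMemcpy(d, h, n * sizeof(unsigned), cudaMemcpyHostToDevice);
    chase<<<1, 1>>>(d, 100000, o, c);     // single thread: pure latency
    float cyc;
    cudaMemcpy(&cyc, c, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%.1f cycles/load\n", cyc);
    delete[] h;
    return 0;
}
```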

Application-Level Case Studies

Dense GEMM

  • Throughput: Hopper consistently outperforms Blackwell in FP8 GEMM throughput (up to 0.887 TFLOP/s vs. 0.233 TFLOP/s for 8192³ matrices), with more stable kernel selection and lower runtime variability.
  • Power: Blackwell exhibits higher power draw and lower performance-per-watt for large GEMM kernels, despite theoretical improvements in low-precision compute (an event-based timing sketch follows).
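
For context, GEMM throughput is typically measured by timing a library call with CUDA events and converting elapsed time to a FLOP rate. The sketch below uses FP16 via cublasGemmEx as an illustration; the paper's FP8 runs require the more verbose cublasLt API, and a real harness would initialize the buffers, warm up, and average over repetitions.

```cuda
// Illustrative GEMM timing harness: CUDA events around a cublasGemmEx call,
// converted to TFLOP/s. FP16 inputs are shown; FP8 requires cublasLt. A real
// harness would initialize the buffers, warm up, and average several runs.
// Link with -lcublas.
#include <cstdio>
#include <cuda_fp16.h>
#include <cublas_v2.h>

int main() {
    const int n = 8192;                   // the paper's large square case
    __half *A, *B; float *C;
    cudaMalloc(&A, sizeof(__half) * n * n);
    cudaMalloc(&B, sizeof(__half) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);
    cublasHandle_t h; cublasCreate(&h);
    float alpha = 1.f, beta = 0.f;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                 A, CUDA_R_16F, n, B, CUDA_R_16F, n, &beta,
                 C, CUDA_R_32F, n, CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.3f TFLOP/s\n", 2.0 * n * n * n / (ms * 1e-3) / 1e12);
    cublasDestroy(h);
    return 0;
}
```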

Transformer Inference

  • Power Scaling: Blackwell demonstrates better power scaling as precision is reduced (down to 45 W in FP8), while Hopper maintains a flat power profile across precisions. This suggests Blackwell's suitability for energy-constrained inference scenarios, provided software and kernel selection are optimized (board power is typically sampled via NVML, as sketched below).
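
Board power of this kind is commonly sampled through NVML; a minimal sampling loop to run alongside the workload and average (the 100 ms interval and sample count are illustrative; link with -lnvidia-ml):

```cuda
// Minimal board-power sampler via NVML (sample alongside the workload and
// average). The 100 ms interval and sample count are illustrative.
// Compile with: nvcc power.cu -lnvidia-ml
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 100; ++i) {       // ~10 s of samples
        unsigned mw = 0;
        nvmlDeviceGetPowerUsage(dev, &mw); // board power in milliwatts
        printf("%u.%03u W\n", mw / 1000, mw % 1000);
        usleep(100 * 1000);               // 100 ms
    }
    nvmlShutdown();
    return 0;
}
```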

Implications and Future Directions

The findings have several practical implications:

  • Kernel Optimization: Developers targeting Blackwell should prioritize high-ILP, low-precision workloads to exploit the architecture's strengths. Mixed INT32/FP32 pipelines and FP4/FP6 tensor core support can be leveraged for efficient inference and quantized models.
  • Precision Trade-offs: The reduced FP64 capability in Blackwell necessitates careful algorithmic design for HPC workloads, potentially requiring mixed-precision strategies or offloading to datacenter-class GPUs.
  • Memory Tuning: Memory-bound kernels must account for the smaller L1/shared memory and unified L2 cache in Blackwell, with attention to access patterns and warp scheduling to avoid contention.
  • Software Stack Maturity: The observed performance regressions in GEMM and kernel selection on Blackwell highlight the need for further compiler and library optimization to fully realize the hardware's potential.

Looking forward, the architectural trends in Blackwell—favoring low-precision, high-throughput compute, and energy efficiency—are likely to influence future GPU designs, especially as AI inference and edge deployment become more prevalent. The microbenchmarking methodology established in this work provides a robust framework for evaluating and tuning emerging architectures, and the detailed empirical data will inform both hardware and software co-design in the evolving GPU landscape.