- The paper presents a detailed microarchitectural evaluation using custom PTX and CUDA benchmarks to highlight Blackwell's unified INT32/FP32 execution units and new low-precision tensor core support.
- The study contrasts Blackwell with datacenter-oriented Hopper, revealing trade-offs in memory hierarchy design, execution unit configuration, and power efficiency for varied workloads.
- The paper offers actionable insights for optimizing kernel performance and compiler strategies by dissecting instruction pipelines, memory access patterns, and warp scheduling.
Microarchitectural Analysis of NVIDIA Blackwell: Insights from Microbenchmarking
This paper presents a comprehensive microarchitectural evaluation of NVIDIA's Blackwell GPU architecture, focusing on a consumer-grade part (the GeForce RTX 5080) and comparing it against the datacenter-oriented Hopper (GH100, H100 PCIe). The study employs a suite of custom microbenchmarks, written in PTX and CUDA, to dissect the execution pipelines, memory hierarchy, and tensor core subsystems, providing actionable data for developers, compiler writers, and performance engineers.
Architectural Overview and Methodology
The analysis is grounded in a direct comparison between Blackwell and Hopper, two architectures that, while sharing a similar CUDA programming model, diverge in execution unit configuration, memory hierarchy, and target workloads. Blackwell is positioned as a power-efficient, consumer-focused GPU, while Hopper is optimized for large-scale AI training and scientific computing.
Key architectural differences include:
- Execution Units: Blackwell introduces unified INT32/FP32 execution units, reducing idle cycles in mixed workloads, while Hopper maintains separate pipelines.
- Tensor Cores: Blackwell's 5th-generation tensor cores add native support for FP4 and FP6, extending beyond Hopper's FP8 capabilities.
- Memory Hierarchy: Blackwell features a smaller L1/shared memory per SM but compensates with a larger, unified L2 cache. Hopper uses partitioned L2 and HBM2e memory for higher bandwidth and lower latency.
The microbenchmarks are designed to sidestep compiler optimizations: hand-written PTX kernels are loaded and executed at runtime, and the generated SASS is inspected to verify instruction fidelity.
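To make the methodology concrete, below is a minimal sketch, not the paper's actual harness, of how a hand-written PTX kernel can be loaded and launched at runtime through the CUDA driver API so that nvcc never sees, reorders, or eliminates the timed instruction sequence. The kernel name, the dependent-add chain, and the %clock-based timing are illustrative assumptions.

```cuda
// Minimal sketch of the runtime-PTX methodology (illustrative only; the kernel
// name, instruction chain, and timing scheme are assumptions, not the paper's code).
#include <cuda.h>
#include <cstdio>

// Hand-written PTX: times a short chain of dependent INT32 adds with %clock.
static const char *kPtx = R"(
.version 7.8
.target sm_80
.address_size 64
.visible .entry dep_add_latency(.param .u64 out)
{
    .reg .u64 %rd<2>;
    .reg .u32 %r<8>;
    ld.param.u64        %rd0, [out];
    cvta.to.global.u64  %rd1, %rd0;
    mov.u32             %r0, %clock;        // start timestamp
    add.s32             %r1, %r0, 1;        // dependent add chain
    add.s32             %r2, %r1, 1;
    add.s32             %r3, %r2, 1;
    add.s32             %r4, %r3, 1;
    mov.u32             %r5, %clock;        // end timestamp
    sub.u32             %r6, %r5, %r0;
    st.global.u32       [%rd1],   %r6;      // elapsed cycles
    st.global.u32       [%rd1+4], %r4;      // keep the chain observable
    ret;
}
)";

int main() {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn; CUdeviceptr dOut;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, kPtx);                     // JIT the PTX string as written
    cuModuleGetFunction(&fn, mod, "dep_add_latency");
    cuMemAlloc(&dOut, 2 * sizeof(unsigned));
    void *args[] = { &dOut };
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();
    unsigned cycles = 0;
    cuMemcpyDtoH(&cycles, dOut, sizeof(unsigned));
    printf("dependent-add chain: %u cycles\n", cycles);
    cuMemFree(dOut); cuModuleUnload(mod); cuCtxDestroy(ctx);
    return 0;
}
```

The SASS for such a kernel can then be checked offline, for example by assembling the same PTX with ptxas and disassembling the resulting cubin with nvdisasm, to confirm that the timed sequence survives intact.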
Compute Pipeline and Execution Behavior
INT32/FP32 and FP64 Units
- Unified INT32/FP32 Cores: Blackwell's unified execution units show lower latency in mixed INT32/FP32 workloads than Hopper, whose separate pipelines sit partially idle when the instruction mix is imbalanced; Hopper retains a slight advantage in pure INT32 or pure FP32 streams (a sketch of a mixed-issue timing loop follows this list).
- FP64 Execution: Blackwell's reduction to two FP64 units per SM (from Hopper's 64) results in higher latency for double-precision workloads, indicating a clear trade-off favoring low-precision compute. This is a critical consideration for HPC applications requiring sustained FP64 throughput.
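Below is a minimal sketch of the mixed-issue timing loop referenced above. It is illustrative (the kernel name and constants are assumptions, not the paper's code), but it captures the measurement idea: compare the cycle cost of an interleaved INT32/FP32 stream against pure-INT32 and pure-FP32 variants of the same loop.

```cuda
// Illustrative mixed INT32/FP32 issue probe (not the paper's exact kernel).
// On unified INT32/FP32 units the interleaved stream should approach the
// throughput of a pure stream; split pipelines leave one pipe partially idle.
__global__ void mixed_int_fp_issue(float *fout, int *iout, long long *cycles,
                                   int iters) {
    float f = threadIdx.x * 1.0f;
    int   i = threadIdx.x;
    long long start = clock64();
    #pragma unroll 4
    for (int k = 0; k < iters; ++k) {
        f = fmaf(f, 1.0001f, 0.5f);   // FP32 work
        i = i * 3 + 7;                // INT32 work (typically an integer multiply-add)
    }
    long long stop = clock64();
    // Store both results so neither dependency chain can be eliminated.
    fout[threadIdx.x]   = f;
    iout[threadIdx.x]   = i;
    cycles[threadIdx.x] = stop - start;
}
```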
Warp Scheduling
- Latency Hiding: Hopper exhibits superior latency hiding for short dependency chains, attributed to deeper instruction buffering and aggressive warp scheduling. Blackwell, in contrast, provides smoother throughput scaling with increasing ILP, favoring regular, high-ILP kernels.
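The ILP sweep behind this observation can be reproduced with a simple template kernel. The sketch below is illustrative (names and constants are assumptions): it holds the instruction mix fixed while varying the number of independent FMA chains per thread and the number of resident warps.

```cuda
// Illustrative ILP probe: ILP independent FMA chains per thread.
// Sweeping ILP (1, 2, 4, ...) and the launched warp count shows how each
// architecture hides FP32 latency.
template <int ILP>
__global__ void fp32_ilp_chain(float *out, long long *cycles, int iters) {
    float acc[ILP];
    #pragma unroll
    for (int j = 0; j < ILP; ++j) acc[j] = threadIdx.x + j;

    long long start = clock64();
    for (int k = 0; k < iters; ++k) {
        #pragma unroll
        for (int j = 0; j < ILP; ++j)
            acc[j] = fmaf(acc[j], 1.0001f, 0.5f);   // independent chains
    }
    long long stop = clock64();

    float sink = 0.f;
    #pragma unroll
    for (int j = 0; j < ILP; ++j) sink += acc[j];
    out[threadIdx.x]    = sink;                      // keep the chains live
    cycles[threadIdx.x] = stop - start;
}
// Example launch: fp32_ilp_chain<4><<<1, 32>>>(dOut, dCycles, 1 << 14);
```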
Tensor Core Microarchitecture and Low-Precision Compute
Instruction Set and Datatype Support
- FP4/FP6 Support: Blackwell's 5th-gen tensor cores natively support FP4 and FP6 via new PTX instructions (e.g., mma.sync.aligned.kind::f8f6f4; an inline-PTX sketch follows this list). Hopper lacks these formats, supporting only up to FP8.
- Instruction Mapping: The lowering of these PTX instructions to SASS opcodes (OMMA, QMMA, HMMA) is validated by inspecting the generated SASS; Blackwell's software stack is still maturing toward full low-precision support.
- Throughput and Latency: Blackwell achieves higher throughput and lower latency for low-precision (FP4/FP6/FP8) tensor core operations, especially at high ILP and low warp counts. Hopper requires more concurrent warps to saturate execution units, reflecting a design optimized for bulk concurrency.
- Power Efficiency: Blackwell demonstrates significantly lower power consumption for FP4 (16.75W) and FP6 (39–46W) workloads compared to Hopper's FP8 (55W), highlighting the architectural efficiency gains for inference and quantized workloads.
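For reference, the sketch below shows the shape of the new instruction in inline PTX for an e4m3 (FP8) tile. It is a hedged illustration, not the paper's code: the m16n8k32 tile, the qualifier ordering, and the per-thread fragment sizes follow the PTX ISA's description of kind::f8f6f4 and should be checked against the ISA; the fragment loads are placeholders rather than a correct fragment gather; and compilation requires a Blackwell target (e.g., sm_120a) on a recent CUDA toolkit.

```cuda
// Illustrative probe of the Blackwell-only mma.sync kind::f8f6f4 path
// (assumptions noted in the text; compile for a Blackwell target such as sm_120a).
__global__ void mma_f8f6f4_probe(const unsigned *A, const unsigned *B, float *D) {
    // Per-thread register fragments for one m16n8k32 tile: 4x b32 for A,
    // 2x b32 for B, 4x f32 for the accumulator (layout defined by the PTX ISA;
    // the loads below are placeholders, not a correct fragment gather).
    unsigned a0 = A[threadIdx.x * 4 + 0], a1 = A[threadIdx.x * 4 + 1];
    unsigned a2 = A[threadIdx.x * 4 + 2], a3 = A[threadIdx.x * 4 + 3];
    unsigned b0 = B[threadIdx.x * 2 + 0], b1 = B[threadIdx.x * 2 + 1];
    float d0 = 0.f, d1 = 0.f, d2 = 0.f, d3 = 0.f;

    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.kind::f8f6f4.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1));

    D[threadIdx.x * 4 + 0] = d0;
    D[threadIdx.x * 4 + 1] = d1;
    D[threadIdx.x * 4 + 2] = d2;
    D[threadIdx.x * 4 + 3] = d3;
}
```

Disassembling the resulting binary shows which of the SASS opcodes named above (OMMA, QMMA, HMMA) the instruction lowers to, which is the kind of check the paper uses to validate the PTX-to-SASS mapping.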
Memory Hierarchy: Latency, Bandwidth, and Scaling
Shared Memory and L1 Cache
- Capacity and Latency: Hopper's larger combined L1/shared memory capacity (up to 256 KB per SM) scales better under high warp pressure and strided access patterns. Blackwell, with 128 KB per SM, offers lower latency at low warp counts but is more sensitive to bank conflicts and partition saturation (a strided shared-memory probe is sketched after this list).
- Warp Scaling: Both architectures show increased latency with more warps, but Hopper's design is more resilient under heavy load.
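As referenced above, a minimal sketch of a strided shared-memory probe (illustrative; names, array size, and strides are assumptions): each load depends on the previous one, so the measured cycles per iteration approximate shared-memory latency, and sweeping the stride and the resident warp count exposes bank conflicts and warp-pressure effects.

```cuda
// Illustrative shared-memory pointer chase (not the paper's exact kernel).
template <int STRIDE>
__global__ void smem_chase(unsigned *out, long long *cycles, int iters) {
    constexpr int N = 32 * 33;
    __shared__ unsigned next[N];
    // Build a strided ring: element i points STRIDE elements ahead.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        next[i] = (i + STRIDE) % N;
    __syncthreads();

    unsigned idx = threadIdx.x;
    long long start = clock64();
    for (int k = 0; k < iters; ++k)
        idx = next[idx];                     // each load depends on the last
    long long stop = clock64();

    out[threadIdx.x]    = idx;               // keep the chase live
    cycles[threadIdx.x] = stop - start;
}
```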
L2 Cache and Global Memory
- Partitioning vs. Unification: Hopper's partitioned L2 cache delivers lower latency at low concurrency, while Blackwell's unified L2 (65 MB) maintains performance under extreme load, favoring bandwidth-bound applications.
- Bandwidth: Hopper achieves higher peak read bandwidth (15.8 TB/s) and lower global memory latency, owing to its HBM2e memory system. Blackwell's GDDR7 falls short on both counts, but the architecture is tuned for energy efficiency and mixed workloads.
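A bandwidth-style measurement can be approximated with a streaming read kernel such as the sketch below (illustrative; names and the dead-store trick are assumptions). Timing the launch on the host while sweeping the buffer size above and below the L2 capacity (65 MB on the tested Blackwell part) separates cache bandwidth from DRAM bandwidth.

```cuda
// Illustrative streaming-read kernel for bandwidth measurements.
__global__ void stream_read(const float4 *__restrict__ in, float *out, size_t n4) {
    size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    float acc = 0.f;
    for (; i < n4; i += stride) {
        float4 v = in[i];                    // 16-byte coalesced loads
        acc += v.x + v.y + v.z + v.w;
    }
    if (acc == -1.f) out[0] = acc;           // unlikely store keeps the loads live
}
// Host side: bandwidth = (n4 * sizeof(float4)) / elapsed_seconds.
```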
Application-Level Case Studies
Dense GEMM
- Throughput: Hopper consistently outperforms Blackwell in FP8 GEMM throughput (up to 0.887 TFLOP/s vs. 0.233 TFLOP/s for 8192³ matrices), with more stable kernel selection and lower runtime variability.
- Power: Blackwell exhibits higher power draw and lower performance-per-watt for large GEMM kernels, despite theoretical improvements in low-precision compute.
- Power Scaling: Blackwell demonstrates better power scaling with reduced precision (down to 45W in FP8), while Hopper maintains a flat power profile across precisions. This suggests Blackwell's suitability for energy-constrained inference scenarios, provided software and kernel selection are optimized.
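The power figures above depend on how board power is sampled. The paper's exact tooling is not detailed in this summary, but the sketch below shows one plausible approach using NVML to record peak board power while a workload runs (Linux-style usleep; link against the NVML library).

```cuda
// Illustrative NVML power sampler (assumed approach, not the paper's tooling).
#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned peak_mw = 0;
    for (int s = 0; s < 100; ++s) {              // ~1 s of 10 ms samples;
        unsigned mw = 0;                         // run the GEMM concurrently
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS && mw > peak_mw)
            peak_mw = mw;
        usleep(10 * 1000);
    }
    printf("peak board power: %.1f W\n", peak_mw / 1000.0);
    nvmlShutdown();
    return 0;
}
```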
Implications and Future Directions
The findings have several practical implications:
- Kernel Optimization: Developers targeting Blackwell should prioritize high-ILP, low-precision workloads to exploit the architecture's strengths. Mixed INT32/FP32 pipelines and FP4/FP6 tensor core support can be leveraged for efficient inference and quantized models.
- Precision Trade-offs: The reduced FP64 capability in Blackwell necessitates careful algorithmic design for HPC workloads, potentially requiring mixed-precision strategies or offloading to datacenter-class GPUs.
- Memory Tuning: Memory-bound kernels must account for the smaller L1/shared memory and unified L2 cache in Blackwell, with attention to access patterns and warp scheduling to avoid contention.
- Software Stack Maturity: The observed performance regressions in GEMM and kernel selection on Blackwell highlight the need for further compiler and library optimization to fully realize the hardware's potential.
Looking forward, the architectural trends in Blackwell—favoring low-precision, high-throughput compute, and energy efficiency—are likely to influence future GPU designs, especially as AI inference and edge deployment become more prevalent. The microbenchmarking methodology established in this work provides a robust framework for evaluating and tuning emerging architectures, and the detailed empirical data will inform both hardware and software co-design in the evolving GPU landscape.