NVIDIA Blackwell Architecture

Updated 16 July 2025
  • NVIDIA Blackwell architecture is a GPU microarchitecture featuring redesigned memory subsystems, unified execution pipelines, and advanced 5th-generation tensor cores.
  • The design optimizes latency and throughput with a reduced L1 cache per SM, an expanded monolithic L2 cache, and improved scheduling for mixed workloads.
  • Innovations such as unified INT32/FP32 pipelines and support for low-precision tensor cores (FP4/FP6) enable enhanced performance-per-watt across diverse applications.

The NVIDIA Blackwell architecture represents a significant evolution in GPU microarchitecture, characterized by substantial redesigns of its memory subsystem, execution pipelines, and sub-core units. Developed for high-throughput scientific and consumer workloads, the architecture underlying the GeForce RTX 5080 has been comprehensively analyzed through microbenchmarks, with particular attention to its memory hierarchy, streaming multiprocessor (SM) design, tensor core capabilities, and comparative performance and power characteristics relative to the prior-generation Hopper architecture. Key findings elucidate how Blackwell's architectural choices influence latency, throughput, power efficiency, and optimal workload strategies (Jarmusch et al., 14 Jul 2025).

1. Architectural Subsystems

Memory Hierarchy

The memory hierarchy in Blackwell is defined by a unified memory subsystem per SM, wherein shared memory and the L1 cache share hardware resources. Relative to Hopper, Blackwell’s L1 cache per SM is reduced to 128 KB (from 256 KB), but this reduction is offset by a larger, monolithic 65 MB L2 cache (versus Hopper’s 50 MB L2 distributed across two partitions).

  • At low warp counts, shared memory access latency is lower on Blackwell.
  • As concurrency rises (higher warp counts or larger access strides), Blackwell’s smaller, unified shared-memory partition exhibits steeper latency growth and more bank conflicts than Hopper (see the strided-access sketch after the diagram below).

+-----------------------------+
|       Global Memory         |
+-------------+---------------+
              │
     +-------------------------+
     |      L2 Cache (65 MB)   |
     +-------------------------+
              │
     +-----------------------+
     | L1 Cache/Shared Mem   |
     |   (128 KB per SM)     |
     +-----------------------+
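
As an illustration of the strided-access behavior described above, the following minimal sketch (an assumed harness with illustrative parameters, not the paper's actual benchmark code) times strided shared-memory reads with the SM clock; power-of-two strides map multiple lanes of a warp onto the same bank and expose the conflict penalty:

#include <cstdio>
#include <cuda_runtime.h>

constexpr int SMEM_WORDS = 1024;   // small 4 KB buffer within the unified L1/shared space

// Times strided shared-memory reads; larger per-access cycle counts at
// power-of-two strides indicate bank conflicts serializing the warp.
__global__ void smem_stride(unsigned *out, int stride, int iters) {
    __shared__ unsigned buf[SMEM_WORDS];
    int tid = threadIdx.x;
    for (int i = tid; i < SMEM_WORDS; i += blockDim.x) buf[i] = i;
    __syncthreads();

    unsigned idx = (tid * stride) % SMEM_WORDS;
    unsigned sink = 0;
    unsigned t0 = clock();
    for (int i = 0; i < iters; ++i) {
        sink += buf[idx];                         // conflicting lanes serialize here
        idx = (idx + 32u * stride) % SMEM_WORDS;  // advance the whole warp's window
    }
    unsigned t1 = clock();
    if (tid == 0) { out[0] = (t1 - t0) / iters; out[1] = sink; }  // sink defeats DCE
}

int main() {
    unsigned *d_out, h_out[2];
    cudaMalloc(&d_out, 2 * sizeof(unsigned));
    int strides[] = {1, 2, 4, 8, 16, 32};
    for (int stride : strides) {                  // sweep strides with one warp
        smem_stride<<<1, 32>>>(d_out, stride, 1024);
        cudaMemcpy(h_out, d_out, 2 * sizeof(unsigned), cudaMemcpyDeviceToHost);
        printf("stride %2d: ~%u cycles per access\n", stride, h_out[0]);
    }
    cudaFree(d_out);
    return 0;
}

The per-access cycle count also includes loop arithmetic, so the resulting curve should be read for its relative shape rather than for absolute latencies.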

Streaming Multiprocessor (SM) Execution Pipelines

Blackwell’s SM follows the established NVIDIA warp scheduling and instruction issue paradigm but introduces scheduling improvements for divergent workloads. A defining feature is the use of unified execution pipelines for INT32 and FP32 operations; both operate via the same hardware units, allowing mixed workloads to achieve greater efficiency by reducing idle cycles. Empirical latency measurements indicate four cycles (true latency) for both pure INT32 and FP32 on Blackwell and Hopper, but mixed workloads see lower completion latency and improved throughput on Blackwell.
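
The following kernel fragment is a hedged sketch (names and constants are illustrative) of the mixed INT32/FP32 pattern that benefits from the unified pipeline; the two chains are independent, so the scheduler can interleave integer and floating-point issue on the shared units rather than leaving one path idle:

// Two independent arithmetic chains, one FP32 and one INT32. On Blackwell
// both map to the same unified execution units, so interleaving them fills
// issue slots; on Hopper they dispatch to separate dedicated paths.
__global__ void mixed_int_fp(float *fout, int *iout, int iters) {
    float f = threadIdx.x * 0.5f;
    int   i = threadIdx.x;
    for (int k = 0; k < iters; ++k) {
        f = f * 1.000001f + 0.5f;   // FP32 FMA chain
        i = i * 3 + 7;              // INT32 IMAD chain, independent of f
    }
    fout[threadIdx.x] = f;          // stores keep both chains live
    iout[threadIdx.x] = i;
}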

SM Sub-Core Units and 5th-Generation Tensor Cores

A principal refinement in Blackwell lies within its SM sub-core units, namely the introduction of 5th-generation tensor cores. These units extend hardware support to lower-precision formats FP4 and FP6, supplementing the existing FP8 capability.

  • Tensor core warp-group instructions on Blackwell transition from Hopper’s “wgmma” to “tcgen05” at the SASS level.
  • There is transitional ambiguity: some FP4 instructions currently map to QMMA at the SASS level until broader software support arrives.
  • MMA instructions are now extensible with precision suffixes, e.g., mma.sync.aligned.m16n8k32.f32.f16.f16.f32.kind::f8f6f4.
  • Tiling support accommodates diverse matrix shapes (such as m16n8k32 and m8n8k16), exploiting higher instruction-level parallelism (ILP) to bolster throughput, especially under reduced warp activity (see the warp-level sketch after this list).
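
As a hedged illustration of the warp-level MMA pattern underlying these instructions, the sketch below uses the long-standing portable WMMA API at FP16 with FP32 accumulation; the FP4/FP6 paths named above instead require the newer PTX forms (e.g., the .kind::f8f6f4 suffix) or a library such as CUTLASS, and are not shown here:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively computes C = A * B + C for a single
// 16x16x16 tile, FP16 inputs with FP32 accumulation.
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);                    // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);                      // tensor core MMA
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}

Launching several such tiles per block, each handled by its own warp, is the usual way to raise ILP when warp activity is low, as the tiling discussion above suggests.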

2. Latency, Throughput, and Cache Behavior

Latency and Throughput

Latency is measured as both true latency (the cycle count of sequential, dependent instructions without overlap) and completion latency (the effective count when independent instructions are overlapped). Blackwell demonstrates lower completion latency for mixed INT32/FP32 workloads, an effect attributed to its unified pipeline (a measurement sketch follows the list below).

  • Throughput, quantified as instructions per clock cycle per SM, is enhanced via Blackwell’s improved scheduling and unified execution resources, especially prominent in low-precision tensor core workloads (FP4/FP6). For sufficiently high ILP, Blackwell sustains throughput surpassing 11 TFLOP/s.
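
The probe below (an assumed sketch, not the paper's harness) makes the two definitions concrete: reduced to a single dependent chain, the cycles-per-FMA figure approaches true latency; with four independent chains, overlap lowers the per-instruction figure toward completion latency:

// Four independent FMA chains timed with the SM clock. Loop overhead is
// ignored; with one chain the result approaches the ~4-cycle true latency,
// with four chains it approaches the overlapped completion latency.
__global__ void latency_probe(float *out, unsigned *cycles, int iters) {
    float a = 1.0f, b = 2.0f, c = 3.0f, d = 4.0f;
    unsigned t0 = clock();
    for (int i = 0; i < iters; ++i) {
        a = a * 0.999f + 1.0f;   // chain 1
        b = b * 0.999f + 1.0f;   // chain 2, independent of chain 1
        c = c * 0.999f + 1.0f;   // chain 3
        d = d * 0.999f + 1.0f;   // chain 4
    }
    unsigned t1 = clock();
    out[threadIdx.x] = a + b + c + d;                         // keep chains live
    if (threadIdx.x == 0) *cycles = (t1 - t0) / (4 * iters);  // cycles per FMA
}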

Cache Effects and Tuning Curves

  • Pointer-chase and strided-access microbenchmarks (a pointer-chase sketch follows this list) elucidate cache-boundary behaviors. L1/shared-memory latencies typically span 30–40 cycles on both architectures, though Blackwell’s smaller L1 induces more rapid latency growth as warp activity increases.
  • L2 cache performance varies with design: Hopper’s dual-partition structure achieves lower latency (~273 cycles) under moderate concurrency, while Blackwell’s single large L2 exhibits uniformly higher latency (~358 cycles) that stabilizes at larger working-set sizes.
  • Hopper more effectively hides latency in short dependency chains, favoring latency-sensitive workloads; Blackwell is optimized for kernels with high ILP, where instruction overlap can be better exploited.
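
A classic pointer-chase kernel of the kind referenced above (a hedged sketch; the host-side permutation setup is assumed) walks a dependent chain of loads so that each access waits on the previous one, making cache-boundary latencies directly visible as the working set crosses the 128 KB L1 and 65 MB L2 capacities:

// Single-thread pointer chase: the host fills `chain` with a random cyclic
// permutation spanning the desired working-set size; each load's address
// depends on the previous load, so latency cannot be hidden.
__global__ void pointer_chase(const unsigned *chain, unsigned *cycles,
                              unsigned *sink, int iters) {
    unsigned idx = 0;
    unsigned t0 = clock();
    for (int i = 0; i < iters; ++i)
        idx = chain[idx];            // serialized, latency-bound loads
    unsigned t1 = clock();
    *sink = idx;                     // defeats dead-code elimination
    *cycles = (t1 - t0) / iters;     // average cycles per load
}

Launched as pointer_chase<<<1, 1>>>(...), sweeping the permutation's footprint traces out the latency plateaus for L1, L2, and global memory.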

3. Comparative Evaluation: Blackwell Versus Hopper

Architectural and Functional Contrasts

The GeForce RTX 5080 (Blackwell) and H100 PCIe (Hopper) provide the basis for cross-generational comparison.

  • Hopper possesses a greater number of dedicated FP64 units (64 vs. 2 per SM in Blackwell) and separates INT32 and FP32 execution paths, whereas Blackwell unifies these resources, resulting in increased flexibility under mixed workload scenarios.
  • Blackwell’s generational advancements in tensor cores permit lower precision arithmetic (FP4, FP6), expanding the range of efficient computation, particularly for inference and consumer applications.

Comparative Performance

  • Despite improvements, dense GEMM microbenchmarks reveal that Hopper (GH100) achieves superior throughput for larger operands, likely a result of more mature compiler heuristics and optimized scheduling.
  • Hopper’s L2 design yields lower latency and improved memory throughput under modest concurrency, but at higher loads Blackwell’s expanded L2 cache allows it to match or surpass Hopper’s aggregate bandwidth.
  • Hopper is preferred in use cases demanding high FP64 throughput, deep buffering, and aggressive scheduling; Blackwell excels where power sensitivity and low-precision arithmetic are prioritized.

Aspect                 | Blackwell (GeForce RTX 5080) | Hopper (H100 PCIe)
L1 Cache/SM            | 128 KB                       | 256 KB
L2 Cache               | 65 MB (monolithic)           | 50 MB (dual partition)
FP64 Units/SM          | 2                            | 64
Tensor Core Precision  | FP4, FP6, FP8                | FP8 (no native FP4/FP6)

4. Power Efficiency and Energy Consumption

Blackwell’s advances in power efficiency stem from both architectural and data-format innovations (a power-sampling sketch follows the list below):

  • For tensor core operations, FP4 workloads consume significantly less power (~16.75 W) compared to FP6 (~39–46 W) and FP8 (~46.66 W). Hopper, which lacks native support for FP4/FP6, operates at higher sustained consumption for FP8 workloads (~55 W).
  • Realistic workloads—such as dense GEMM and Transformer inference—highlight variability in Blackwell’s power draw. Peak consumption can exceed 114 W for large matrix sizes, whereas Hopper maintains a more stable range near 58–60 W.
  • While Blackwell can achieve superior efficiency via low-precision computation (e.g., Transformer inference in FP8 sees power diminish from 58.8 to 45 W), Hopper can offer higher overall performance-per-watt in high-throughput scenarios.
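
Power figures like those above are typically gathered by sampling the board sensor while a workload runs. The host-side sketch below uses NVML (error checking omitted; link with -lnvidia-ml); the sampling loop and interval are illustrative assumptions:

#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    // ... launch the kernel under test asynchronously here ...

    unsigned int mw = 0;
    for (int s = 0; s < 100; ++s) {          // ~1 s of samples, 10 ms apart
        nvmlDeviceGetPowerUsage(dev, &mw);   // board power in milliwatts
        printf("%u.%03u W\n", mw / 1000, mw % 1000);
        usleep(10000);
    }
    nvmlShutdown();
    return 0;
}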

5. Optimization Strategies and Actionable Recommendations

Extracting peak performance from Blackwell requires careful application and kernel design:

  • Application Development: Kernels should be restructured to maximize instruction-level parallelism (ILP), taking advantage of the unified INT32/FP32 units and advanced tensor cores (see the sketch after this list). For memory-bound code, layouts that minimize bank conflicts and respect the limited size of the unified L1/shared memory are critical.
  • Compiler Design: Scheduling heuristics should prioritize the interleaving of mixed precision instructions to exploit Blackwell’s lower completion latency. Compilers should implement auto-tuning for tensor core tile sizes, using explicit low-precision suffixes (e.g., .kind::f8f6f4) to select optimal MMAs for FP4, FP6, and FP8.
  • Performance Engineering: Microbenchmark-informed optimization (focusing on warp occupancy and ILP) is recommended. Analyzing latency/throughput curves can reveal both cache and sub-core bottlenecks. Power profiling under varying precisions and batch sizes assists in selecting the best performance-per-watt configurations.
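
As a concrete instance of the ILP recommendation above, the hedged sketch below rewrites a dot-product loop from one serial accumulator into four independent accumulators (names and tail handling are illustrative), giving the unified pipeline independent work to overlap:

// Grid-stride dot product with four independent FMA chains; a single
// accumulator would serialize on the ~4-cycle FMA latency. The tail when
// n is not a multiple of 4 is omitted for brevity.
__global__ void dot_ilp4(const float *x, const float *y, float *out, int n) {
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    int stride = 4 * gridDim.x * blockDim.x;
    for (; i + 3 < n; i += stride) {
        a0 += x[i + 0] * y[i + 0];   // four independent chains
        a1 += x[i + 1] * y[i + 1];
        a2 += x[i + 2] * y[i + 2];
        a3 += x[i + 3] * y[i + 3];
    }
    atomicAdd(out, a0 + a1 + a2 + a3);   // simple reduction for illustration
}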

6. Performance Measurement and Calculation

A central performance metric for Blackwell’s evaluation is the achieved throughput in dense matrix-matrix multiplication (GEMM), calculable as:

\text{TFLOPS} = \frac{2 \times M \times N \times K}{\text{runtime} \times 10^{12}}

where M, N, and K are the matrix dimensions and runtime is measured in seconds; the factor of 2 counts the multiply and the add in each inner-product term. This metric allows normalized comparison across architectures and precision formats.
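
A minimal host-side timing sketch applying this formula (gemm_kernel, grid, block, and the operand pointers are placeholders, not a specific library API):

// Time a GEMM with CUDA events and convert to TFLOPS per the formula above.
int M = 4096, N = 4096, K = 4096;            // example dimensions

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// gemm_kernel<<<grid, block>>>(A, B, C, M, N, K);   // workload under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed milliseconds
double tflops = 2.0 * M * N * K / (ms * 1e-3) / 1e12;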

7. Significance, Implications, and Limitations

The NVIDIA Blackwell architecture delivers notable generational progress, particularly for low-precision arithmetic and selective energy efficiency. Unified execution pipelines and 5th-generation tensor cores with support for FP4/FP6 are distinguishing features tailored for emerging AI inference and consumer workloads.

Nonetheless, persistent trade-offs are evident. Hopper’s superiority for FP64-dominated, latency-hiding workloads, together with its larger L1 per SM, continues to benefit complex HPC and training tasks. Blackwell’s steeper latency growth under high concurrency cautions against transplanting Hopper-optimized workloads without retuning.

This suggests that developers, compiler engineers, and performance analysts must adapt design and optimization practices to fully exploit Blackwell’s strengths, particularly by increasing instruction-level parallelism, fine-tuning memory access patterns, and targeting low-precision compute where feasible.

In summary, Blackwell-based platforms present new avenues for application and system optimization, particularly where high throughput and energy efficiency at low precision are paramount, while also reinforcing the continued importance of architecture-specific, empirically grounded tuning and benchmarking for state-of-the-art GPU computing (Jarmusch et al., 14 Jul 2025).

References

  1. Jarmusch et al., 14 Jul 2025.