Blackwell GPU Architecture
- Blackwell GPU Architecture is a next-generation design that features fifth-generation tensor cores, dedicated 256 KB TMEM per SM, and a hardware decompression engine for efficient data processing.
- It restructures the memory hierarchy with a larger unified L2 cache and a configurable L1/shared memory setup to enhance performance in dense and sparse matrix computations.
- Comparative analysis with Hopper and H200 shows improved energy efficiency and lower latency in mixed-precision inference and transformer workloads, despite some DRAM bandwidth trade-offs.
NVIDIA Blackwell is a GPU architecture introduced to address the computational demands of exascale simulation, AI model training, inference, and high-throughput scientific computing. Relative to Hopper and the H200, Blackwell advances GPU design with fifth-generation tensor cores, tensor memory (TMEM), a hardware decompression engine (DE), a large unified L2 cache, and warp-level scheduling innovations. Research groups have characterized Blackwell through extensive microbenchmarking to dissect these architectural features and their impact on dense and sparse matrix multiplication, transformer workloads, and mixed-precision inference and training (Jarmusch et al., 14 Jul 2025, Jarmusch et al., 1 Dec 2025).
1. Microarchitectural Building Blocks
The Blackwell architecture centers on several novel subsystems that drive its computational and memory efficiency:
- Fifth-Generation Tensor Core Pipeline: Employs warp-level, single-thread MMA instructions with the new PTX opcode tcgen05.mma, supplanting Hopper's wgmma. Each thread issues MMA operations independently, eliminating warp-wide instruction barriers. Instruction latency is nearly constant (~11 cycles) across tile sizes and numeric precisions, with supported formats including FP64, TF32, BF16, FP16, FP8, FP6, FP4, and INT8, mapped to the corresponding SASS instructions (DMMA, HMMA, QMMA, OMMA, IMMA).
- Tensor Memory (TMEM): Each SM integrates a dedicated 256 KB TMEM (512 columns × 128 lanes × 32 bits) accessed via data-path-specific instructions (tcgen05.cp, tcgen05.ld, tcgen05.st). TMEM read bandwidth reaches 16 TB/s per SM and write bandwidth 8 TB/s, additive to traditional shared memory. TMEM reduces cache miss latency by 58%, measured at ≈420 cycles versus ≈1000 cycles on H200.
- Hardware Decompression Engine (DE): Native acceleration for LZ4, Snappy, Zstandard, GZIP, Cascaded, Bitcomp, and ANS, delivering sub-millisecond latency on 100 MB blocks and throughput up to 539 GB/s. DE input bandwidth varies with compression ratio, and the engine parallelizes well once batch sizes reach saturation.
- Dual-Chip Arrangement and Interconnect: The B200 utilizes two identical dies (74 SMs each, totaling 148 SMs per GPU), eight GPCs, four L2 partitions per die, and eight HBM3e stacks per die. The address space is unified at 192 GB, and die-to-die coherency is maintained via NV-HBI (≈1.5 TB/s link).
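The stated per-SM TMEM capacity follows directly from the geometry given above; as a quick arithmetic check:

```latex
% TMEM capacity check (values from the TMEM description above)
512~\text{columns} \times 128~\text{lanes} \times 32~\text{bit}
  = 2{,}097{,}152~\text{bit} = 262{,}144~\text{byte} = 256~\text{KB per SM}
```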
2. Memory Hierarchy and Architectural Organization
Blackwell features a restructured memory hierarchy compared to Hopper and H200:
- L1 Data Cache & Shared Memory: Configurable unified L1/shared memory per SM—128 KB on Blackwell, reduced from 256 KB on Hopper. TMEM augments this L1/SMEM.
- L2 Cache: Blackwell provides a 64–65 MB monolithic (or partitioned) L2 cache, up from Hopper’s 50 MB (2×25 MB). B200 uses 8×8 MB L2 slices, private to partitions.
- Global Memory: RTX 5080 (Blackwell) employs 16 GB GDDR7, while B200 leverages 192 GB HBM3e across dual dies.
- Cache Line and Policy: Cache line size is 128 bytes. Observed data access patterns indicate approximate LRU replacement (L1) and pseudo-LRU (L2), although policies are undocumented.
- Bandwidth and Latencies (Measured):
| Subsystem | Blackwell Latency (cycles) | Hopper/H200 Latency (cycles) |
|---|---|---|
| L1 Hit | ≈22–40 | ≈30–40 |
| L2 Hit | ≈128–358 | ≈273 (Hopper); ≈128 (H200) |
| TMEM Miss | ≈420 | — |
| DRAM Access | ≈877 (5080) / ≈4200 (B200) | ≈659 (Hopper) / ≈1000 (H200) |
Sustained DRAM bandwidth (microbenchmarked):
| Subsystem | Blackwell (TB/s) | Hopper/H200 (TB/s) |
|---|---|---|
| Read | 8.2 (5080), 8.4 (B200) | 15.8 (Hopper), 4.38 (H200) |
| Write | 1.6 (5080) | 2.2 (Hopper) |
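Sustained-bandwidth figures of this kind are typically produced by STREAM-style copy kernels. The following is a minimal sketch, not the cited papers' harness: buffer size, launch configuration, and kernel name are illustrative, and a single timed pass is shown rather than the best-of-N averaging used in careful microbenchmarks.

```cuda
// Hedged sketch of a STREAM-copy DRAM bandwidth microbenchmark.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float4 *__restrict__ src,
                            float4 *__restrict__ dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)            // grid-stride loop, 16-byte vector accesses
        dst[i] = src[i];
}

int main() {
    const size_t bytes = 1ull << 32;      // 4 GiB per buffer (assumed to fit in device DRAM)
    const size_t n = bytes / sizeof(float4);
    float4 *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    copy_kernel<<<2048, 256>>>(src, dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per element.
    double tbps = 2.0 * bytes / (ms * 1e-3) / 1e12;
    printf("Sustained copy bandwidth: %.2f TB/s\n", tbps);

    cudaFree(src); cudaFree(dst);
    return 0;
}
```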
Effective bandwidth is governed by the cache miss rate: the larger the fraction of accesses that fall through the L1/L2 hierarchy, the closer sustained bandwidth drops toward the DRAM figures above.
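A simple two-level model captures this dependence; the notation and form below are an assumption of this write-up rather than a formula reproduced from the cited papers.

```latex
% Assumed two-level effective-bandwidth model:
%   m       = miss rate out of the last cache level
%   B_cache = sustained bandwidth for cache hits
%   B_DRAM  = sustained DRAM bandwidth
B_{\mathrm{eff}} \;=\; \left( \frac{1-m}{B_{\mathrm{cache}}} + \frac{m}{B_{\mathrm{DRAM}}} \right)^{-1}
```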
3. SM Execution and Scheduling Machinery
The Streaming Multiprocessor (SM) pipeline in Blackwell is characterized by these features:
- Warp Scheduler: Each SM hosts 4 warp schedulers; each can issue up to 2 independent instructions per cycle, with a peak issue width of 8 scalar ops (INT32/FP32) or 4 tensor ops.
- ALU Throughput and Latency: Unified INT32/FP32 ALU clusters deliver 4-way fused-MAC per cycle per lane; true dependency latency for INT32/FP32 is 4 cycles, FP64 is ~32–64 cycles (emulated), while tensor core MMA is ~1.21 cycles/tile for FP4 (see Table below).
- Tensor Core Throughput: For FP4, measured throughput achieves ~11 TFLOP/s per SM at 25 warps with ILP=6.
- Performance Model: Per-SM throughput is modeled as $\Theta_{\mathrm{SM}} \approx f_{\mathrm{clk}} \times \mathrm{IPC} \times W$, where $f_{\mathrm{clk}}$ is the SM clock (2.4–2.65 GHz), IPC is the number of instructions issued per cycle per warp scheduler, and $W$ is the number of active warps per SM (≤64). Global throughput is then $\Theta_{\mathrm{GPU}} \approx N_{\mathrm{SM}} \times \Theta_{\mathrm{SM}}$.
- Resource Contention: Instruction issue rates degrade with increased warps due to register file and shared-L1 contention.
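The active-warp term $W$ is bounded by occupancy. As a minimal sketch (the kernel, block size, and output formatting are illustrative assumptions, not from the source), the CUDA occupancy API reports how many warps of a given kernel can be co-resident per SM:

```cuda
// Hedged sketch: query co-resident warps per SM for a given kernel/block size,
// i.e., an upper bound on the W term in the issue-rate model above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = static_cast<float>(i) * 2.0f;   // trivial work; only resource usage matters here
}

int main() {
    const int blockSize = 256;               // 8 warps per block (assumed tuning point)
    int numBlocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel fit on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, dummy_kernel,
                                                  blockSize, /*dynamicSmemBytes=*/0);
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    int warpsPerSM = numBlocksPerSM * blockSize / prop.warpSize;
    printf("Active warps per SM: %d (of %d max)\n",
           warpsPerSM, prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```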
Representative Latency and Throughput Table:
| Operation | True Latency (cycles) | Completion Latency (cycles) | Throughput (ops/cycle/SM) |
|---|---|---|---|
| INT32 (mad) | 4 | 16.97 | 0.25 |
| FP32 (fma) | 4 | 7.97 | 0.50 |
| FP64 | ~37.5 | 37–64 (pipelined) | 0.054 |
| MMA (FP4, tile) | ~1.21 | — | see tensor core table |
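True-latency figures like those above are commonly obtained with dependent-chain timing, where each operation consumes the previous result so the issue rate is bounded by latency rather than throughput. The sketch below is a minimal illustration of that technique; the kernel name, iteration count, and clock64()-based timing are assumptions of this write-up, not the cited papers' exact harness.

```cuda
// Hedged sketch of a dependent-FMA latency microbenchmark.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float *out, long long *cycles, int iters) {
    float a = 1.000001f, b = 0.999999f, c = threadIdx.x * 1e-7f;
    long long start = clock64();
    // Each FMA depends on the previous result, so throughput cannot hide latency.
    #pragma unroll 16
    for (int i = 0; i < iters; ++i)
        c = fmaf(a, c, b);
    long long stop = clock64();
    if (threadIdx.x == 0) {
        *cycles = stop - start;
        *out = c;                      // keep the chain alive past dead-code elimination
    }
}

int main() {
    const int iters = 1 << 16;
    float *d_out;  long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));
    fma_latency<<<1, 32>>>(d_out, d_cycles, iters);   // one warp on one SM
    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("Approx. FP32 FMA dependency latency: %.2f cycles (includes loop overhead)\n",
           static_cast<double>(cycles) / iters);
    cudaFree(d_out); cudaFree(d_cycles);
    return 0;
}
```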
4. Tensor Core Enhancements and Mixed-Precision Capabilities
The 5th-generation tensor cores are distinctive due to:
- Precision Proliferation: Support for FP4 (E2M1), FP6 (E3M2), FP8 (E4M3), INT8, FP16, BF16, TF32, and FP64 in a unified pipeline. FP4 and FP6 are new in Blackwell, with hardware datapaths absent in Hopper/H200 (a format-conversion sketch follows the table below).
- Warp-Level Dispatch: MMA instructions are now issued at warp level and independently, with instruction latency 2.9–11.6× lower than Hopper's warp-group dispatch.
- Measured Throughput and Latency: Peak TFLOPS scale linearly with increased subword parallelism; FP4 achieves 7702.5 TFLOPS, FP6 5134.8 TFLOPS at 2.4 GHz.
- Energy Efficiency: Blackwell achieves 1.56× higher mixed-precision throughput and 42% better energy efficiency in end-to-end transformer training compared to H200 (Jarmusch et al., 1 Dec 2025).
| Precision | TFLOPS Blackwell | TFLOPS Hopper/H200 | Speedup |
|---|---|---|---|
| FP64 | 44.8 | 34.0 | 1.32× |
| FP32 | 481.2 | 378.4 | 1.27× |
| FP8 | 3851.4 | 3026.9 | 1.27× |
| FP6 | 5134.8 | N/A | New |
| FP4 | 7702.5 | N/A | New |
| INT8 | 3927.1 | 3088.4 | 1.27× |
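As a minimal, hedged illustration of the narrow formats above (the storage/conversion path only, not the tensor-core MMA path), CUDA's cuda_fp8.h conversion intrinsics can round FP32 values to E4M3 and back; host-side availability of these intrinsics is assumed for CUDA 11.8 and later, and the sample values are illustrative.

```cuda
// Hedged sketch: FP32 -> FP8 (E4M3) -> FP32 round trip using the CUDA FP8
// conversion intrinsics. Feeding FP8 operands to tcgen05.mma requires inline
// PTX or library kernels and is not shown here.
#include <cstdio>
#include <cuda_fp8.h>
#include <cuda_fp16.h>

int main() {
    const float weights[4] = {0.037f, -1.25f, 6.5f, 448.0f};   // illustrative values
    for (float w : weights) {
        // Round to E4M3, saturating to the finite range (max magnitude 448).
        __nv_fp8_storage_t q = __nv_cvt_float_to_fp8(w, __NV_SATFINITE, __NV_E4M3);
        // Decode back via half precision to inspect the quantization error.
        __half h = __nv_cvt_fp8_to_halfraw(q, __NV_E4M3);
        float back = __half2float(h);
        printf("%10.4f -> 0x%02x -> %10.4f\n", w, static_cast<unsigned>(q), back);
    }
    return 0;
}
```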
5. Comparative Analysis with Hopper and H200 Architectures
Blackwell introduces marked changes relative to prior NVIDIA architectures:
- Memory Hierarchy: Blackwell’s L2 cache is +30% larger than Hopper, but L1 is reduced by 50%. B200’s TMEM and 8-way L2 partitioning decrease cross-traffic and cache miss latencies compared to H200.
- Bandwidth: While Blackwell’s memory bandwidth is lower on some RTX 5080 variants (e.g., 8.2 TB/s read vs 15.8 TB/s Hopper), B200’s dual-chip design achieves ~8.4 TB/s off-chip bandwidth, with STREAM microbenchmarks measuring 7.48 TB/s (1.71× H200).
- GEMM Throughput: On large GEMM (8192³), Blackwell displays lower throughput (0.233 TFLOP/s vs 0.887 TFLOP/s on Hopper) and higher peak power (114 W vs 68 W). These regressions in large-GEMM and FP8 workloads are linked to immature driver/compiler kernels (cuBLASLt) on Blackwell (Jarmusch et al., 14 Jul 2025); a minimal GEMM timing harness of the kind behind such measurements is sketched after this list.
- Energy Efficiency: Blackwell’s transformer inference (FP32, FP16, FP8) consistently draws less power than Hopper (e.g., 45.1 W vs 57–60 W for FP8).
- Decompression Engine: Hardware DE delivers 50–200× speedup over software decompression in real workloads.
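The GEMM throughput comparisons above come from timed library calls. The following is a hedged, simplified harness: it uses cublasGemmEx with FP16 inputs and FP32 accumulation, uninitialized operands, and a single timed call, rather than the cuBLASLt-based, warmed-up, averaged setup of the cited study.

```cuda
// Hedged sketch of a cuBLAS GEMM timing harness (compile with -lcublas).
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192;                              // 8192^3 GEMM, as in the comparison above
    size_t elems = static_cast<size_t>(n) * n;
    __half *A, *B; float *C;
    cudaMalloc(&A, elems * sizeof(__half));          // operands left uninitialized; fine for timing
    cudaMalloc(&B, elems * sizeof(__half));
    cudaMalloc(&C, elems * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * (double)n * n / (ms * 1e-3) / 1e12;
    printf("GEMM %dx%dx%d: %.2f ms, %.1f TFLOP/s\n", n, n, n, ms, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```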
6. Workload Optimization and Practical Tuning Strategies
Empirical benchmarks and architectural analyses yield actionable recommendations:
- Dense GEMM: Tile operands as 64×64 for maximal TMEM bandwidth. Prefetch tiles via tcgen05.cp during ongoing MMA computation, and keep intermediate accumulators resident in TMEM to reduce L2/global traffic (a simplified shared-memory analogue is sketched after this list).
- Sparse GEMM / SpMV: Store blocks in compressed form in HBM (RLE/Bitcomp), decompress them on the fly in the DE, and stream decompressed rows into TMEM for reduction.
- Transformer Inference: Quantize weights to FP8 or FP4 (E4M3/E2M1); activations to FP16. Employ DE for loading quantized weights into TMEM, retain attention and intermediate results in TMEM.
- Mixed-Precision Training: BF16/FP16 accumulations are preferred for stability; full FP32 incurs ~50% perf penalty. Co-schedule CTAs in TPC pairs to share TMEM-resident data. Partition workload to balance shared memory (L1) and TMEM resources: smaller layers favor L1, large layers TMEM.
- Thread Block Scheduling: Blocks are mapped to SMs in round-robin fashion until SM resources are exhausted. Best latency hiding and ILP are achieved at 32–48 warps/SM and ILP ≥ 4.
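To make the dense-GEMM tiling recommendation concrete, here is a minimal shared-memory sketch in standard CUDA C++. It does not use TMEM or tcgen05 instructions (which require inline PTX or library kernels), and the 32×32 tile is smaller than the recommended 64×64; tile size, kernel name, and launch parameters are illustrative only.

```cuda
// Hedged sketch: shared-memory tiled GEMM as a software analogue of the
// TMEM-resident tiling strategy described above. Assumes N divisible by TILE.
#include <cuda_runtime.h>

constexpr int TILE = 32;

__global__ void tiled_gemm(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                                  // accumulator stays in registers

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Stage one tile of A and B into shared memory (stand-in for TMEM staging).
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc = fmaf(As[threadIdx.y][k], Bs[k][threadIdx.x], acc);
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Launch example:
//   dim3 block(TILE, TILE);
//   dim3 grid(N / TILE, N / TILE);
//   tiled_gemm<<<grid, block>>>(dA, dB, dC, N);
```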
7. Context, Limitations, and Future Implications
Blackwell's innovations (5th-generation tensor cores, TMEM, the DE, a unified L2, and the dual-chip approach) alter the design of memory-bound kernels and of mixed- and ultra-low-precision inference. Limitations remain evident: DRAM bandwidth regressions on the RTX 5080 and performance regressions in FP8 GEMM workloads highlight the impact of compiler/runtime maturity (cuBLASLt), and the reduction in L1 cache relative to Hopper must be compensated for by TMEM-aware algorithms. A plausible implication is that future workload optimizations will increasingly depend on maximizing TMEM use and hardware DE integration to offset L1/L2 constraints and sustain throughput at low precision. Blackwell offers measurable improvements in mixed-precision AI, energy efficiency, and decompression acceleration, positioning it as an architecture suitable for both light-warp scenarios and large-scale exascale computing deployments (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025).