Unified INT32/FP32 Execution Unit in Blackwell
- The unified INT32/FP32 execution unit is a 64-lane scalar ALU that dynamically executes FP32 and INT32 operations in NVIDIA Blackwell GPUs.
- It time-multiplexes FP32 fused-multiply-add (FMA) and INT32 multiply-add (MAD) instructions, issuing at most one type per cycle and reducing the physical resource duplication seen in earlier architectures.
- Microbenchmarks reveal that Blackwell’s design nearly halves mixed-workload latency compared to Hopper, significantly enhancing throughput and ALU occupancy.
A unified INT32/FP32 execution unit is a scalar arithmetic logic unit (ALU) cluster that dynamically executes either INT32 or FP32 operations on a cycle-by-cycle basis, sharing a common physical resource. In the NVIDIA Blackwell GPU architecture, this cluster replaces the separate INT32 and FP32 pipelines of earlier architectures, time-multiplexing execution such that either an FP32 fused-multiply-add (FMA) or an INT32 multiply-add (MAD) instruction can issue per cycle, but never both. This architectural shift aims to improve the utilization and efficiency of ALU resources, particularly for workloads that interleave integer and floating-point operations, and is characterized by specific changes in instruction scheduling, true and completion latencies, throughput, and mixed-type contention characteristics (Jarmusch et al., 14 Jul 2025).
1. Streaming Multiprocessor Sub-core Topology
The Blackwell Streaming Multiprocessor (SM) features a single unified INT32/FP32 ALU cluster, two dedicated FP64 pipelines, and four 5th-generation tensor-core sub-cores (supporting emerging precision types such as FP4, FP6, and FP8). This marks a departure from the Hopper (GH100) architecture, which contained 128 separate FP32 pipelines and 64 INT32 pipelines per SM, along with 64 FP64 units and four 4th-generation tensor cores.
| Architecture | FP32 ALUs | INT32 ALUs | FP64 ALUs | Tensor Cores |
|---|---|---|---|---|
| Hopper (GH100) | 128 | 64 | 64 | 4 (4th-gen) |
| Blackwell (RTX 5080) | Unified (64 lanes) | Unified (64 lanes) | 2 | 4 (5th-gen) |
This unification reduces physical resource duplication and alters the microarchitectural pipeline for integer and floating-point computations.
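As a practical orientation aid (not part of the cited study), the architecture, SM count, and clock of the device under test can be confirmed at runtime with the standard CUDA device-properties API; all fields used below are part of `cudaDeviceProp`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints compute capability, SM count, and clock rate of device 0, which is
// useful for confirming whether a microbenchmark is running on Hopper (sm_90)
// or a Blackwell-class part before interpreting latency results.
int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    // clockRate is reported in kHz.
    std::printf("%s: sm_%d%d, %d SMs, %.2f GHz\n",
                prop.name, prop.major, prop.minor,
                prop.multiProcessorCount, prop.clockRate / 1e6);
    return 0;
}
```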
2. Issuing Logic and Pipeline Multiplexing
On Hopper, separate pipelines enabled simultaneous dispatch of FP32 and INT32 instructions. In Blackwell, only one instruction type—either FP32 or INT32—can be decoded and executed per cycle within the unified cluster. The 64-lane scalar ALU cluster is time-shared; hardware dynamically decodes the opcode in each cycle and steers the operation accordingly. This design ensures that pure sequences of a single type can saturate the cluster, while mixed sequences must serialize access, though at lower stall penalties than in the previous architecture.
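For illustration, a minimal CUDA kernel of the kind that exercises this contention (hypothetical, not the benchmark from the cited study) interleaves dependent FP32 FMAs and INT32 MADs, so the unified cluster must alternate instruction types from cycle to cycle:

```cuda
// Hypothetical mixed-type inner loop: each iteration issues one FP32 FMA and
// one INT32 MAD. On a unified INT32/FP32 cluster these cannot issue in the
// same cycle, so the two chains contend for the single issue slot.
__global__ void mixed_fma_mad(float* fout, int* iout, int iters) {
    float f = threadIdx.x * 1.0f;
    int   i = threadIdx.x;
    for (int k = 0; k < iters; ++k) {
        f = fmaf(f, 1.000001f, 0.5f);   // typically compiles to an FP32 FFMA
        i = i * 3 + 7;                  // typically compiles to an INT32 IMAD
    }
    fout[threadIdx.x] = f;              // keep both chains live
    iout[threadIdx.x] = i;
}
```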
3. Latency and Throughput Characteristics
Under carefully controlled microbenchmarks, both Blackwell and Hopper exhibit the same pipeline depth for pure-type instruction chains: a true latency of 4 cycles for both INT32 (MAD) and FP32 (FMA). However, completion latency—the number of cycles until an independent instruction can retire—differs: 16.97 cycles for INT32 and 7.97 cycles for FP32 on Blackwell. Interleaved (mixed-type) chains display significantly different behaviors:
| GPU | Pure INT32 (True/Comp., cycles) | Pure FP32 (True/Comp., cycles) | Mixed 1:1 (True/Comp., cycles) | Mixed 2:1 (True/Comp., cycles) |
|---|---|---|---|---|
| Blackwell | 4 / 16.97 | 4 / 7.97 | 15.96 / 14.0 | 26.28 / 18.0 |
| Hopper (GH100) | 4 / 16.69 | 4 / 7.86 | 31.62 / 16.0 | 43.54 / 20.0 |
Mixed-type instruction chains on Blackwell increase true latency by a factor of 4–6× (to 16–26 cycles), while Hopper suffers an 8–11× blow-up (32–44 cycles). Thus, Blackwell substantially reduces the performance penalty for mixed FP32/INT32 workloads.
Theoretical throughput for an SM is defined as:

$$\text{Throughput}_{\text{SM}} = \frac{N_{\text{lanes}} \cdot f_{\text{clk}}}{L_{\text{comp}}}$$

where $N_{\text{lanes}} = 64$ is the number of scalar lanes in the cluster, $f_{\text{clk}}$ is the cluster's clock rate (e.g., 2.2 GHz), and $L_{\text{comp}}$ is the completion latency, 8 cycles for pure FP32.
With sufficient instruction-level parallelism (ILP), Blackwell’s unified cluster sustains approximately one FMA (FP32) or one MAD (INT32) instruction per cycle per SM when fully saturated, corresponding to up to 64 scalar operations per cycle.
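As a rough worked example (using the 2.2 GHz example clock and the ≈8-cycle FP32 completion latency above; the Little's-law framing is illustrative rather than taken from the study), the independent work needed to hide the completion latency and the resulting peak scalar rate are:

$$N_{\text{in-flight}} \approx L_{\text{comp}} \times \text{IPC}_{\text{issue}} \approx 8 \times 1 = 8 \ \text{independent FP32 FMA instructions in flight,}$$

$$\text{Peak rate} \approx 64 \ \text{lanes} \times 2.2 \times 10^{9}\,\text{Hz} \approx 1.4 \times 10^{11} \ \text{scalar ops/s per SM.}$$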
4. Hardware Allocation and Pipeline Stages
The Blackwell SM sub-core allocates major execution resources as follows:
- Warp scheduler and ISA decode unit
- Unified INT32/FP32 cluster: 64-lane scalar ALU
- Two dedicated FP64 pipelines per SM
- Four tensor-core sub-cores
A schematic representation:
    +--------------------------------+
    | Warp scheduler / ISA decode    |
    +--------------------------------+
                    |
        +-----------+-----------------------------------+
        |                                               |
    +----------+                             +-------------+
    | Scalar   | <- unified INT32/FP32       | Tensor Core |
    | ALUs     |    cluster (64 lanes)       | sub-cores   |
    +----------+                             +-------------+
         |
    +----------+
    | FP64     | <- 2 dedicated FP64 pipelines
    | ALUs     |
    +----------+
The decode and scheduling logic must guarantee that no more than one FP32/INT32 instruction issues per cycle within the unified ALU cluster.
5. Microbenchmark Methodology
Microbenchmarks implemented in PTX are used to avoid confounding effects of CUDA-compiler optimizations. Kernels explicitly select either `mad.lo.s32` (INT32) or `fma.rn.f32` (FP32), and the generated SASS (assembly) is verified to map one-to-one onto the source instructions. True and completion latencies are measured with the `%clock64` cycle counter, using dependency chains of varying length repeated 1024 times to gather averaged measurements. Throughput and ILP are evaluated via serialized chains with controlled loop trip counts.
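A condensed sketch of this measurement style is shown below (illustrative only: the chain length, grid shape, and names are assumptions, and the study's actual harness is written directly in PTX rather than via inline assembly):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of a dependent-chain latency microbenchmark. Each fma.rn.f32 depends
// on the previous one, so elapsed cycles / CHAIN_LEN approximates the true
// latency of an FP32 FMA (plus a small timing/loop overhead).
#define CHAIN_LEN 256

__global__ void fp32_true_latency(float seed, long long* cycles, float* sink) {
    float a = seed, b = 1.000001f, c = 0.5f;
    long long start, stop;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(start));
    #pragma unroll
    for (int i = 0; i < CHAIN_LEN; ++i) {
        // Dependent FMA chain expressed as inline PTX so the compiler cannot
        // break the dependence or substitute other instructions.
        asm volatile("fma.rn.f32 %0, %0, %1, %2;" : "+f"(a) : "f"(b), "f"(c));
    }
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop));
    if (threadIdx.x == 0) {
        *cycles = stop - start;
        *sink = a;  // keep the result live so the chain is not eliminated
    }
}

int main() {
    long long *d_cycles, h_cycles;
    float* d_sink;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, sizeof(float));
    fp32_true_latency<<<1, 32>>>(1.0f, d_cycles, d_sink);  // a single warp
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    std::printf("approx. FP32 FMA true latency: %.2f cycles\n",
                (double)h_cycles / CHAIN_LEN);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

Swapping the `fma.rn.f32` instruction for `mad.lo.s32` with integer operands yields the corresponding INT32 chain, and interleaving the two produces the mixed-type cases discussed above.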
6. Resource Contention and Workload Scheduling
When both INT32 and FP32 instructions are present, resource contention arises because the unified cluster cannot issue both in the same cycle. On Hopper, interleaved chains fare even worse: one instruction type stalls while the other occupies the issue path, yielding roughly twice the mixed-type true latency measured on Blackwell. In contrast, Blackwell's unified design, which switches dynamically each cycle, halves the mixed-type penalty: mixed workloads pay only a ~4× true-latency penalty rather than the ~8× seen on Hopper.
For pure workloads, Blackwell maintains unchanged true latency and peak throughput relative to its predecessor. Under high ILP, Blackwell’s throughput (IPC) scales smoothly with chain length and warp-fanout. Hopper’s scheduler, by contrast, exhibits higher variance under varying chain-lengths and mixed-type mixes. A plausible implication is that the Blackwell design offers superior resource efficiency for mixed workloads without sacrificing pure workload performance.
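Using the measured true latencies from the table in Section 3 above, the mixed 1:1 penalty factors relative to the 4-cycle pure-type baseline work out to:

$$\text{Blackwell: } \frac{15.96}{4} \approx 4.0\times, \qquad \text{Hopper: } \frac{31.62}{4} \approx 7.9\times.$$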
7. Implications and Architectural Significance
Replacing separate INT32 and FP32 pipelines with a single, unified 64-lane ALU cluster per SM in Blackwell maximizes utilization for mixed workloads while incurring only a moderate increase in scheduling complexity and modest mixed-type latency penalties. Both pure INT32 and FP32 workloads retain a true latency of 4 cycles. Completion latencies remain comparable to Hopper: ≈8 cycles for FP32 and ≈17 cycles for INT32. Mixed sequences raise true latency to ≈16–26 cycles (a 4–6× increase) but still outperform Hopper's 32–44 cycle range. This design is significant in that it cuts mixed-type pipeline penalties roughly in half and enables better ALU occupancy in heterogeneous compute scenarios (Jarmusch et al., 14 Jul 2025).