Unified INT32/FP32 Execution Units Overview

Updated 17 July 2025
  • Unified INT32/FP32 execution units are hardware constructs that execute both 32-bit integer and single-precision floating-point operations on shared compute resources.
  • They employ techniques such as time-multiplexing, dual-issue scheduling, and dynamic reconfiguration to achieve high throughput and energy efficiency.
  • These units are vital for AI, scientific computing, and embedded systems, delivering enhanced performance under tight power and area constraints.

A unified INT32/FP32 execution unit is a hardware or architectural construct capable of efficiently processing both 32-bit integer (INT32) and 32-bit single-precision floating-point (FP32) operations—often with mechanisms to allow rapid switching, concurrent execution, or dynamic sharing of compute resources between the two arithmetic domains. As intensive AI workloads, scientific computing, and general-purpose accelerators demand higher performance under tight power and area budgets, the need for such flexible and efficient compute structures has become increasingly pronounced. Approaches to unification span circuit-level integration, reconfigurable datapaths, innovative scheduling and ISA extensions, and software-hardware co-design.

1. Unified Execution Models: Key Concepts and Approaches

Several mechanisms enable unified INT32/FP32 execution:

  • Physical Sharing: Combining datapath segments (such as multipliers and adders) to enable time-multiplexed execution of FP32 and INT32 operations on the same physical logic, sometimes aided by dynamic reconfiguration (Lokhande et al., 16 Dec 2024).
  • Dual-Issue Architectures: Facilitating simultaneous (or highly overlapped) execution of independent integer and floating-point instructions using pipelined or decoupled threads on in-order or out-of-order cores (Colagrande et al., 26 Mar 2025).
  • Multi-Precision Processing: Employing flexible arithmetic units (e.g., CORDIC-based designs) that support a variety of precisions and numerical representations—including fixed-point and integer—via runtime selection signals (Lokhande et al., 16 Dec 2024); a minimal sketch follows this list.
  • Intermediate Representation: Leveraging frameworks such as dynamic fixed point (DFP), shared exponents, and block floating-point, which allow INT arithmetic to be used while preserving essential FP properties (range, scaling) for machine learning operations (Das et al., 2018).
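
As a concrete illustration of the CORDIC-based, runtime-precision-selected datapath mentioned above, the following Python sketch (illustrative only, not code from the cited designs) runs the classic circular-rotation CORDIC on a pure integer add/shift datapath, with the fixed-point width acting as the precision-select signal:

```python
import math

def cordic_sincos(theta, frac_bits):
    """Circular-rotation CORDIC on an integer add/shift datapath.

    frac_bits plays the role of a runtime precision-select signal: the
    same shift-and-add loop serves coarse or fine fixed-point formats.
    Valid for |theta| <= ~1.74 rad (the CORDIC convergence range).
    """
    one = 1 << frac_bits                                   # fixed-point 1.0
    atan_tab = [int(round(math.atan(2.0 ** -i) * one)) for i in range(frac_bits)]
    gain = 1.0
    for i in range(frac_bits):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))           # pre-scale by 1/K
    x, y, z = int(round(gain * one)), 0, int(round(theta * one))
    for i in range(frac_bits):                             # adds and shifts only
        d = 1 if z >= 0 else -1
        x, y, z = x - d * (y >> i), y + d * (x >> i), z - d * atan_tab[i]
    return x / one, y / one                                # ~(cos, sin)

print(cordic_sincos(0.5, 16))  # ~(0.8776, 0.4794)
print(cordic_sincos(0.5, 8))   # same datapath, coarser precision
```

The same integer hardware (adders, shifters, a small angle ROM) thus serves multiple precisions, which is the property FlexPE-style designs exploit for SIMD multiplexing.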

A plausible implication is that by blurring the distinction between FP32 and INT32 at the execution level, unified units can exploit both the throughput and energy efficiency typically associated with integer operations, while retaining the numerical flexibility required for FP workloads.

2. Hardware Implementations and Microarchitectural Design

Hardware implementations of unified execution units span a variety of devices and platforms:

  • FPGA-Based Soft Processors: Architectures such as eGPU integrate both IEEE754 FP32 and INT32 units within each scalar processor, sharing register files, memory, and control paths. FP32 multiply-add operations are mapped onto dedicated DSP Blocks, while INT32 operations are handled via a mix of soft logic (ALMs) and fractional DSP usage (Langhammer et al., 2023).
  • SIMD/MIMD Reconfigurability: FlexPE demonstrates a CORDIC-based ALU architecture reconfigurable at runtime via control signals to support both integer MACs and non-linear FP32-like activation functions (including Sigmoid, Tanh, ReLU, Softmax), with multi-precision SIMD support (4/8/16/32 bits) to optimize pipeline efficiency and area utilization (Lokhande et al., 16 Dec 2024).
  • CPU and GPGPU Architectures: On CPUs or custom GPGPUs, AVX512_4VNNI and similar vector instruction sets provide unified integer FMA pipelines, with the possibility of dynamically mapping FP32 workloads to these paths via casting or intermediate representations (Das et al., 2018).
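
The unified integer-FMA semantics in the last bullet can be modeled in a few lines of NumPy. This is an illustrative emulation of the widen-multiply-accumulate pattern (INT16 products reduced into INT32 lanes), not vendor intrinsics:

```python
import numpy as np

def vnni16_dot(a_i16, b_i16, acc_i32):
    """Model of a VNNI-style fused op: adjacent INT16 x INT16 products
    are summed pairwise into INT32 accumulator lanes."""
    prod = a_i16.astype(np.int32) * b_i16.astype(np.int32)   # widen, then multiply
    pair = prod.reshape(-1, 2).sum(axis=1, dtype=np.int32)   # keep INT32 semantics
    return acc_i32 + pair

a = np.array([30000, -20000, 12345, 7], dtype=np.int16)
b = np.array([2, 3, -4, 5], dtype=np.int16)
print(vnni16_dot(a, b, np.zeros(2, dtype=np.int32)))  # [0, -49345]
```

Mapping FP32 workloads onto such an integer pipeline then reduces to the conversion schemes discussed in Section 4.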

The following table summarizes resource allocation for the unified units in an FPGA eGPU context:

| Functionality | Resource (eGPU) | Implementation Detail |
|---|---|---|
| FP32 multiply-add | DSP Block (per SP) | Hard FP32 MAC, minimal ALM usage |
| INT32 arithmetic | ALMs + half DSP Block | Add/subtract, 16×16 multiply, logic |
| Shared register/memory | M20K embedded memory blocks | 4-read/1-write multi-port design |

Placement and balanced allocation of hard and soft logic resources, along with data-path integration, are critical to sustaining both high frequency (e.g., 770+ MHz on Intel Agilex FPGA (Langhammer et al., 2023)) and resource density for large-scale deployment.

3. Software and Architectural Co-Design: Dual-Issue and Decoupling

The challenges of maximizing throughput for mixed integer/FP workloads arise due to register dependencies, execution bottlenecks, and memory bandwidth contention. Software-hardware co-design strategies have addressed these through:

  • COPIFT Methodology: This involves static analysis and transformation steps—building data flow graphs, partitioning code into independent integer and FP "phases," software pipelining, loop tiling/fission, and exploiting programmable register semantics—to enable effective dual-issue execution on in-order cores without sizeable hardware overhead (Colagrande et al., 26 Mar 2025); the phase split is sketched after this list.
  • ISA Extensions: Modifying or extending the instruction set to decouple dependencies (e.g., mapping FP convert/comparison instructions to the FP register file, eliminating integer register contention on the FP pipeline), thus allowing integer and FP "threads" to proceed concurrently.
  • FREP Loops and SSR: Utilizing hardware loop controllers and stream semantic registers to orchestrate overlapping execution, with first-iteration scheduling setting up a repeatable issue scheme for both domains (Colagrande et al., 26 Mar 2025).
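
The phase-partitioning step at the heart of this approach can be shown schematically. The kernel below is hypothetical, and Python itself gains nothing from the split; the sketch only illustrates the shape of the transformation, after which the integer and FP loops have no cross-dependencies and can be overlapped on dual pipelines:

```python
def fused_kernel(xs):
    # Original form: integer work (index hashing) and FP work (accumulation)
    # alternate within one dependence chain, serializing the two pipelines.
    acc = 0.0
    for i, x in enumerate(xs):
        idx = (i * 2654435761) & 0xFFFF       # integer phase
        acc += x * (1.0 + idx / 65536.0)      # FP phase, depends on idx
    return acc

def fissioned_kernel(xs):
    # COPIFT-style fission: the integer phase is materialized first, then the
    # FP phase consumes it; a dual-issue core can overlap the two loops.
    idxs = [(i * 2654435761) & 0xFFFF for i in range(len(xs))]   # integer-only
    acc = 0.0
    for x, idx in zip(xs, idxs):                                  # FP-only
        acc += x * (1.0 + idx / 65536.0)
    return acc

xs = [0.5, 1.25, -2.0]
assert fused_kernel(xs) == fissioned_kernel(xs)  # identical FP operation order
```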

Empirical measurements indicate an average speedup of 1.47× and a peak IPC of 1.75 over optimized RV32G baselines, with an average energy efficiency gain of 1.37× that rises to 1.93× in certain kernels such as exponentiation—demonstrating considerable gains in constrained hardware environments.

4. Precision Management and Mixed-Precision Techniques

Unification is further enabled by the careful handling of mixed or reduced precision, often leveraging integer units for efficiency but with mechanisms to maintain FP32-equivalent accuracy:

  • Dynamic Fixed Point (DFP) Arithmetic: Here, all elements of a tensor share a common exponent, so that INT16 FMA pipelines can multiply inputs and accumulate into an INT32 accumulator, while the shared exponent provides dynamic range. Conversion between formats is governed by equations such as $f_n = i_n \times 2^{E_s}$, with exponents adding under multiplication: $E_s^{(ab)} = E_s^a + E_s^b$ (Das et al., 2018); a NumPy sketch follows this list.
  • Error-Corrected Mixed-Precision: On systems such as NVIDIA Tensor Cores, where input matrices are "cast" to FP16 (or TF32) and accumulated in FP32, precision loss is mitigated by partitioning each input into a main component and a delta, reconstructing higher-accuracy results via multi-term correction and selective FP32 accumulation outside the core (e.g., $C_{FP32} \leftarrow A_{FP16} \cdot B_{FP16} + (\Delta A_{FP16} \cdot B_{FP16} + \dots)/2^{11}$) (Ootomo et al., 2022); an emulation appears at the end of this section.
  • Adaptive SIMD Multiplexing: In FlexPE, pipeline width and functional-unit allocation are adjusted at runtime via control signals, supporting 16× throughput at 4-bit precision, 8× at 8-bit, and down to 1× at 32-bit, all over the same hardware datapath (Lokhande et al., 16 Dec 2024).
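
A minimal NumPy sketch of the DFP rules above (helper names are illustrative; the cited work adds blocking and overflow analysis on top of this core idea):

```python
import numpy as np

def to_dfp(t, frac_bits=15):
    """Quantize a tensor to dynamic fixed point: INT16 mantissas plus one
    shared exponent E_s per tensor, so that f_n ~= i_n * 2**E_s."""
    max_abs = float(np.max(np.abs(t)))
    e_s = int(np.floor(np.log2(max_abs))) + 1 - frac_bits if max_abs > 0 else 0
    ints = np.clip(np.round(t / 2.0 ** e_s), -32768, 32767).astype(np.int16)
    return ints, e_s

def dfp_dot(a16, e_a, b16, e_b):
    """INT16 x INT16 products accumulated in INT32; exponents add:
    E_s^(ab) = E_s^a + E_s^b. Overflow handling is deferred to Section 5."""
    acc = np.sum(a16.astype(np.int32) * b16.astype(np.int32), dtype=np.int32)
    return float(acc) * 2.0 ** (e_a + e_b)

a = np.array([0.5, -1.75, 3.0], dtype=np.float32)
b = np.array([2.0, 0.25, -0.125], dtype=np.float32)
a16, e_a = to_dfp(a)
b16, e_b = to_dfp(b)
print(dfp_dot(a16, e_a, b16, e_b), float(a @ b))  # 0.1875 0.1875
```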

These approaches enable energy and area savings (e.g., 8.42 GOPS/W (Lokhande et al., 16 Dec 2024), 1.8× throughput improvement for CNN training (Das et al., 2018)) while making unified execution units viable for edge and high-performance cloud environments.
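
The error-correction scheme in the second bullet can be emulated with NumPy's float16 (FP32 matmuls stand in for Tensor-Core FP16-multiply/FP32-accumulate here, so the output illustrates the error behavior, not device performance):

```python
import numpy as np

def split_fp16(a):
    """Split FP32 input into an FP16 main part and an FP16 delta,
    pre-scaled by 2^11 so the delta survives the cast."""
    hi = a.astype(np.float16)
    dl = ((a - hi.astype(np.float32)) * 2.0 ** 11).astype(np.float16)
    return hi, dl

def corrected_matmul(a, b):
    f32 = np.float32
    ah, ad = split_fp16(a)
    bh, bd = split_fp16(b)
    main = ah.astype(f32) @ bh.astype(f32)              # A_hi * B_hi
    corr = (ad.astype(f32) @ bh.astype(f32)
            + ah.astype(f32) @ bd.astype(f32)) / 2.0 ** 11
    return main + corr                                   # FP32 accumulation throughout

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)
ref = a @ b
naive = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
print(np.max(np.abs(naive - ref)))                    # ~1e-2: plain FP16 arithmetic
print(np.max(np.abs(corrected_matmul(a, b) - ref)))   # orders of magnitude smaller
```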

5. Overflow Management, Memory Bandwidth, and Dataflow

Efficient unified execution demands not only compute but robust support for numerical integrity and data movement:

  • Overflow Control: Accumulating products of INT16 or INT32 values risks exceeding accumulator word widths. Dynamic partial summing (with downconversion to FP32), input data shifting (e.g., all inputs shifted to DFP15), and judicious register blocking keep overflow probability at acceptable levels while limiting overhead to less than 3% (Das et al., 2018); the pattern is sketched after this list.
  • Memory Layout and Data Path: In architectures like eGPU, careful placement of shared memory (e.g., M20K blocks for 4-read/1-write port operation) proximate to both FP and INT execution units minimizes latency and maximizes data reuse (Langhammer et al., 2023).
  • Intelligent Data Movement: The adoption of multiprecision SIMD systolic arrays and local buffer exploitation, as in FlexPE, yields substantial reductions in DMA bandwidth—up to 62× for feature maps and 371× for filters in DNN inference (Lokhande et al., 16 Dec 2024).
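
The partial-summing pattern in the first bullet, sketched with hypothetical sizes: INT32 accumulation is confined to a block, and each block's partial sum is flushed into an FP32 running total.

```python
import numpy as np

def blocked_dfp_dot(a16, b16, e_ab, block=64):
    """Overflow-managed dot product: INT16 x INT16 products accumulate in
    INT32 only within a block; each block's partial sum is downconverted
    into an FP32 running total. The block size bounds accumulator growth;
    real designs pick it (together with input shifts, e.g. to DFP15) so
    the residual overflow probability is acceptably small."""
    total = np.float32(0.0)
    for s in range(0, len(a16), block):
        prods = a16[s:s+block].astype(np.int32) * b16[s:s+block].astype(np.int32)
        partial = prods.sum(dtype=np.int32)                 # bounded INT32 accumulation
        total += np.float32(float(partial) * 2.0 ** e_ab)   # downconvert per block
    return total

rng = np.random.default_rng(1)
a16 = rng.integers(-2048, 2048, 4096, dtype=np.int16)
b16 = rng.integers(-2048, 2048, 4096, dtype=np.int16)
exact = float(np.dot(a16.astype(np.int64), b16.astype(np.int64)))
print(blocked_dfp_dot(a16, b16, e_ab=0), exact)  # FP32 total tracks the exact sum
```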

A plausible implication is that these optimizations are essential for sustaining high throughput on unified execution units, particularly in workloads with non-uniform precision or irregular compute-to-bandwidth ratios.

6. Applications Across Domains

Unified INT32/FP32 execution units are relevant in:

  • Deep Learning and Vision: Mixed-precision integer and floating-point arithmetic for convolution, GEMM, and non-linear activations, with demonstrated success on ResNet-50 and VGG-16 for both training and inference (Das et al., 2018, Lokhande et al., 16 Dec 2024).
  • Scientific Computing: Error-corrected matrix multiplication on Tensor Cores achieves both high performance (e.g., 51 TFlop/s vs. 19.5 TFlop/s theoretical FP32 peak) and FP32 accuracy, supporting iterative solvers and Fourier transforms (Ootomo et al., 2022).
  • Edge and Embedded Systems: Flexible, energy-efficient cores for signal processing, FFTs, matrix decompositions, and wireless communication algorithms, where balancing FP and INT computation is necessary (Langhammer et al., 2023, Lokhande et al., 16 Dec 2024).
  • General-Purpose Accelerator Arrays: Area- and power-constrained environments where even modest increases in per-core throughput and energy efficiency (1.47× speedup, 1.93× energy savings) scale across large parallel deployments (Colagrande et al., 26 Mar 2025).

7. Verification, Compiler, and Programming Aspects

Verification and programmability are increasingly important:

  • Formally Verified Execution: Use of FPUs for integer division has been formally verified in Coq and integrated into CompCert, ensuring correctness in scenarios previously vulnerable to non-constant-time and side-channel risks (Monniaux et al., 2022); a sketch of the general shape follows this list.
  • Programmability: API and compiler optimizations, such as partitioning kernels for concurrent integer/FP execution, as well as runtime-configurable arithmetic modes, advance the usability of unified units across domains.
  • ISA and Microarchitecture: RISC-V extensions, load-store elimination, and loop fission techniques explicitly support the unification of arithmetic domains at the ISA and microarchitectural level (Colagrande et al., 26 Mar 2025).
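
The FPU-division approach in the first bullet has a compact general shape: divide in FP64 (which represents every 32-bit integer exactly), then repair the at-most-off-by-one quotient with integer operations. The sketch below illustrates that shape only; it is not the formally verified CompCert code:

```python
def udiv32_via_fpu(n, d):
    """Unsigned 32-bit division via the FP64 divider plus an integer fix-up.
    FP64 holds any uint32 exactly, and the rounded FP quotient is off by at
    most one, so a remainder check restores exactness. Illustrative only."""
    assert 0 <= n < 2**32 and 0 < d < 2**32
    q = int(float(n) / float(d))   # FP64 divide, truncate toward zero
    r = n - q * d                  # integer remainder for the chosen q
    if r < 0:                      # quotient one too high
        q -= 1
    elif r >= d:                   # quotient one too low
        q += 1
    return q

for n, d in [(2**32 - 1, 3), (10, 3), (7, 7), (2**31, 2**16 + 1)]:
    assert udiv32_via_fpu(n, d) == n // d
```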

These features ensure both confidence in correctness and the ability to exploit architectural flexibility via standard programming models.


Unified INT32/FP32 execution units represent a convergence of architectural, microarchitectural, and software-hardware co-design innovations. Through methods including dynamic fixed-point arithmetic, CORDIC-based unified datapaths, software-enabled dual-issue scheduling, and tightly integrated resource management, they provide a foundation for scalable, efficient, and verifiably correct computation across AI, HPC, and embedded domains. Experimental and theoretical results indicate that such architectures not only sustain high throughput and energy efficiency but can, when carefully optimized, deliver accuracy and performance on par with or exceeding legacy, domain-specific pipelines.