Numerical Formats & Hardware Architectures

Updated 16 June 2026

Numerical formats and hardware architectures are systems that use specialized number representations to optimize computational precision, performance, and energy efficiency across various applications.
These approaches include low-bit FP8, block-scaled integers, and hybrid methods that enable cost-effective, high-throughput arithmetic through shared exponent and integer pipelines.
Co-design strategies drive innovations in rounding, normalization, and conversion techniques, ensuring interoperability and controlled error bounds in modern digital systems.

Numerical formats and hardware architectures are inseparably linked, as the precise encoding of numbers in digital systems critically determines not only computational semantics but also hardware cost, performance, and energy efficiency. Over the past decade, an explosion of specialized number representations—ranging from ultra-low-bit floating-point mini-formats to block-scaled and residue-based systems—has been accompanied by dedicated hardware microarchitectures finely tuned to these formats. The co-design of numerical representation and architecture underpins advances in AI, scientific computing, and embedded signal processing, balancing density, speed, power, and accuracy requirements in modern systems.

1. Low-Bit Floating-Point and Integer Formats

Several recent developments have focused on compact representations, notably 8-bit floating-point (FP8), block-scaled integers ("microscaling"), and hybrids.

FP8 Standards: Common variants include E5M2 (5 exponent bits, 2 mantissa bits) and E4M3 (4 exponent bits, 3 mantissa bits), each using a single sign bit, a biased exponent, and "hidden-1" encoding. E5M2 covers exponents $e\in[-15,+16]$ with bias $b=15$; E4M3 spans $e\in[-7,+8]$ with bias $b=7$. The $p$-bit mantissa field $M$ is interpreted as the fraction $m = M / 2^p \in [0,1)$ (Lindberg et al., 2024, Noune et al., 2022).
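The field layout above can be sketched as a small decoder. This is a minimal illustration, assuming every exponent code is an ordinary biased exponent (which reproduces the exponent ranges quoted above); the function names are hypothetical, and the zero, subnormal, Inf, and NaN encodings that deployed FP8 variants reserve are deliberately ignored.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one FP8 code: sign bit, biased exponent, hidden-1 mantissa.

    Assumption: no special codes (zero, subnormal, Inf, NaN) are handled.
    """
    assert exp_bits + man_bits == 7 and 0 <= byte <= 0xFF
    sign = (byte >> 7) & 1
    exp_field = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man_field = byte & ((1 << man_bits) - 1)
    m = man_field / (1 << man_bits)   # m = M / 2^p, in [0, 1)
    e = exp_field - bias              # unbiased exponent
    return (-1.0) ** sign * (1.0 + m) * 2.0 ** e


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15)


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7)
```

For example, the bit pattern `0b00111100` decodes to 1.0 under E5M2 and to 1.5 under E4M3, since the same eight bits are split differently between exponent and mantissa.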

Block-Scaled and Microscaling ("MX") Formats: Here, tensor values are partitioned into blocks (commonly of size 32), each sharing a single exponent or scale. For MXInt8, each block of 32 elements stores one shared 8-bit exponent plus a sign bit and a 7-bit mantissa per element, which gives the dynamic range of floating-point at an average of $(32\cdot 8 + 8)/32 = 8.25$ bits/element. Each element is reconstructed as $x_i = (-1)^{s_i} M_i 2^{e-\text{bias}-m}$, with per-element sign $s_i$ and mantissa $M_i$ and shared exponent $e$ (Cheng et al., 2023, Su et al., 25 Jun 2025, Khodamoradi et al., 2024). The Blackwell architecture natively supports such block-scaled FP8 formats with hardware decode and multiplication (Su et al., 25 Jun 2025).
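A block-scaled integer quantizer in this spirit can be sketched as follows. This is an illustrative sketch rather than the MX specification: the choice of six fractional mantissa bits, the symmetric clamp to ±127, and all function names are assumptions made for the example.

```python
import math

BLOCK_SIZE = 32      # elements per block, as in the text
MANT_FRAC_BITS = 6   # assumption: 7-bit magnitude with 6 fractional bits


def quantize_block(values):
    """Return (shared_exp, ints) with values[i] ~= ints[i] * 2**(shared_exp - MANT_FRAC_BITS)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    shared_exp = math.frexp(amax)[1] - 1            # floor(log2(amax))
    scale = 2.0 ** (shared_exp - MANT_FRAC_BITS)    # power-of-two scale for the whole block
    ints = [max(-127, min(127, round(v / scale))) for v in values]
    return shared_exp, ints


def dequantize_block(shared_exp, ints):
    scale = 2.0 ** (shared_exp - MANT_FRAC_BITS)
    return [q * scale for q in ints]


def average_bits(block_size=BLOCK_SIZE, elem_bits=8, scale_bits=8):
    """Storage cost per element once the shared scale is amortized over the block."""
    return (block_size * elem_bits + scale_bits) / block_size
```

With the defaults, `average_bits()` evaluates to 8.25, matching the figure quoted above. Values that are exact multiples of the block scale round-trip without error; all others incur at most half a scale step of rounding error, plus clipping for the largest magnitudes.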

Fine-Grained Integer & Blockwise Quantization: INT8, INT4, and low-bit fixed-point are commonly used both with per-tensor and per-block scales. Accuracy and efficiency trade off against block size and outlier statistics; hardware can exploit simple barrel shifters for power-of-two scaling (Chen et al., 29 Oct 2025).

Alternative Formats: Posits and takums are alternatives to IEEE-754 offering variable precision; takums achieve higher consecutive integer coverage at equivalent bitwidth and superior representational strength relative to both floating-point and posit at moderate widths (Hunhold, 2024).

2. Hardware Arithmetic Pipelines for Reduced-Precision and Custom Formats

Integer-Based Floating-Point Operations

Approximate FP8 operations may be mapped onto integer pipelines:

For E5M2 and E4M3, bitfields are reinterpreted as signed log-domain integers, e.g., $\hat{X} = E \cdot 2^{p-1} + M - B$.
Multiplication, division, reciprocal, square-root, and related ops are then integer addition/subtraction/shift with a carry-in computed by a small Boolean logic network per rounding mode.
Rounding-corrected results are achieved via conditional “carry-in,” which fits in a single FPGA LUT.
ASICs and FPGAs: Designs reduce area and critical path length. For example, E4M3 multipliers show $\approx$ 50% area savings and run at roughly 50% higher clock frequency than classic FP8 units; on FPGAs, LUT count is halved and clock speed improves by 30–50% (Lindberg et al., 2024).

This approach yields regular datapaths, no need for normalization, and vectorizable integer parallelism, with a tradeoff of bounded approximation error (1–2 ulp).
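The core idea of multiplying through integer addition can be illustrated with a simplified sketch. It assumes the E4M3 layout with bias 7, treats the exponent and mantissa fields as one integer, and omits the rounding-mode carry-in described above, so it is a plain logarithmic approximation and not the cited design. The function names are hypothetical.

```python
def approx_fp8_mul(a: int, b: int, man_bits: int = 3, bias: int = 7) -> int:
    """Approximate the FP8 product of two codes with a single integer addition."""
    sign = ((a ^ b) >> 7) & 1
    # Exponent and mantissa fields read together as one integer: E * 2^man_bits + M
    xa, xb = a & 0x7F, b & 0x7F
    # Adding the log-domain codes multiplies the values; remove one copy of the bias
    y = xa + xb - (bias << man_bits)
    y = max(0, min(0x7F, y))          # saturate instead of wrapping around
    return (sign << 7) | y


def decode(byte: int, man_bits: int = 3, bias: int = 7) -> float:
    """Helper for inspection only; ignores special codes (assumption)."""
    sign = (byte >> 7) & 1
    exp_field = (byte & 0x7F) >> man_bits
    man_field = byte & ((1 << man_bits) - 1)
    return (-1.0) ** sign * (1.0 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)
```

In this sketch, 1.5 × 2.0 (codes `0x3C` and `0x40`) yields exactly 3.0, while 1.5 × 1.5 yields 2.0 instead of 2.25. The error arises because the product of the two mantissa fractions is dropped, which is the kind of bounded approximation error that a rounding-corrected carry-in is meant to reduce.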

Block-Scaling Accelerators and Shared-Exponent MACs

In MX/block-scaled hardware units:

Multiplications within a block leverage shared exponent arithmetic, eliminating per-element exponent alignment and normalization; only one exponent adder per block.
Mantissas are handled as fixed-point, and only a single shared scale conversion is needed per block.
Resource sharing extends to SIMD units, dense systolic arrays, and PIM/NPU-PIM constructs, yielding 3–8× improvement in throughput and up to 50% area savings at iso-power compared to standard FP16 units (Cheng et al., 2023, Khodamoradi et al., 2024, Chen et al., 10 Nov 2025).
Implementation in PIM (DRAM-based compute units) becomes practical: low-bit (e.g., 4–8 bits) multipliers and accumulators mapped per block, with on-the-fly decode (Chen et al., 10 Nov 2025).
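The shared-exponent multiply-accumulate pattern described in this list can be sketched in a few lines. The representation (a list of integer mantissas plus one power-of-two exponent per block) and the name `block_dot` are assumptions for illustration, not a description of any specific accelerator.

```python
def block_dot(mant_a, exp_a, mant_b, exp_b):
    """Dot product of two blocks, each stored as integer mantissas plus one shared exponent.

    Element i of block A has value mant_a[i] * 2**exp_a, and likewise for B.
    """
    assert len(mant_a) == len(mant_b)
    acc = 0
    for qa, qb in zip(mant_a, mant_b):
        acc += qa * qb             # integer MAC, no per-element exponent alignment
    shared = exp_a + exp_b         # the single exponent addition for the whole block
    return acc * 2.0 ** shared     # one scale conversion at the end
```

For instance, `block_dot([64, 32, -16], -6, [2, 4, 8], -1)` represents the vectors (1, 0.5, -0.25) and (1, 2, 4) and returns 1.0. The inner loop touches only integers, which is why the per-element datapath can be a fixed-point multiplier and adder.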

Hybrid/Alternative Architectures

Residue-Floating Hybrid: The HRFNA system marries residue number carry-free data paths (vectorized modular multipliers/adders) with global floating-point scaling (lightweight exponent update; normalization only rarely required, triggered via magnitude detection). On FPGA, this yields 2.4× throughput, 38–55% LUT reduction, and 1.9× energy efficiency improvement for scientific applications (FP32 baseline) (Darvishi, 21 Jan 2026).

Specialized Codecs for Posit/Takum: Enhanced code/decode pipelines are designed for formats with variable regime fields and characteristics (e.g., takum), leveraging fixed-width leading-one detectors, concurrent round-up/down, and minimal carry-chain lengths to optimize scaling with bitwidth (Hunhold, 2024).

3. Rounding, Normalization, and Precision in Hardware

Domain-Specific Rounding Semantics

For FP8 (E5M2, E4M3), the IEEE rounding modes (RN-even, RN-away, RZ, directed) are fully supported, typically realized as an integer sum plus a Boolean-corrected carry-in. Simple modes (RN-z, RZ) are often free; correct tie handling requires more complex Boolean terms (Lindberg et al., 2024).
In block-scaled or MX architectures, blockwise quantizers may introduce gradient bias in training due to asymmetry in integer codes; symmetric clipping corrects this, restoring unbiased convergence in low-bit regimes (Chen et al., 29 Oct 2025).
NVIDIA and AMD GPUs employ "deferred normalization": inputs are promoted (e.g., FP16/bfloat16 → FP32), multiple FMA accumulate cycles proceed without normalization, and only after N terms (e.g., N=8) does a single normalization and rounding step occur. Rounding for FP32 output is truncation for throughput reasons; for FP16 output, IEEE round-to-nearest is preserved (Khattak et al., 3 Sep 2025).
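The numerical effect of rounding once per group of terms, rather than after every addition, can be illustrated with a software model. This is only a sketch under stated assumptions: Python's binary64 float stands in for the wider internal accumulator, binary32 rounding uses round-to-nearest through the `struct` module (not the truncation mentioned above), and the function names are hypothetical.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate_stepwise(acc: float, products) -> float:
    """Round to binary32 after every single addition."""
    for p in products:
        acc = to_f32(acc + p)
    return acc


def accumulate_deferred(acc: float, products, n: int = 8) -> float:
    """Add n terms in the wider format, then round to binary32 once per group."""
    for i in range(0, len(products), n):
        wide = acc
        for p in products[i:i + n]:
            wide += p              # no rounding to binary32 inside the group
        acc = to_f32(wide)
    return acc
```

Starting from 1.0 and adding eight terms of 2^-25, the stepwise version stays at 1.0 because each term is below half a binary32 ulp, whereas the deferred version returns 1 + 2^-22, which is the exact sum. Deferring the rounding step therefore changes results as well as hardware cost.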

4. Conversion, Interoperability, and Error Bounds

Radix Conversion: Mixed-radix (binary↔decimal) conversion is handled by straight-line, non-iterative hardware pipelines: exponent conversion via a multiply table and small LUT, mantissa scaling via multiplier LUTs, a squaring tree, and multiplier-reduction. This supports exact exponents with mantissa error strictly bounded to <0.5 ulp, with sub-20 cycle, one-result-per-cycle hardware pipelines (Kupriianova et al., 2013).

Integral Representations: The minimum bitwidth to exactly represent any integer […] and the largest consecutive integer covered differ widely by format:

IEEE-754 (fixed exponent+mantissa): Largest consecutive integer […].
Posit (es=2): […].
Takum (linear): […], which overtakes posit and matches or exceeds IEEE at moderate bitwidth (Hunhold, 2024).

Hardware complexity follows: IEEE-754 remains lowest cost; posit and takum require additional regime/characteristic decode logic.
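For the IEEE-754 case, the consecutive-integer limit can be checked empirically. The sketch below relies on a standard property of binary floating-point, not on an expression taken from the cited work: with f stored fraction bits (and sufficient exponent range), every integer up to 2^(f+1) is exactly representable and 2^(f+1) + 1 is not. The helper names are hypothetical, and the posit and takum cases are not covered.

```python
import struct


def roundtrip(x: int, fmt: str) -> float:
    """Store an integer in the given IEEE-754 struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, float(x)))[0]


def largest_consecutive_integer(fraction_bits: int) -> int:
    """Largest n such that every integer in [0, n] is exactly representable."""
    return 2 ** (fraction_bits + 1)


half_limit = largest_consecutive_integer(10)     # binary16: 10 fraction bits
single_limit = largest_consecutive_integer(23)   # binary32: 23 fraction bits

# Every integer up to the limit survives a round trip through binary16
half_ok = all(roundtrip(k, "<e") == k for k in range(half_limit + 1))
```

Here `"<e"` and `"<f"` are the `struct` format codes for binary16 and binary32. The first integer lost is `half_limit + 1 = 2049` for binary16 and `2**24 + 1` for binary32.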

5. Application-Driven Format Selection and Performance

Deep Learning: Low-bit FP8 (E4M3/E5M2), block-scaled MXInt8, and 4–6 bit hybrids with per-block scale dominate both inference and training on LLMs and vision models (Cheng et al., 2023, Khodamoradi et al., 2024, Su et al., 25 Jun 2025). For coarse granularity, floating-point is preferred; for fine-grained blocks (e.g., block-32), INT8 with a power-of-two scale excels in both accuracy and hardware metrics, surpassing FP8 (Chen et al., 29 Oct 2025).

Scientific Computing: Custom FP16 variants (e.g., raising the FP16 bias, or reallocating exponent bits to the mantissa) and low-range posits are tailored to the narrow dynamic range of lattice Boltzmann method (LBM) and similar solvers, yielding up to 1.9× speedup versus FP32 and negligible error compared to FP64 (Lehmann et al., 2021). Mixed storage, with 16-bit density distribution functions (DDFs) and 32-bit computation, is standard.

Heterogeneous and Emerging Architectures: Dataflow hardware, FPGAs, and PIM platforms enable extreme pipelining, maximal SIMD reuse, and hybrid INT/FP pipelines, favoring blockwise or residue-based representations for maximal area and energy efficiency (Stylianou et al., 2023, Chen et al., 10 Nov 2025, Darvishi, 21 Jan 2026).

6. Energy, Throughput, and Cost Trade-Offs

Reduced-precision floating-point (e.g., FP8) and blockwise INT8 pipelines halve area and power at equivalent MAC throughput vs. FP16 or dense FP32, with end-to-end speedups of 2–4× on leading-edge silicon (Noune et al., 2022, Cheng et al., 2023, Chen et al., 10 Nov 2025).
Barrel-shifter-based power-of-two scaling and shared exponent amortization tangibly reduce control logic and switching activity.
Highly configurable FPGAs accept any bitwidth, favoring fixed-point or low-bit custom FP for DSP/vector kernels and flexible block scaling for AI/ML.
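Power-of-two scaling reduces to shifts, which is what makes a barrel shifter sufficient. A minimal sketch follows, assuming integer accumulators and round-half-up on right shifts; the name `scale_pow2` and the rounding choice are illustrative assumptions.

```python
def scale_pow2(x: int, shift: int) -> int:
    """Multiply an integer by 2**shift using only shifts and one addition.

    Positive shift: left shift (exact). Negative shift: arithmetic right shift
    with round-half-up, implemented by adding half of the divisor first.
    """
    if shift >= 0:
        return x << shift
    s = -shift
    return (x + (1 << (s - 1))) >> s
```

For example, `scale_pow2(5, 2)` gives 20 and `scale_pow2(5, -1)` gives 3 (2.5 rounded up). No multiplier is involved, in contrast to rescaling by an arbitrary floating-point factor.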

Summary Table: Hardware Area/Energy for Key Formats

Format	Area (norm. to FP8 MAC)	Energy (norm.)	Notes
MXFP8	1.00×	1.00×	E4M3+UE8M0 block-32
MXINT8	0.79×	0.63×	INT8+UE8M0 block-32
NVFP4	0.54×	0.55×	E4M3+E4M3, block-16
NVINT4	0.38×	0.34×	INT4+E4M3, block-16

(Chen et al., 29 Oct 2025, Cheng et al., 2023)

7. Conclusions and Future Directions

The ongoing co-development of numerical formats and hardware architectures is accelerating application performance and enabling power-efficient, scalable AI and scientific computation. Key trends include block-scaled formats (enabling low-bit, high-range arithmetic), integer-based approximate computation for ultra-compact units, hybrid architectures blending residue number systems with floating exponents, and hardware support for configurable rounding and normalization. There is no “one-format-fits-all”: detailed profiling, application-level error/tolerance analysis, and benchmarking across silicon targets remain essential to select optimal representations and microarchitectures for each workload domain (Sentieys et al., 2022, Chen et al., 29 Oct 2025, Cheng et al., 2023, Darvishi, 21 Jan 2026).