Custom Floating-Point Formats

Updated 5 March 2026

Custom floating-point formats are user-defined numerical representations that split a fixed bit-width into sign, exponent, and mantissa fields to meet application-specific constraints.
They enable designers to balance dynamic range and precision through configurable exponent and mantissa sizes, enhancing performance in deep learning, scientific computing, and signal processing.
Recent research shows that adopting custom FP formats can lead to significant improvements in energy efficiency and throughput while preserving computational accuracy.

Custom floating-point (FP) formats are user- or domain-defined representations of real numbers whose parameters, such as exponent width, mantissa width, and encoding rules, are set according to application-specific constraints. These formats let designers optimize for dynamic range, precision, energy, area, or application-level error, in contrast to the fixed formats of the IEEE 754 standard. Research across deep learning, scientific computing, hardware design, and software vectorization is active, with established methodologies for design, analysis, and hardware/software co-design.

1. Format Definitions and Parameterization

A custom floating-point format typically splits a total bit-width $N$ into fields:

Sign: $1$ bit (optional for nonnegative-only signals)
Exponent: $E$ bits (controls dynamic range)
Mantissa (Fraction): $M$ bits (controls precision)
Bias: Typically $2^{E-1}-1$ (but can be tunable)

Normalized numbers are interpreted as:

$x = (-1)^s \times 2^{e-b} \times \left(1 + \frac{f}{2^M}\right)$

where $s$ is the sign bit, $e$ the stored exponent, $b$ the bias, and $f$ the stored mantissa (fraction) field read as an unsigned integer. Subnormals, NaN, $\pm\infty$, and other IEEE-754 special-value handling may be omitted for efficiency in custom variants, or, as with IEEE-style custom formats, retained for compatibility (Tambe et al., 2019, Mach et al., 2020, Sentieys et al., 2022, Bertaccini et al., 2022).
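The interpretation rule above can be made concrete with a small decoder. This is a minimal sketch, assuming a word laid out as sign, exponent, mantissa from most to least significant bit and a variant that omits subnormals, NaN, and infinities; the function name `decode_normal` and its parameters are illustrative, not taken from any of the cited works.

```python
from typing import Optional


def decode_normal(bits: int, exp_bits: int, man_bits: int,
                  bias: Optional[int] = None) -> float:
    """Decode a normalized value from a 1 + E + M bit word.

    Assumption: every bit pattern is a normal number (no subnormals,
    NaN, or infinities), as in stripped-down custom variants.
    """
    if bias is None:
        bias = (1 << (exp_bits - 1)) - 1          # default bias 2^(E-1) - 1
    f = bits & ((1 << man_bits) - 1)              # fraction field
    e = (bits >> man_bits) & ((1 << exp_bits) - 1)  # stored exponent
    s = (bits >> (exp_bits + man_bits)) & 1       # sign bit
    magnitude = 2.0 ** (e - bias) * (1.0 + f / (1 << man_bits))
    return -magnitude if s else magnitude


# 8-bit word with E = 4, M = 3: sign | eeee | fff
print(decode_normal(0b0_1000_100, exp_bits=4, man_bits=3))  # 3.0
```

Passing an explicit `bias` shifts the whole representable window without changing the bit layout, which is the knob that tunable-bias formats adjust.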

Nonlinear and tapered-precision encodings such as posit and takum decouple dynamic range and precision, allowing regime/exponent fields of variable or fixed width to "taper" the mantissa precision as a function of magnitude (Hunhold, 18 Mar 2025, Johnson, 2018, Hunhold et al., 29 Apr 2025, Luo et al., 2024). Some formats adapt field boundaries dynamically for entropy coding (EFloat) (Bordawekar et al., 2021).
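To illustrate how a regime field tapers precision, the following is a minimal sketch of a posit decoder. It assumes the common posit layout (sign, variable-length regime, up to `es` exponent bits, remaining bits as fraction) with `es = 2` as the default; the function name is hypothetical, and takum uses a different encoding that is not covered here.

```python
def decode_posit(bits: int, nbits: int = 8, es: int = 2) -> float:
    """Decode an nbits-wide posit with es exponent bits into a float."""
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float("nan")                  # NaR (not a real)
    sign = bits >> (nbits - 1)
    if sign:
        bits = (-bits) & mask                # negative posits: two's complement
    body = bits & ((1 << (nbits - 1)) - 1)   # everything after the sign bit
    remaining = nbits - 1
    first = (body >> (remaining - 1)) & 1
    run = 0
    # Regime: a run of identical bits, ended by the opposite bit or the word end.
    while remaining > 0 and ((body >> (remaining - 1)) & 1) == first:
        run += 1
        remaining -= 1
    if remaining > 0:
        remaining -= 1                       # skip the terminating bit
    k = run - 1 if first else -run
    # Exponent: up to es bits; bits cut off by the word end count as zero.
    exp_avail = min(es, remaining)
    e = (body >> (remaining - exp_avail)) & ((1 << exp_avail) - 1)
    e <<= es - exp_avail
    remaining -= exp_avail
    # Fraction: whatever is left; long regimes leave fewer fraction bits.
    frac = body & ((1 << remaining) - 1)
    value = 2.0 ** (k * (1 << es) + e) * (1.0 + frac / (1 << remaining))
    return -value if sign else value


print(decode_posit(0x44))  # 1.5 (three fraction bits available near 1)
print(decode_posit(0x7E))  # 1048576.0 (2**20, no fraction bits left)
```

Near unit magnitude the regime is short and three fraction bits remain in an 8-bit word; toward the extremes the regime consumes the word and the fraction disappears, which is the tapering described above.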

Parameter configuration may be per-tensor, per-layer, per-block, or reconfigurable at runtime, as in FFP8, AdaptivFloat, or run-time reconfigurable FPGA/ASIC multipliers (Huang et al., 2021, Tambe et al., 2019, Arish et al., 2019).

2. Dynamic Range, Precision, and Format Design Rules

The tradeoff between exponent width (dynamic range) and mantissa width (precision) underpins custom FP design.

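Dynamic Range: normalized magnitudes span $\left[2^{e_{\min}-b},\ (2 - 2^{-M}) \times 2^{e_{\max}-b}\right]$, where $e_{\min}$ and $e_{\max}$ are the smallest and largest stored exponents used for normal numbers ($1$ and $2^E-2$ in IEEE-like formats); subnormals extend this toward zero (Sentieys et al., 2022, Mach et al., 2020).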
Unit in the Last Place (ULP): At unit magnitude, $\mathrm{ULP} = 2^{-M}$; for numbers outside the $[1,2)$ range, the ULP scales with $2^{e-b}$ (Sentieys et al., 2022).
Rounding Error: With round-to-nearest-even, the maximum error per operation is $\tfrac{1}{2}$ ULP.
Field Splitting Principles: For a given word length $N$, moving one bit from the mantissa to the exponent doubles the number of representable binades but halves the mantissa resolution. Applications with widely spread value distributions need a comparatively wide exponent field even in 8-bit formats. Application-specific error budgets dictate the required $M$, which is sometimes only a few bits (Sentieys et al., 2022, Bertaccini et al., 2022).
Layer/Tensor Specialization: Modern DNNs and signal domains often benefit from per-layer or per-tensor FP format specialization, exploiting narrower dynamic range in intermediate tensors to increase mantissa bits (Tambe et al., 2019, Huang et al., 2021).
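The range and precision rules above can be tabulated for any $(E, M)$ pair. The sketch below assumes the normalized interpretation from Section 1; the `ieee_like` flag (an illustrative name) reserves the all-zeros and all-ones exponent codes as IEEE 754 does, while custom variants that drop special values can use every code for normal numbers.

```python
def format_stats(exp_bits: int, man_bits: int, bias=None, ieee_like: bool = True):
    """Return range and precision figures for a sign/E/M format."""
    if bias is None:
        bias = (1 << (exp_bits - 1)) - 1
    # IEEE-like formats reserve stored exponent 0 (zero/subnormals)
    # and the all-ones code (infinities/NaN).
    e_min = 1 if ieee_like else 0
    e_max = (1 << exp_bits) - (2 if ieee_like else 1)
    return {
        "min_normal": 2.0 ** (e_min - bias),
        "max_normal": (2.0 - 2.0 ** -man_bits) * 2.0 ** (e_max - bias),
        "ulp_at_one": 2.0 ** -man_bits,
        # Round-to-nearest bounds the relative error by half an ULP at 1.
        "max_rel_error": 2.0 ** -(man_bits + 1),
    }


for e, m in [(5, 10), (5, 2), (4, 3)]:
    print((e, m), format_stats(e, m))
```

With `exp_bits=5, man_bits=10` this reproduces the familiar binary16 limits (largest normal 65504). Note that some deployed 8-bit formats reserve fewer codes than the IEEE-like assumption here, so their maximum values differ.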

Formats such as AdaptivFloat explicitly recompute an exponent bias per layer to always maximize representable range and minimize quantization error, an approach that consistently outperforms block floating-point, posit, and uniform integer quantization under tight bitwidth budgets (Tambe et al., 2019).
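The idea of recomputing a bias per layer can be sketched as follows. This is not the published AdaptivFloat algorithm but a simplified illustration under stated assumptions: all exponent codes encode normal numbers, the bias is chosen so the top binade contains the largest magnitude in the tensor, and magnitudes below half the smallest normal flush to zero. The names `choose_bias` and `quantize` are hypothetical.

```python
import math


def choose_bias(values, exp_bits: int) -> int:
    """Pick a bias so the largest exponent code covers max(|values|)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return (1 << (exp_bits - 1)) - 1
    top_binade = math.frexp(max_abs)[1] - 1      # floor(log2(max_abs)), exact
    return (1 << exp_bits) - 1 - top_binade


def quantize(x: float, exp_bits: int, man_bits: int, bias: int) -> float:
    """Round x to the nearest value of the (E, M, bias) format, with clamping."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_min, e_max = -bias, (1 << exp_bits) - 1 - bias
    min_val = 2.0 ** e_min
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max
    if mag < min_val:                            # below the smallest normal
        return sign * min_val if mag >= min_val / 2 else 0.0
    exp = min(math.frexp(mag)[1] - 1, e_max)
    step = 2.0 ** (exp - man_bits)               # spacing inside this binade
    return sign * min(round(mag / step) * step, max_val)  # round() is half-to-even


layer = [0.02, -0.3, 0.75, 1.9]
b = choose_bias(layer, exp_bits=3)
print(b, [quantize(v, 3, 4, b) for v in layer])
```

For this sample tensor the tuned bias is 7, and the smallest entry survives quantization; with the default bias of 3 for `exp_bits=3`, the same entry would fall below the representable window and flush to zero.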

3. Hardware, Algorithmic, and Software Implementations

Hardware Implementations

Customizable FPUs: Open-source units like FPnew implement a datapath parameterized by the exponent and mantissa field widths, supporting fast switching between multiple custom or standard floating-point formats. Fine-grained SIMD capabilities are integrated so that energy and throughput scale with operand width down to 8 bits. Going from 64 to 8 bits yields large energy and throughput improvements in the FPnew silicon measurements (Mach et al., 2020).
Dynamic and Run-Time Reconfiguration: Some designs allow dynamic selection among several exponent/mantissa split modes (a small set of fixed mantissa widths) per operand, facilitating adaptation to the accuracy/performance tradeoff at runtime (Arish et al., 2019). FFP8 and AdaptivFloat also use per-layer coefficients or runtime configuration to optimize quantization windows (Huang et al., 2021, Tambe et al., 2019).
Specialized Hardware: Designs may omit denormals, reduce logic for rounding, and use compact mantissa multipliers or hybrid algorithms (e.g., Karatsuba + Vedic multipliers) for area and energy gains, especially in FPGA deployments (Arish et al., 2019, Campos et al., 2024).

Software and Algorithmic Implementations

Bitslice Vectorization: Bit-level parallelism allows software emulation of arbitrary-precision FP arithmetic on regular integer SIMD units, circumventing fixed hardware widths and making low-bit-width FP efficient. This approach excels when vector width is large and precision is small (Xu et al., 2016, Garland et al., 2020).
Domain-Specific Libraries and DSLs: Libraries such as FlexFloat (Tagliavini et al., 2017) and custom DSLs (Campos et al., 2024) facilitate rapid prototyping of circuits and algorithms using user-defined precision, with support for vectorization and automatic pipeline register balancing.
Mixed-Precision Toolchains: Tools iterate over call sites or tensor variables to minimize the exponent and mantissa widths per site, subject to an overall error budget defined using application metrics (e.g., SQNR, ULP error) (Defour et al., 2020, Tagliavini et al., 2017).

4. Empirical Results, Comparative Analyses, and Application Domains

Custom FP research corroborates improvements in inference/training accuracy, resource utilization, and energy efficiency.

Deep Learning: At 8-bit and 6-bit quantization, AdaptivFloat is evaluated on BLEU and WER against FP32 and outperforms 8-bit integer, block floating-point, and posit variants on Transformers, LSTMs, and CNNs (Tambe et al., 2019). FFP8 achieves Top-1 accuracy close to FP32 with per-layer tuning, with negligible hardware cost (Huang et al., 2021).
LLMs: Microscaling (MXInt/BFP) formats allow LLM inference at a low average mantissa width with small loss vs. FP32, at near-int8 area/energy densities (Cheng et al., 2023). EFloat (entropy-coded floats) reclaims exponent bits for the mantissa in each value, with EF16 yielding lower RMSE than BF16 (Bordawekar et al., 2021).
Scientific Computing: Custom FPUs (FlexFloat-based) reduce both energy and runtime compared to an all-32-bit baseline, with most variables mapped to 8- or 16-bit formats (Tagliavini et al., 2017).
Signal Processing/FFT: In FFT-based spectral algorithms, posit8 and takum8 outperform E4M3/E5M2 and bfloat16, which are unstable due to limited range. Takum16 is specifically recommended for moderate-precision FFT/PDE tasks because of stable, high SNR (Hunhold et al., 29 Apr 2025).
Approximate Search/Compression: Custom 8-bit unsigned formats, optimized for the target value distribution (e4m4, e5m3), reduce memory bank conflicts and latency in GPU-based ANN search, with little loss of recall (Ootomo et al., 2023).
RDBMS and Reproducibility: Superaccumulator-based associative custom FP structures make floating-point aggregation reproducible at some cost in end-to-end wall-clock time, enabling consistent numerical results in high-cardinality group-by queries (Müller et al., 2018).

5. Trade-Offs, Design Guidance, and Methodologies

Designers face multidimensional trade-offs governed by format parameters, hardware constraints, and application-level correctness or quality targets:

Energy/Area/Throughput vs. Precision/Range: Lower bit-widths reduce silicon area and energy linearly (for adders) or sublinearly (for multipliers), but too little exponent width induces overflows/underflows; too little mantissa impairs numerical accuracy (Sentieys et al., 2022).
Application Mapping: For inference and training of DNNs, 8–10 bit floats typically dominate fixed-point; for linear DSP, 12–16 bit fixed-point can outperform FP in accuracy/energy (Sentieys et al., 2022).
Layer/Block/Cluster Adaptation: Layer-specific tuning (AdaptivFloat) and block-wise exponent sharing (MXInt/BFP) enable near-FP32 accuracy at substantially reduced bitwidth and resource (Cheng et al., 2023, Tambe et al., 2019).
Mixed-Precision and Dynamic Policies: Profiling tools (VPREC-libm, FlexFloat, MASE) enable per-call/per-tensor $(E, M)$ allocation under error, energy, or area constraints (Defour et al., 2020, Cheng et al., 2023).

Suggested methodology for format selection (Sentieys et al., 2022, Defour et al., 2020):

Profile value distributions to determine required dynamic range.
Choose exponent width so that all important data avoid overflow.
Allocate remaining bits to the mantissa to meet target error.
Validate via simulated or hardware-in-the-loop evaluation.
Where possible, tune formats per layer/tensor/call site to exploit local dynamic range/precision needs.
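The first three steps of this methodology can be sketched in a few lines. This is an illustrative simplification, not a procedure from the cited works: it assumes one sign bit, that every exponent code encodes a normal number, that the bias is free to tune, and that the error target is a bound on the relative rounding error. The function name `select_format` is hypothetical, and validation (step 4) is left to a separate simulation.

```python
import math


def select_format(samples, total_bits: int, max_rel_error: float):
    """Return (exp_bits, man_bits, bias) or None if the budget cannot be met."""
    mags = [abs(v) for v in samples if v != 0.0]
    if not mags:
        return None
    # Step 1: profile the binades (powers of two) the data occupies.
    hi = math.frexp(max(mags))[1] - 1
    lo = math.frexp(min(mags))[1] - 1
    binades = hi - lo + 1
    # Step 2: smallest exponent width whose 2^E codes cover every binade.
    exp_bits = max(1, (binades - 1).bit_length())
    # Step 3: the remaining bits (after the sign) go to the mantissa.
    man_bits = total_bits - 1 - exp_bits
    if man_bits < 0 or 2.0 ** -(man_bits + 1) > max_rel_error:
        return None
    bias = (1 << exp_bits) - 1 - hi    # top exponent code maps to the top binade
    return exp_bits, man_bits, bias


print(select_format([0.02, -0.3, 0.75, 1.9], total_bits=8, max_rel_error=0.05))
# (3, 4, 7)
```

Running the same selection per layer, tensor, or call site gives the local tuning described in the last step; a `None` result signals that the word length or the error target has to be relaxed.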

6. Novel Encodings and Alternatives to IEEE 754

Several lines of research pursue alternatives to static IEEE 754-style field splits to further optimize resource-accuracy trade-offs:

Tapered Precision (Posit, Takum, HiFloat8): Regime-based encodings (posit, takum) concentrate precision near unit magnitude ($|x| \approx 1$), with reduced precision for extreme magnitudes, improving overall representational efficiency (Hunhold, 18 Mar 2025, Johnson, 2018, Luo et al., 2024).
Entropy Coding (EFloat): Adaptive-length exponent codes reclaim bits for the significand by exploiting nonuniform exponent distributions in data (notably in embeddings), preserving the full FP32 range at a much lower total bit-width (Bordawekar et al., 2021).
Microscaling Formats/Block Floating-Point: Block floating-point with shared exponents (MXInt, BFP) and per-block adaptation is effective at compressing weights/activations with little impact on LLM accuracy, substantially reducing area and arithmetic cost (Cheng et al., 2023).
Hardware Optimization via Bitslice: Bitslice FP arithmetic, mapping arbitrary custom $(E, M)$ splits to wide integer SIMD instructions, provides high-throughput prototype evaluation and directly supports arbitrary-precision FP in software and hardware accelerators (Xu et al., 2016, Garland et al., 2020).
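Of the approaches listed above, block floating-point is the simplest to demonstrate. The sketch below is a generic illustration of a shared exponent with signed integer mantissas, assuming symmetric clamping and round-to-nearest; it does not reproduce the exact MXInt or BFP definitions of the cited work, and the function names are hypothetical.

```python
import math


def bfp_quantize(block, man_bits: int):
    """Quantize a block to one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    shared_exp = math.frexp(max_abs)[1]            # max_abs < 2**shared_exp
    scale = 2.0 ** (shared_exp - (man_bits - 1))   # weight of one mantissa step
    limit = (1 << (man_bits - 1)) - 1              # symmetric signed range
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return shared_exp, mantissas


def bfp_dequantize(shared_exp: int, mantissas, man_bits: int):
    """Reconstruct approximate real values from the block representation."""
    scale = 2.0 ** (shared_exp - (man_bits - 1))
    return [m * scale for m in mantissas]


exp, mants = bfp_quantize([0.5, -0.25, 0.125, 0.9], man_bits=8)
print(exp, mants)                       # 0 [64, -32, 16, 115]
print(bfp_dequantize(exp, mants, 8))    # [0.5, -0.25, 0.125, 0.8984375]
```

Because the exponent is stored once per block, the per-element cost is only the mantissa, and arithmetic inside a block reduces to integer operations; the price is that small values sharing a block with a large one lose precision.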

Format or Approach	Exponent Config	Mantissa Config	Notable Features
AdaptivFloat (Tambe et al., 2019)	Per-layer variable bias $b$	Fixed $M$	Layer-specific dynamic range max/min, optimal clipping/rounding
FPnew (Mach et al., 2020)	Param. $E$	Param. $M$	Multi-format FPU, scalar+SIMD, full IEEE-754 compliance
EFloat (Bordawekar et al., 2021)	Huffman code (δ bits avg)	Bits left after the exponent code	Exponent entropy coding, average 4-bit exponent, maximized mantissa
HOBFLOPS (Garland et al., 2020)	Any $E$	Any $M$	Bitslice vectorization, software/hardware symmetry
FFP8 (Huang et al., 2021)	$E$ (per-tensor/layer)	$M$ (per-tensor/layer)	Per-tensor format, tunable sign, exponent bias, zero retraining
MXInt/BFP (Cheng et al., 2023)	Block-shared $E$	Per-elem $M$	High dynamic range, mixed-precision search, dataflow accelerator
Posit/Takum (Hunhold, 18 Mar 2025, Hunhold et al., 29 Apr 2025)	Regime + exp	Tapered with magnitude	Nonlinear dynamic range/precision curve, class-universal encoding

7. Limitations, Future Directions, and Ongoing Developments

While custom FP formats provide compelling leverage for performance and energy, limitations persist:

Hardware Complexity: Nonstandard encodings (posits, takum, HiFloat8) require regime/scale decoding and sometimes result in irregular pipeline paths or more complicated arithmetic units (Hunhold, 18 Mar 2025, Johnson, 2018).
Conversion Overhead: In software implementations, bitslice and custom FP representations may incur packing/unpacking costs, particularly above 16 bits or with irregular memory access (Xu et al., 2016).
Mixed-Precision Fragmentation: Excessively granular or dynamic allocation of formats risks conversion overheads and poor vectorization (Defour et al., 2020, Tagliavini et al., 2017).
Application-Specific Trade-offs: Some algorithms (large FFTs, PDE solvers) are exceptionally sensitive to dynamic range; others demand uniform precision. Custom FP adoption requires systematic benchmarking within target domains (Hunhold et al., 29 Apr 2025, Sentieys et al., 2022).
Standardization vs. Specialization: AVX10.2 and other ISAs now support multiple low-precision FP formats (E4M3/E5M2/bfloat16), but integrating a single universal, tapered-precision number format (e.g., takum) is proposed as a way to remove complexity (Hunhold, 18 Mar 2025).

Ongoing research is focused on unified encoding/decoding pipelines, extending compiler toolchains for pervasive mixed-precision inference/training, leveraging entropy and data-aware field mapping (as in EFloat), and exploring design-space exploration tools for automated hardware-software co-design at the system level.

References: