AetherFloat Family for AI Accelerators

Updated 2 July 2026
  • AetherFloat Family is a parameterizable floating-point system with explicit mantissa and quad-radix scaling, designed to address IEEE 754 inefficiencies in AI accelerators.
  • The architecture simplifies hardware by enabling zero-cycle integer comparability and reducing MAC unit area, delay, and power through efficient datapath design.
  • Variants like AF8 and AF16 offer an expanded dynamic range and enhanced quantization techniques, eliminating block-level scaling issues found in legacy FP8 formats.

The AetherFloat family refers to a parameterizable line of floating-point number systems and corresponding hardware datapath architectures developed specifically for AI accelerators, addressing the inefficiencies and scaling mismatches introduced by the structure of IEEE 754 formats and recent 8-bit AI-centric proposals such as FP8 E4M3 and OCP MX. The AetherFloat approach synthesizes Lexicographic One’s Complement Unpacking, Quad-Radix (Base-4) Scaling, and Explicit Mantissa Representation, combined with a vector-shared 32-bit Galois stochastic rounding topology, producing a hardware–software co-designed ecosystem optimized for both large-scale neural training and inference. Compared to IEEE 754–based formats, AetherFloat enables a simplified multiply-accumulate (MAC) datapath, removes the need for block-scaling (AMAX) logic, yields a substantial expansion in representable dynamic range, and delivers empirical reductions in MAC unit area, power, and critical path delay (Morisaki, 26 Feb 2026).

1. Architectural Foundations

AetherFloat is constructed atop four foundational innovations:

1. Lexicographic One’s-Complement Unpacking:

Standard IEEE 754 employs sign-magnitude encoding, which complicates integer-aligned operations due to the necessity of multi-cycle floating-point comparison logic. AetherFloat replaces this with an order-preserving One’s-Complement logic. Given an input $X$ of $N$ bits,

$$\text{mask} = X \gg (N-1), \quad U = (X \oplus \text{mask}) \mathbin{\&} (2^{N-1} - 1), \quad S = X \gg (N-1)$$

where $\oplus$ is bitwise XOR and $\&$ is bitwise AND. This encoding allows direct, zero-cycle integer comparability ($\max(0,x)$, sorting, branching on thresholds) with no adder or carry-path.
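The unpacking rule can be exercised in a few lines of Python. This is an illustrative sketch only: it assumes the shift that produces the mask is arithmetic (sign-replicating) while the shift that produces $S$ is logical, and the helper names `unpack`, `pack`, `as_signed`, and `relu` are hypothetical rather than taken from the source. It shows that comparing raw words as signed integers orders them the same way as comparing sign and magnitude.

```python
def unpack(x: int, n: int = 8) -> tuple[int, int]:
    """Split an n-bit word into (sign S, magnitude code U)."""
    s = (x >> (n - 1)) & 1                   # logical shift: the sign bit
    mask = (1 << n) - 1 if s else 0          # assumed arithmetic shift: sign bit replicated
    u = (x ^ mask) & ((1 << (n - 1)) - 1)    # undo the inversion, drop the sign bit
    return s, u


def pack(s: int, u: int, n: int = 8) -> int:
    """Inverse of unpack: negative values store the inverted magnitude."""
    low = (1 << (n - 1)) - 1
    return u if s == 0 else (1 << (n - 1)) | (~u & low)


def as_signed(x: int, n: int = 8) -> int:
    """Reinterpret an n-bit word as a two's-complement integer."""
    return x - (1 << n) if x >> (n - 1) else x


def relu(x: int, n: int = 8) -> int:
    """max(0, x) using only an integer sign test on the raw word."""
    return x if as_signed(x, n) >= 0 else 0
```

Under these assumptions a negative word with magnitude code `u` reads as the integer `-1 - u`, so larger magnitudes sort lower, and negative zero sorts just below positive zero.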

2. Quad-Radix (Base-4) Scaling:

Discarding the traditional power-of-two normalization, AetherFloat scales exponents in powers of $4$. For normal values,

$$x = (-1)^S \times \frac{M}{2^p} \times 4^{E - e_{\mathrm{bias}}}$$

with $p$ an implicit radix-point shift (1 for AF8, 6 for AF16). Alignment shifts in the MAC datapath now occur in 2-bit increments, reducing the barrel shifter from four to two stages and supporting a faster-growing dynamic range in limited bitwidth regimes.
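A small sketch makes the value formula and the 2-bit alignment step concrete. The bias is passed in as a parameter `e_bias` because its value is not given here; the function names `decode` and `align_mantissa` are hypothetical. Exact rationals are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction


def decode(s: int, m: int, e: int, p: int, e_bias: int) -> Fraction:
    """Evaluate (-1)^S * (M / 2^p) * 4^(E - e_bias) exactly."""
    return (-1) ** s * Fraction(m, 2 ** p) * Fraction(4) ** (e - e_bias)


def align_mantissa(m: int, e: int, e_target: int) -> int:
    """Align a mantissa to a larger exponent: each base-4 step is a 2-bit shift."""
    return m >> (2 * (e_target - e))
```

For example, with `p = 1` and a placeholder bias of 3, `decode(0, 4, 3, 1, 3)` evaluates to 2 and `decode(1, 2, 4, 1, 3)` to -4. Raising the exponent by one while shifting the mantissa right by two bits leaves the value unchanged whenever no set bits are shifted out.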

3. Fully Explicit Mantissa Representation:

AF8 omits the hidden leading ‘1’ bit of IEEE 754, instead encoding all three mantissa bits directly ($M \in [0,7]$, with specific constraints for normals/subnormals). Subnormals simply relax the leading-bit constraint and proceed through standard multiplier arrays without microcode or trap logic.

4. Vector-Shared 32-bit Galois Stochastic Rounding:

Instead of the typical per-unit PRNG for stochastic rounding, AetherFloat introduces a chunked topology that deploys one 32-bit Galois LFSR per SIMD lane (e.g., shared by 16 MAC units). The shared generator stochastically rounds results during backpropagation with bounded precision variance, low hardware overhead, and few correlation artifacts.

2. Format Specifications: AF8 and AF16

AetherFloat is instantiated primarily in two precision variants, each tuned to a different operational context.

AF8 (AetherFloat-8):

  • 8 bits total:
    • Sign ($S$): 1 bit
    • Exponent ($E$): 4 bits, Base-4, bias […]
    • Mantissa ($M$): 3 bits, explicit
  • Normals ([…]):

$$x = (-1)^S \times \frac{M}{2} \times 4^{E - e_{\mathrm{bias}}}$$

  • Subnormals ([…]):

[…]

  • Special values ([…]): NaN/Inf, lexicographically mapped to integer extremities
  • Minimum positive quantum: […]
  • Practical dynamic range: […] to […]

AF16 (AetherFloat-16):

  • 16 bits total:
    • Sign: 1 bit
    • Exponent: 7 bits, Base-4, bias […]
    • Mantissa: 8 bits, explicit
  • Normals ([…]):

$$x = (-1)^S \times \frac{M}{2^6} \times 4^{E - e_{\mathrm{bias}}}$$

  • Subnormals ([…]):

[…]
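The two layouts differ only in field widths, so a single field splitter covers both. The sketch below assumes the fields are ordered sign, exponent, mantissa from most to least significant bit and that splitting happens on the magnitude code after the sign has been removed; neither detail is stated explicitly here. `split_fields`, `AF8`, and `AF16` are hypothetical names.

```python
AF8 = {"exp_bits": 4, "man_bits": 3}    # 1 + 4 + 3 = 8 bits
AF16 = {"exp_bits": 7, "man_bits": 8}   # 1 + 7 + 8 = 16 bits


def split_fields(u: int, exp_bits: int, man_bits: int) -> tuple[int, int]:
    """Return (E, M) from a magnitude code, assuming E sits above M."""
    m = u & ((1 << man_bits) - 1)
    e = (u >> man_bits) & ((1 << exp_bits) - 1)
    return e, m
```

For instance, `split_fields(0b1010101, **AF8)` yields exponent 10 and mantissa 5.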

Dynamic Range Comparison Table

| Format | Dynamic Range (practical) |
|---|---|
| AF8 | […] to […] |
| FP8 E4M3 | […] to […] |
| bfloat16 | […] to […] |

AF8’s dynamic range is markedly expanded relative to FP8, supporting inference workloads with substantial activation outliers without requiring dynamic rescaling.
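The effect of the radix on range can be checked with simple arithmetic. The sketch below counts only the span contributed by the exponent field, in powers of two, and deliberately ignores exponent codes reserved for special values, subnormals, and the mantissa's contribution, so it is a rough comparison rather than either format's exact range. `exponent_span_bits` is a hypothetical helper.

```python
import math


def exponent_span_bits(exp_bits: int, radix: int) -> float:
    """log2 of the ratio between the largest and smallest exponent scale factors."""
    return (2 ** exp_bits - 1) * math.log2(radix)


base2 = exponent_span_bits(4, 2)   # 4-bit exponent, power-of-two scaling
base4 = exponent_span_bits(4, 4)   # 4-bit exponent, power-of-four scaling
```

With the same four exponent bits, base-4 scaling doubles the span from 15 to 30 binades, which is the mechanism behind the wider range.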

3. Hardware Advantages and Trade-offs

AetherFloat-8 synthesizes to a simpler and smaller MAC datapath due to its explicit mantissa and quad-radix shifting, eliminating the need for half of the alignment barrel-shifter stages compared to FP8.

MAC Datapath Metrics

| Metric | E4M3 FP8 | AF8 | Relative Δ |
|---|---|---|---|
| Multiplier | […] | […] | |
| Area (µm²) | 1018.48 | 680.65 | −33.2% |
| Delay (ps) | 2426.30 | 2141.60 | −11.7% |
| Power (µW) | 84.60 | 66.00 | −22.0% |
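The relative changes implied by the area, delay, and power rows can be recomputed directly from the two numeric columns. The snippet below does only that arithmetic; the dictionary keys and the helper name `relative_delta` are hypothetical.

```python
# (E4M3 FP8, AF8) pairs taken from the metrics above
metrics = {
    "area": (1018.48, 680.65),
    "delay": (2426.30, 2141.60),
    "power": (84.60, 66.00),
}


def relative_delta(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return 100.0 * (candidate - baseline) / baseline


deltas = {name: round(relative_delta(b, c), 1) for name, (b, c) in metrics.items()}
```

This gives reductions of roughly 33.2% in area, 11.7% in delay, and 22.0% in power for AF8 relative to E4M3.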

Zero-cycle integer comparability arises because the One’s-Complement unpack enables operations like ReLU and sorting, as well as threshold comparisons with no invocation of pipelined FPU logic. Subnormals are handled branchlessly since the datapath does not differentiate via trap logic or microcode.

4. The Block-Scale-Free Property

Legacy FP8 and OCP MX inference rely on a per-block AMAX scaling step to prevent overflow from LLM activation outliers:

[…]

This step causes inference stalls and can collapse values when block maxima dominate, leading to local underflow for small elements.
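The underflow effect is easy to reproduce with a toy model. The sketch below assumes the usual form of AMAX scaling, in which each block is multiplied by the format maximum divided by the block's absolute maximum, and uses the E4M3 limits from the OCP FP8 specification (largest finite value 448, smallest subnormal $2^{-9}$). It approximates round-to-nearest by treating anything below half the smallest subnormal as zero. The function names are hypothetical.

```python
E4M3_MAX = 448.0
E4M3_MIN_SUBNORMAL = 2.0 ** -9


def amax_scale(block: list[float]) -> float:
    """Per-block scale that maps the block's absolute maximum onto E4M3_MAX."""
    amax = max(abs(v) for v in block)
    return E4M3_MAX / amax


def flushed_to_zero(block: list[float]) -> list[bool]:
    """True for elements that round to zero after scaling (toy model)."""
    s = amax_scale(block)
    return [abs(v * s) < E4M3_MIN_SUBNORMAL / 2 for v in block]
```

A block of small values such as `[1e-4, 2e-4]` survives scaling, but adding a single outlier of `1000.0` shrinks the scale so much that both small values are flushed to zero while the outlier is preserved.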

AetherFloat-8’s expanded dynamic range eliminates AMAX logic entirely. Outliers are natively representable, so only an offline per-tensor quantizer is required during quantization-aware training (QAT). Inference no longer requires block-based dynamic scaling; all MACs use fixed decoding and summation hardware.

A plausible implication is that the absence of dynamic scaling reduces both control complexity and the risk of block-level quantization artifacts at deployment scale.

5. Quantization and Vector-Shared Stochastic Rounding

Stochastic rounding, essential for gradient preservation in low-precision training, is implemented via a 32-bit Galois LFSR shared per vector lane (e.g., per 16 MAC units). For each accumulation […], rounding is performed to the nearest AetherFloat quantum […] with probability:

[…]

This vector-shared approach amortizes logic (one LFSR per SIMD slice), controls SQNR wobble ([…] dB at 16-bit, absorbed by stochastic gradient descent), and prevents the gradient vanishing observed in strictly quantized legacy formats. Inference disables the stochastic mechanism entirely, preserving determinism.
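As a point of reference, the textbook stochastic rounding rule rounds up with probability equal to the fractional distance to the lower grid point, which makes the result unbiased in expectation. The sketch below implements that standard rule on a uniform grid of spacing `q`; it is an assumption that the scheme described here follows the same rule, and the 8-bit random threshold and the name `stochastic_round` are hypothetical.

```python
import math


def stochastic_round(x: float, q: float, rand_bits: int, width: int = 8) -> float:
    """Round x to a multiple of q, rounding up with probability (x - lo) / q."""
    lo = math.floor(x / q) * q              # nearest grid point at or below x
    frac = (x - lo) / q                     # position within the cell, in [0, 1)
    threshold = rand_bits / (1 << width)    # uniform sample in [0, 1)
    return lo + q if threshold < frac else lo
```

Sweeping all 256 possible thresholds for `x = 0.25` and `q = 1.0` rounds up exactly 64 times, so the average result equals the input. Values already on the grid are never perturbed.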

6. Application Domains and Empirical Performance

AF16 is positioned as a near-lossless bfloat16 replacement, suitable for direct post-training quantization:

Metric BF16 AF16 Δ
WikiText-2 PPL 8.7368 8.7380 +0.0012
HellaSwag Acc 0.5990 0.5999 +0.0009

AF8 is intended for scenarios where Quantization-Aware Training (QAT) is viable. Pure post-training quantization leads to degradation, as small weights underflow below the minimum quantum […]. With QAT, which uses straight-through estimators in the forward path and vector-shared stochastic rounding in the backward pass, training on Qwen2.5-7B converges to losses that match or improve on FP8 E4M3 baselines by step 150, without AMAX instabilities. AF8 also recovers from transient optimization spikes more strongly than legacy FP8.

The AetherFloat family expands efficient deep learning inference and training into regimes previously restricted by the trade-offs of traditional 8-bit floating-point, simplifying hardware, eliminating block-scaling, and enabling broader activation fidelity in LLM-scale deployments (Morisaki, 26 Feb 2026).

References (1)
