AetherFloat-8 (AF8) for AI Acceleration

Updated 2 July 2026

AetherFloat-8 (AF8) is an 8-bit floating-point architecture designed for AI co-design, featuring a wide dynamic range and deterministic integer comparability.
It employs quad-radix scaling, an explicit 3-bit mantissa, and lexicographic one’s-complement unpacking to simplify the hardware datapath and inference.
AF8’s block-scale-free property and quantization-aware training workflow enhance large language model performance while reducing area and power consumption.

AetherFloat-8 (AF8) is an 8-bit floating-point architecture designed explicitly for AI accelerator hardware-software co-design. As a member of the AetherFloat Family, AF8 departs from IEEE 754 conventions in favor of structural optimizations that provide a wide dynamic range, reduced power and area, deterministic integer comparability, and a quantization-aware training (QAT) workflow for LLM inference deployment. AF8 is further distinguished by its block-scale-free property, which eliminates the need for dynamic block-scaling (AMAX) hardware logic, and by an explicit 3-bit mantissa, quad-radix scaling, and vector-shared stochastic rounding (Morisaki, 26 Feb 2026).

1. Structural and Encoding Design

AF8 implements quad-radix (Base-4) scaling, an explicit mantissa, and lexicographic one’s-complement unpacking. The floating-point word comprises a sign bit ( $S$ ), a 4-bit exponent ( $E \in [0, 15]$ , bias 7), and a 3-bit mantissa ( $M \in [0, 7]$ ).

Quad-Radix Scaling: Each exponent increment multiplies the value by 4, so operands are aligned in 2-bit steps for addition and subtraction rather than the 1-bit steps of IEEE 754 base-2 formats. This permits a shallow 2-stage 4-to-1 multiplexer for mantissa alignment and squares the dynamic range available from a given exponent width ($4^k = (2^k)^2$), at an SQNR penalty of $\approx 3.04$ dB relative to base-2.
Explicit Mantissa Encoding (No Hidden Bit): For normal numbers ( $1 \leq E \leq 14, \text{leading 2 bits of } M \neq 00$ ),

$x = (-1)^S \times \frac{M}{2} \times 4^{(E-7)}.$

For subnormals ( $E=0, M\in\{0,1\}$ ),

$x = (-1)^S \times M \times 2^{-13}.$

Subnormal handling is branchless, with subnormals processed through the standard multiplier and adder arrays without any microcode traps.

Lexicographic One’s-Complement Unpacking: For negative values, the magnitude field is bitwise-inverted before integer comparison. After this transformation, ordinary signed-integer ordering of the encoded words matches the ordering of the values they represent, which enables zero-cycle, monotonic integer comparison (e.g., ReLU as a simple integer max) and efficient branchless subnormal handling.
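
The encoding rules above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the bit order (sign in the most significant bit, then exponent, then mantissa) is an assumption, as are the helper names `af8_fields`, `af8_is_canonical`, `af8_decode`, `af8_order_key`, and `af8_relu`. Bit patterns that the text does not define (for example $E = 15$) are rejected rather than guessed.

```python
BIAS = 7  # exponent bias stated in the text

def af8_fields(code: int):
    """Split a byte into (S, E, M), assuming the layout S | EEEE | MMM, MSB first."""
    return (code >> 7) & 1, (code >> 3) & 0xF, code & 0x7

def af8_is_canonical(code: int) -> bool:
    """True for the two encodings the text defines: subnormals and normals."""
    _, e, m = af8_fields(code)
    return (e == 0 and m <= 1) or (1 <= e <= 14 and m >= 2)

def af8_decode(code: int) -> float:
    """Decode a canonical AF8 byte with the two formulas given in the text."""
    s, e, m = af8_fields(code)
    if not af8_is_canonical(code):
        raise ValueError(f"bit pattern {code:#04x} is not covered by the text")
    sign = -1.0 if s else 1.0
    if e == 0:
        return sign * m * 2.0 ** -13              # subnormal: M * 2^-13
    return sign * (m / 2.0) * 4.0 ** (e - BIAS)   # normal: (M/2) * 4^(E-7)

def af8_order_key(code: int) -> int:
    """Signed integer whose ordering follows the decoded values."""
    if code & 0x80:
        code ^= 0x7F                               # invert the 7 magnitude bits of negatives
    return code - 256 if code & 0x80 else code     # reinterpret the byte as int8

def af8_relu(code: int) -> int:
    """ReLU as an integer max against the encoding of +0 (0x00)."""
    return max(code, 0x00, key=af8_order_key)
```

Under these assumptions, `0x3A` ($S=0$, $E=7$, $M=2$) decodes to 1.0, and `af8_relu(0xBA)` (the encoding of $-1.0$) returns `0x00`. The only pair that compares unequal as integers but equal as values is $-0$ and $+0$.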

2. Block-Scale-Free Property and Dynamic Range

AF8’s most salient distinction is the block-scale-free property. Standard 8-bit floating-point formats (e.g., FP8 E4M3, OCP MX) operate within a limited dynamic range and rely on AMAX logic to detect and scale maximum activations per tile, which guards against overflow but incurs hardware penalties. AF8, by contrast, natively absorbs activation outliers:

Dynamic Range: Hardware-optimized AF8 provides a positive representable range of approximately $1.22\times 10^{-4}$ to $5.73\times 10^4$ (mathematically idealized up to […]), orders of magnitude greater than FP8 E4M3 ([…] to […]–[…]).
Block-Scale-Free Inference: Outlier activations do not force neighboring values to zero, nor require extraction/stall, enabling a robust inference path without shared-exponent or dynamic block scaling.

Format	Mantissa bits	Exponent bits (radix)	Min₊	Max₊
AF8	3 explicit	4 (Base-4)	$1.22\times 10^{-4}$	$5.73\times 10^4$
FP8 E4M3	3 + hidden	4 (Base-2)	[…]	[…]
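
The AF8 range quoted above follows directly from the Section 1 formulas. The short check below enumerates the positive values they define (subnormal $M = 1$, and normals with $1 \leq E \leq 14$, $M \geq 2$); the function name `af8_positive_values` is illustrative only.

```python
import math

def af8_positive_values():
    """Positive AF8 values defined by the subnormal and normal formulas in Section 1."""
    vals = [1 * 2.0 ** -13]                          # the single nonzero subnormal (M = 1)
    for e in range(1, 15):                           # normal exponents 1..14
        for m in range(2, 8):                        # leading two mantissa bits not both zero
            vals.append((m / 2.0) * 4.0 ** (e - 7))
    return vals

vals = af8_positive_values()
smallest, largest = min(vals), max(vals)
binades = math.log2(largest / smallest)
print(f"min+ = {smallest:.3e}, max+ = {largest:.3e}, {binades:.1f} binades")
# prints: min+ = 1.221e-04, max+ = 5.734e+04, 28.8 binades
```

The maximum is $\frac{7}{2} \times 4^{7} = 57344$ and the minimum is $2^{-13}$, matching the approximate figures in the text.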

3. Hardware Metrics and Implementation

Synthesis on SkyWater 130 nm demonstrates architectural efficiency relative to FP8 and legacy designs:

Area: MAC unit area reduction from […] (FP8) to […] (AF8), a 33% saving.
Power: Aggregate power reduction of 22% across MACs.
Critical Path Delay: Reduction by 12% due to the shallow multiplier array (3×3, not 4×4 or 8×8).
Area × Delay Product: Improvement of […].

AF8 removes the need for AMAX hardware and exception logic, yielding additional savings in massively parallel neural accelerators.

Vector-Shared Galois Stochastic Rounding: A single 32-bit Galois LFSR per SIMD lane supports stochastic rounding in backward passes during training, while forward inference employs deterministic rounding. This design bounds variance, eliminates vanishing gradients, and minimizes PRNG cost (one LFSR per vector, not per ALU).
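
A rough software model of the shared-PRNG idea is sketched below. Several details are assumptions, since the text does not give them: the tap mask `0x80200003` (a commonly cited maximal-length choice for 32 bits), the seed, the way each element takes a rotated 8-bit slice of the shared word, and the use of a unit-spaced grid instead of the AF8 grid. The names `GaloisLFSR32`, `rotl32`, and `stochastic_round_vector` are illustrative.

```python
import math

class GaloisLFSR32:
    """32-bit Galois LFSR; the tap mask is an assumed polynomial, not from the source."""
    TAPS = 0x80200003  # x^32 + x^22 + x^2 + x + 1

    def __init__(self, seed: int = 0xACE1ACE1):
        if seed & 0xFFFFFFFF == 0:
            raise ValueError("seed must be non-zero")
        self.state = seed & 0xFFFFFFFF

    def step(self) -> int:
        lsb = self.state & 1
        self.state >>= 1
        if lsb:
            self.state ^= self.TAPS
        return self.state

def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def stochastic_round_vector(values, lfsr):
    """Round each value to floor or ceil, using one LFSR step for the whole vector."""
    word = lfsr.step()                                     # single PRNG step, shared by all elements
    out = []
    for i, v in enumerate(values):
        lo = math.floor(v)
        frac = v - lo
        threshold = (rotl32(word, 8 * i) & 0xFF) / 256.0   # assumed per-element 8-bit slice
        out.append(lo + 1 if threshold < frac else lo)     # round up with probability close to frac
    return out
```

In this model, elements whose indices differ by a multiple of four reuse the same byte, so their rounding decisions are correlated; a hardware design would have to choose how to decorrelate them. Exactly representable inputs (fraction 0) are never perturbed.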

4. Quantization, Training, and Inference Pipeline

AF8 mandates quantization-aware training (QAT) because it has no hidden bit and uses base-4 rather than base-2 exponent scaling:

Workflow:

Start from pretrained FP16/bfloat16 weights.
Quantize weights and activations to AF8 with deterministic rounding in the forward pass.
Compute loss in high precision.
Backpropagate with STE (quantizer as identity), stochastically rounding gradients using vector-shared LFSR.
Apply optimizer updates in full precision.
Iterate until convergence.
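
The workflow above can be illustrated on a toy problem: fitting a single scalar weight with a full-precision master copy, AF8 fake quantization in the forward pass, and a straight-through estimator in the backward pass. This is a sketch under stated assumptions, not the authors' training code: Python's `random` stands in for the vector-shared LFSR, ties in nearest rounding go to the lower neighbor, out-of-range values saturate, and all names (`GRID`, `neighbors`, `quantize_nearest`, `quantize_stochastic`, `train`) are hypothetical.

```python
import bisect
import random

def af8_grid():
    """All AF8 values (both signs) defined by the Section 1 formulas."""
    pos = [0.0, 2.0 ** -13]
    pos += [(m / 2.0) * 4.0 ** (e - 7) for e in range(1, 15) for m in range(2, 8)]
    return sorted({-v for v in pos} | set(pos))

GRID = af8_grid()

def neighbors(x):
    """Grid points just below and above x, saturating at the ends (assumed behavior)."""
    x = min(max(x, GRID[0]), GRID[-1])
    hi = bisect.bisect_left(GRID, x)
    lo = hi if GRID[hi] == x else hi - 1
    return GRID[lo], GRID[hi]

def quantize_nearest(x):
    """Deterministic rounding for the forward pass (tie rule is an assumption)."""
    lo, hi = neighbors(x)
    return lo if (x - lo) <= (hi - x) else hi

def quantize_stochastic(x, rng):
    """Unbiased stochastic rounding between the two neighboring grid points."""
    lo, hi = neighbors(x)
    if lo == hi:
        return lo
    return hi if rng.random() < (x - lo) / (hi - lo) else lo

def train(steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)                     # stand-in for the shared LFSR
    xs = [quantize_nearest(x) for x in (0.5, 1.0, 1.5, 2.0)]   # quantized activations
    ys = [3.0 * x for x in xs]                    # target weight 3.0 lies on the AF8 grid
    w = 0.3                                       # full-precision master weight
    for _ in range(steps):
        w_q = quantize_nearest(w)                 # forward: deterministic rounding
        preds = [w_q * x for x in xs]
        # Gradient of the mean squared error, computed in high precision.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        grad_q = quantize_stochastic(grad, rng)   # STE: gradient of w_q is applied to w
        w -= lr * grad_q                          # full-precision optimizer update
    return w, quantize_nearest(w)
```

Calling `train()` drives the quantized weight to 3.0: once `w_q` reaches the target the loss gradient is exactly zero, so the master weight stops moving.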

Inference Accuracy: Under post-training quantization (PTQ) of Qwen2.5-7B, AF8 yields a WikiText-2 perplexity of […] (vs. […] for BF16 and […] for FP8), a PIQA score of […] ([…] for BF16), and a HellaSwag score of […] ([…] for BF16), confirming that AF8 is not suitable for drop-in PTQ. QAT stabilizes performance and maintains gradient flow without AMAX stalls.

5. Comparative Analysis with Existing 8-bit Formats

AF8 is contrasted against industry 8-bit floating-point alternatives on several axes:

Format	Block-Scale	Dynamic Range	Hidden Bit	Subnormals	QAT/PTQ
FP8 E4M3	AMAX req.	[…]	Yes	Trapped	PTQ OK
OCP MX	AMAX req.	Shared exponent across block	Yes	Yes	PTQ OK
AF8	Free	[…]	No	Branchless (1 step)	QAT required
AF16	Free	[…]	No	Branchless	PTQ OK

Use-Case Highlights: AF8’s dynamic range is advantageous for LLM inference workloads with wide activation distributions. The lexicographic one’s-complement supports efficient zero-cycle control flows, permitting ReLU and max-pooling via integer compares without FPU bypass.
Hardware Optimization: 33% reduction in area, 22% power reduction, and 12% lower critical path in MAC cores compared to FP8.
Near-lossless Training: AF16 extends AF8’s design principles to match bfloat16 dynamic range and enables effective post-training quantization.

Alternative floating-point and log-domain formats have been proposed for reduced-precision inference. The log-float format described in "Rethinking floating point for deep learning" (Johnson, 2018) employs a tapered posit-style regime-exponent encoding and a hybrid log-multiply/linear-add datapath with Kulisch accumulation. This log-float achieves competitive accuracy (–0.9% top-1 on ResNet-50/ImageNet), synthesizes at 0.96× the power and 1.12× the area of an int8/32 MAC, and supports drop-in replacement without retraining, unlike AF8 with its QAT requirement. Log-float architectures provide further dynamic range (around 145 dB), while AF8 maximizes area and power reduction under quad-radix scaling and explicit integer comparability. Both families exemplify the larger trend toward architecture-aligned floating-point representations for efficient AI hardware (Morisaki, 26 Feb 2026; Johnson, 2018).


References

Morisaki (2026). The AetherFloat Family: Block-Scale-Free Quad-Radix Floating-Point Architectures for AI Accelerators.

Johnson (2018). Rethinking floating point for deep learning.
