AetherFloat-8 (AF8) for AI Acceleration
- AetherFloat-8 (AF8) is an 8-bit floating-point architecture designed for AI co-design, featuring a wide dynamic range and deterministic integer comparability.
- It employs quad-radix scaling, an explicit 3-bit mantissa, and lexicographic one’s-complement unpacking to streamline hardware efficiency and inference.
- AF8’s block-scale-free property and quantization-aware training workflow enhance large language model performance while reducing area and power consumption.
AetherFloat-8 (AF8) is an 8-bit floating-point architecture designed explicitly for AI accelerator hardware-software co-design. As a member of the AetherFloat Family, AF8 departs from IEEE 754 conventions in favor of structural optimizations that provide a wide dynamic range, reduced power and area, deterministic integer comparability, and a quantization-aware training (QAT) workflow for LLM and inference deployment. AF8 is further distinguished by its block-scale-free property—eliminating the need for dynamic block-scaling (AMAX) hardware logic—and by an explicit 3-bit mantissa, quad-radix scaling, and vector-shared stochastic rounding (Morisaki, 26 Feb 2026).
1. Structural and Encoding Design
AF8 implements quad-radix (Base-4) scaling, an explicit mantissa, and lexicographic one’s-complement unpacking. The floating-point word comprises a sign bit, a 4-bit exponent (bias 7), and a 3-bit mantissa.
- Quad-Radix Scaling: Each exponent increment multiplies the value by 4, so mantissa alignment during addition or subtraction shifts in 2-bit steps rather than the 1-bit steps of IEEE 754 base-2 formats. This choice permits a shallow 2-stage 4-to-1 multiplexer for mantissa alignment and squares the range spanned by a given exponent width (4^n = (2^n)^2), at the cost of an SQNR penalty relative to base-2.
- Explicit Mantissa Encoding (No Hidden Bit): Normal numbers and subnormals both store all mantissa bits explicitly, with no implied leading 1. Subnormal handling is branchless: subnormals pass through the standard multiplier and adder arrays without any microcode traps.
- Lexicographic One’s-Complement Unpacking: For negative values, the magnitude field is bitwise-inverted before integer comparability operations. This transformation enables zero-cycle, monotonic integer comparability (e.g., ReLU as a simple integer max) and efficient branchless subnormal handling.
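The one’s-complement unpacking described above can be sketched directly on raw bit patterns. This is a minimal illustration, not the reference design: it assumes an 8-bit word with the sign in bit 7 and a 7-bit magnitude field whose integer order already matches the order of the encoded values. The helper names `af8_sort_key` and `af8_relu` are hypothetical.

```python
def af8_sort_key(word: int) -> int:
    """Map an 8-bit sign-magnitude word to a signed integer that sorts monotonically."""
    if not 0 <= word <= 0xFF:
        raise ValueError("expected an 8-bit word")
    if word & 0x80:
        word ^= 0x7F  # negative: invert the 7 magnitude bits (one's complement)
    # Reinterpret the 8 bits as a two's-complement signed integer.
    return word - 0x100 if word & 0x80 else word


def af8_relu(word: int) -> int:
    """ReLU as an integer max: non-negative keys equal their word, negatives become +0."""
    return max(af8_sort_key(word), 0)


# Sorting raw words by key orders them from most negative to most positive.
words = [0x35, 0x85, 0x00, 0x80, 0xFF, 0x7F]
ordered = sorted(words, key=af8_sort_key)
```

Under these assumptions a negative word with magnitude m maps to the key -1 - m, so larger negative magnitudes sort lower and -0 lands just below +0.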
2. Block-Scale-Free Property and Dynamic Range
AF8’s most salient distinction is the block-scale-free property. Standard 8-bit floating-point formats (e.g., FP8 E4M3, OCP MX) operate within a limited dynamic range and rely on AMAX logic to detect and scale maximum activations per tile, which guards against overflow but incurs hardware penalties. AF8, by contrast, natively absorbs activation outliers:
- Dynamic Range: Hardware-optimized AF8 provides a positive representable range that is orders of magnitude wider than that of FP8 E4M3 (with a separate, mathematically idealized upper bound).
- Block-Scale-Free Inference: Outlier activations do not force neighboring values to zero, nor require extraction/stall, enabling a robust inference path without shared-exponent or dynamic block scaling.
| Format | Mantissa bits | Exponent |
|---|---|---|
| AF8 | 3 explicit | 4 bits (Base-4) |
| FP8 E4M3 | 3 + hidden | 4 bits (Base-2) |
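The failure mode that block-scale-free inference avoids can be illustrated with a deliberately simplified model of shared scaling. The sketch below is an assumption for illustration only: it uses integer elements in the range -7..7 with one scale derived from the block's absolute maximum, which is coarser than real FP8 or MX element types, and the function name `block_scale_quantize` is hypothetical.

```python
def block_scale_quantize(block, levels=7):
    """Quantize a block with one shared scale derived from its absolute maximum (AMAX)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    scale = amax / levels  # a single scale shared by every element in the block
    return [round(x / scale) * scale for x in block]


activations = [0.01, 0.02, -0.015, 100.0]  # one outlier among small values
dequantized = block_scale_quantize(activations)
# The outlier sets the scale, so the three small activations come back as 0.0.
```

A format in which every element carries its own wide-range exponent has no shared scale for the outlier to dominate, which is the property the text attributes to AF8.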
3. Hardware Metrics and Implementation
Synthesis on SkyWater 130 nm demonstrates architectural efficiency relative to FP8 and legacy designs:
- Area: MAC unit area is 33% smaller for AF8 than for FP8.
- Power: Aggregate power across MACs is reduced by 22%.
- Critical Path Delay: Reduced by 12%, owing to the shallow multiplier array (3×3, not 4×4 or 8×8).
- Area × Delay Product: Improved as well; the rounded area and delay figures above imply roughly 0.67 × 0.88 ≈ 0.59 of the FP8 baseline.
AF8 removes the need for AMAX hardware and exception logic, yielding additional savings in massively parallel neural accelerators.
- Vector-Shared Galois Stochastic Rounding: A single 32-bit Galois LFSR per SIMD lane supports stochastic rounding in backward passes during training, while forward inference employs deterministic rounding. This design bounds variance, eliminates vanishing gradients, and minimizes PRNG cost (one LFSR per vector, not per ALU).
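Vector-shared stochastic rounding can be sketched as follows. This is a minimal sketch under stated assumptions: the tap mask `0x80200003` (taps 32, 22, 2, 1) is a commonly used maximal-length choice, not one named by the source; the per-lane bit rotation is a hypothetical way to derive different thresholds from one shared state; and rounding to an integer grid stands in for dropping low-order mantissa bits.

```python
import math

TAP_MASK = 0x80200003  # assumed taps (32, 22, 2, 1); the source does not name a polynomial


def lfsr_step(state: int) -> int:
    """Advance a 32-bit Galois LFSR by one step (state must be non-zero)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAP_MASK
    return state


def stochastic_round_vector(values, state):
    """Round each value to the integer grid, sharing one LFSR step across the vector."""
    state = lfsr_step(state)  # one PRNG update for the whole vector
    rounded = []
    for lane, x in enumerate(values):
        rot = lane % 32
        word = ((state >> rot) | (state << (32 - rot))) & 0xFFFFFFFF
        threshold = (word & 0xFF) / 256.0  # per-lane threshold in [0, 1)
        base = math.floor(x)
        # Round up with probability roughly equal to the fractional part.
        rounded.append(base + (1 if (x - base) > threshold else 0))
    return rounded, state


grads, next_state = stochastic_round_vector([0.25, 1.5, -0.75, 3.0], 0xACE1ACE1)
```

Each result is either the floor or the ceiling of its input, and exact grid values such as 3.0 are never perturbed. Sharing one state update per vector is what keeps the PRNG cost independent of the number of ALUs.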
4. Quantization, Training, and Inference Pipeline
AF8 mandates quantization-aware training (QAT) because it lacks a hidden bit and has unique scaling behavior:
- Workflow:
- Start from pretrained FP16/bfloat16 weights.
- Quantize weights and activations to AF8 with deterministic rounding in the forward pass.
- Compute loss in high precision.
- Backpropagate with STE (quantizer as identity), stochastically rounding gradients using vector-shared LFSR.
- Apply optimizer updates in full precision.
- Iterate until convergence.
- Inference Accuracy: With post-training quantization (PTQ) on Qwen2.5-7B, AF8 degrades WikiText-2 perplexity relative to BF16 and FP8 and lowers PIQA and HellaSwag accuracy relative to BF16, confirming that AF8 is not suitable for drop-in PTQ. QAT stabilizes performance and maintains gradient flow without AMAX stalls.
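The workflow steps listed above can be sketched for a tiny linear model in plain Python. Everything here is illustrative and assumed rather than taken from the source: `fake_quant` is a stand-in quantizer (3-bit mantissa times a power of 4, with an assumed exponent range of -7 to 8) and is not the AF8 encoding, `random.Random` stands in for the vector-shared LFSR, and `qat_train` is a hypothetical name.

```python
import math
import random


def fake_quant(x: float, u: float = 0.5) -> float:
    """Stand-in quantizer: value m/8 * 4**e with a 3-bit m (NOT the AF8 specification)."""
    if x == 0.0:
        return 0.0
    _, p = math.frexp(abs(x))            # abs(x) = f * 2**p with f in [0.5, 1)
    e = min(max((p + 1) // 2, -7), 8)    # base-4 exponent, clamped to the assumed range
    m = math.floor(abs(x) / 4.0**e * 8 + u)  # u = 0.5: nearest; u in [0, 1): stochastic
    m = min(m, 7 if e == 8 else 8)       # saturate at the top of the range
    return math.copysign(m / 8 * 4.0**e, x)


def qat_train(weights, data, lr=0.05, epochs=50, seed=0):
    """Run the QAT loop: quantized forward pass, STE backward pass, float update."""
    rng = random.Random(seed)            # stand-in for the vector-shared LFSR
    w = list(weights)                    # full-precision master weights
    for _ in range(epochs):
        for x, target in data:
            # Forward: deterministic quantization of weights and activations.
            wq = [fake_quant(wi) for wi in w]
            xq = [fake_quant(xi) for xi in x]
            pred = sum(a * b for a, b in zip(wq, xq))
            err = pred - target          # loss = 0.5 * err**2, kept in full precision
            # Backward: the STE treats the quantizer as identity, so dL/dw = err * xq.
            u = rng.random()             # one random draw shared by the whole vector
            grads = [fake_quant(err * xi, u) for xi in xq]
            # Optimizer update applied to the full-precision weights.
            w = [wi - lr * gi for wi, gi in zip(w, grads)]
    return w


data = [([1.0, 0.5], 0.0), ([0.25, 1.0], -0.875), ([0.5, 0.5], -0.25)]
trained = qat_train([0.4, -0.8], data)
```

The master weights stay in full precision throughout; only the forward operands and the gradients pass through the quantizer, which mirrors the split between deterministic forward rounding and stochastic backward rounding described in the list.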
5. Comparative Analysis with Existing 8-bit Formats
AF8 is contrasted against industry 8-bit floating-point alternatives on several axes:
| Format | Block scaling | Dynamic range | Hidden bit | Subnormals | QAT/PTQ |
|---|---|---|---|---|---|
| FP8 E4M3 | AMAX required | Limited | Yes | Trapped | PTQ OK |
| OCP MX | AMAX required | Shared exponent across block | Yes | Yes | PTQ OK |
| AF8 | Free | Wide (orders of magnitude beyond FP8 E4M3) | No | Branchless (1 step) | QAT required |
| AF16 | Free | Matches bfloat16 | No | Branchless | PTQ OK |
- Use-Case Highlights: AF8’s dynamic range is advantageous for LLM inference workloads with wide activation distributions. The lexicographic one’s-complement supports efficient zero-cycle control flows, permitting ReLU and max-pooling via integer compares without FPU bypass.
- Hardware Optimization: 33% reduction in area, 22% power reduction, and 12% lower critical path in MAC cores compared to FP8.
- Near-lossless Training: AF16 extends AF8’s design principles to match bfloat16 dynamic range and enables effective post-training quantization.
6. Alternative Encodings and Related Work
Alternative floating-point and log-domain formats have been proposed for reduced-precision inference. A related 8-bit log-float format, described in "Rethinking floating point for deep learning" (Johnson, 2018), employs a tapered posit-style regime-exponent encoding and a hybrid log-multiply/linear-add datapath with Kulisch accumulation. This log-float achieves competitive accuracy (–0.9% top-1 on ResNet-50/ImageNet), synthesizes at 0.96× the power and 1.12× the area of an int8/32 MAC, and supports drop-in replacement without retraining, unlike AF8 with its QAT requirement. Log-float architectures provide further dynamic range (4145 dB), while AF8 maximizes area and power reduction under quad-radix scaling and explicit integer comparability. Both families exemplify the larger trend toward architecture-aligned floating-point representations for efficient AI hardware (Morisaki, 26 Feb 2026; Johnson, 2018).