Block-Wise Optimal Float (BOF4)
- BOF4 is a quantization scheme for neural networks that uses 4-bit block-wise representations with mathematically optimized codebooks.
- It minimizes reconstruction error through block normalization and iterative centroid optimization using both analytical and data-driven methods.
- Variants like BOF4-S and OPQ extend the approach to handle signed normalization and outliers, enhancing hardware efficiency and model fidelity.
Block-Wise Optimal Float (BOF4) is a family of quantization schemes for neural network weights and activations, targeting extreme memory and compute efficiency with minimal accuracy loss. Centered on 4-bit block-wise representations, BOF4 minimizes reconstruction error within quantization blocks, providing both mathematically principled codebooks and hardware-amenable formats. Recent research covers the original BOF4 as a block-wise optimal quantizer for normally distributed weights, its variants for signed normalization and outlier handling, and hardware-efficient and mixed-dialect extensions for accelerators and LLM-scale inference.
1. Mathematical Foundations and Quantizer Definition
The original BOF4 formulation targets the minimization of quantization error for normalized weights within blocks. Let $W$ denote a tensor of model parameters or activations, partitioned into blocks of size $I$. Within each block, values $w_1, \dots, w_I$ are normalized by their (absolute or signed) block-maximum, yielding:
$$x_i = \frac{w_i}{m}, \qquad m = \max_{1 \le j \le I} |w_j| \ \text{(absmax)} \quad \text{or} \quad m = w_{j^*},\ j^* = \arg\max_{j} |w_j| \ \text{(signed absmax)}.$$
A 4-bit scalar quantizer then maps each $x_i$ to one of $2^4 = 16$ reconstruction levels $c_1 < \dots < c_{16}$. BOF4 codebooks are computed to minimize either the mean absolute error (MAE) or the mean squared error (MSE) of reconstruction, using Lloyd-Max or k-medians iteration over the distribution $p_X$ induced by block-wise normalization. The codebook optimization objective for cell $k$ is
$$c_k = \arg\min_{c} \int_{b_{k-1}}^{b_k} |x - c|^{r}\, p_X(x)\, dx, \qquad r = 1 \ \text{(MAE)}, \quad r = 2 \ \text{(MSE)},$$
with $b_k$ denoting the bin boundaries. In practice, BOF4 centroids are estimated via either closed-form integration under a Gaussian, or data-driven weighted Lloyd-Max iterations using large samples from neural weights (Yoshida, 2023, Blumenberg et al., 10 May 2025).
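The normalization and nearest-level mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the codebook is a plain uniform 16-level grid standing in for the optimized BOF4 levels, which are not reproduced here.

```python
import numpy as np

def quantize_blockwise(w, codebook, block_size):
    """Absmax-normalize each block and map values to the nearest codebook level."""
    w = np.asarray(w, dtype=np.float64)
    assert w.size % block_size == 0, "sketch assumes the tensor needs no padding"
    blocks = w.reshape(-1, block_size)
    # Per-block scale: the largest magnitude in the block.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(scales == 0.0, 1.0, scales)  # avoid 0/0 for all-zero blocks
    x = blocks / safe                            # normalized values in [-1, 1]
    # Nearest-level search over the 16 reconstruction levels.
    idx = np.abs(x[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, scales

def dequantize_blockwise(idx, scales, codebook):
    """Look up each 4-bit index and rescale by the block scale."""
    return (codebook[idx] * scales).reshape(-1)

# Placeholder codebook (assumption): uniform grid, NOT the optimized BOF4 levels.
codebook = np.linspace(-1.0, 1.0, 16)
rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
idx, scales = quantize_blockwise(w, codebook, block_size=64)
w_hat = dequantize_blockwise(idx, scales, codebook)
mse = float(np.mean((w - w_hat) ** 2))
```

Swapping in a different 16-entry `codebook` array changes only the lookup table, which is why the codebook can be optimized independently of the surrounding block machinery.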
2. Block-Wise Optimality: Distributional Adaptivity and Block Size
NF4 and related quantization schemes use fixed quantiles or midpoints assuming i.i.d. standard normal entries. However, block-wise normalization causes the intra-block normalized distribution $p_X(x)$ to concentrate increasingly near zero as block size $I$ increases:
- For large $I$, the distribution $p_X(x)$ becomes sharply peaked at $x = 0$.
- The probability that $x$ achieves boundary values ($x = \pm 1$) scales as $1/I$, since exactly one element per block attains the block maximum.
- The cumulative distribution function for the interior is nonuniform and $I$-dependent.
BOF4 solves for codepoint locations that minimize expected error under the normalized distribution $p_X$, yielding quantizers that adapt to the actual statistics induced by block-wise normalization. Empirically, BOF4 achieves negligible gains over NF4 at small $I$ but significant improvements (5–10% in perplexity, up to 10–20% lower MAE/MSE) as $I$ grows (Yoshida, 2023, Blumenberg et al., 10 May 2025). The optimal block size for a 4-bit mantissa is analytically and numerically determined to be $I = 64$ (Soloveychik et al., 2022).
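The dependence of the normalized distribution on block size is easy to check by simulation. The sketch below assumes i.i.d. standard normal weights and absmax normalization; the function name and the chosen block sizes are illustrative only.

```python
import numpy as np

def normalized_block_stats(block_size, n_blocks=2000, seed=0):
    """Return (fraction of samples at |x| = 1, standard deviation of x)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_blocks, block_size))
    # Absmax normalization: the largest-magnitude element of each block maps to +/-1.
    x = w / np.abs(w).max(axis=1, keepdims=True)
    boundary_frac = float(np.mean(np.abs(x) == 1.0))
    return boundary_frac, float(x.std())

stats = {I: normalized_block_stats(I) for I in (8, 64, 512)}
for I, (frac, spread) in stats.items():
    print(f"I={I:4d}  P(|x|=1)={frac:.5f}  1/I={1 / I:.5f}  std(x)={spread:.3f}")
```

The boundary fraction tracks $1/I$ because one element per block defines the scale, while the shrinking standard deviation reflects the mass moving toward zero as the block maximum grows with $I$.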
3. BOF4 Variants: Signed Normalization and Outlier-Preserving Extensions
Several BOF4 extensions address further representational inefficiencies and real-model distributional pathologies:
- BOF4-S (Signed-Absmax Normalization): Rather than normalizing by block absmax, normalization is by the signed maximum, $x_i = w_i / w_{j^*}$ with $j^* = \arg\max_j |w_j|$. This ensures that only $+1$ needs to be exactly representable, preventing "wasted" quantizer capacity on unattainable $-1$ values and resulting in 10–20% lower error (Blumenberg et al., 10 May 2025).
- OPQ (Outlier-Preserving Quantization): To handle blocks with rare outliers, where the block maximum far exceeds typical values, OPQ stores outliers above a quantile-derived threshold in 16 bits, replaces them with zero in the block, and applies BOF4-S to the non-outlier remainder. This yields ≲1.5% memory overhead and maintains model perplexity at large block sizes, because the quantizer's dynamic range is no longer consumed by the outliers (Blumenberg et al., 10 May 2025).
| Variant | Normalization | Outlier Handling | Best-Reported MAE/MSE | Hardware Impact |
|---|---|---|---|---|
| BOF4 | absmax | none | Baseline | Minimal overhead |
| BOF4-S | signed-absmax | none | 10–20% lower | Identical to BOF4 |
| OPQ+BOF4-S | signed-absmax | 16-bit outlier path | Best (up to 10% better than AF4/NF4) | ~1–2% extra storage |
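
Both variants can be illustrated with a short NumPy sketch. The function names are hypothetical, and the outlier rule used here (a single global magnitude quantile `q`) is an assumption made for illustration; the cited work's exact threshold derivation is not reproduced.

```python
import numpy as np

def signed_absmax_normalize(blocks):
    """Divide each block by its signed largest-magnitude element, which then maps to +1."""
    j = np.abs(blocks).argmax(axis=1)
    signed_max = blocks[np.arange(blocks.shape[0]), j][:, None]
    safe = np.where(signed_max == 0.0, 1.0, signed_max)  # guard all-zero blocks
    return blocks / safe, signed_max

def split_outliers(w, q=0.99):
    """OPQ-style split: move large-magnitude values to a 16-bit side path."""
    threshold = np.quantile(np.abs(w), q)      # assumed threshold rule
    mask = np.abs(w) > threshold
    outlier_idx = np.flatnonzero(mask)
    outlier_val = w[mask].astype(np.float16)   # outliers kept in 16 bits
    dense = np.where(mask, 0.0, w)             # outliers replaced by zero
    return dense, outlier_idx, outlier_val, threshold

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
w[::257] *= 20.0                               # inject a few large values
dense, out_idx, out_val, threshold = split_outliers(w, q=0.99)
x, signed_max = signed_absmax_normalize(dense.reshape(-1, 64))
```

After the split, every block's scale is bounded by the threshold, so a handful of extreme values no longer compresses the remaining weights into a few codebook levels near zero.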
4. Hardware Considerations and Block-Floating Point Architectures
Block-wise floating point (BFP and variants) and block-wise optimal quantization formats are converging on efficient integer arithmetic and shared block scaling. State-of-the-art accelerator integration includes:
- FC-SRAM-Backed LUTs: BOF4 in the SOP quantization framework places the 16-entry codebook in FC-SRAMs, with 4-bit indices and a per-block scale (e.g., UE4M3) (Killian, 14 May 2026); decode is a one-cycle operation, and the block size is chosen to suit tiled GEMM hardware.
- BlockDialect: BOF4 generalizes to per-block optimal selection from a "DialectFP4" formatbook (a set of FP4 E2M1 variants), dynamically choosing the dialect that minimizes per-block quantization error (Jang et al., 2 Jan 2025). This improves effective bits-per-weight and energy efficiency.
- Scale-free Floating-Point Variants: BOF4 also denotes quad-radix (base-4) floating point, as in AetherFloat-8 (AF8), which sidesteps the need for runtime AMAX (block-scale) calculation by virtue of an exponentially enlarged dynamic range. AF8 fits this range in 8 bits and yields reductions of 33.17% in area, 21.99% in power, and 11.73% in critical path versus FP8 E4M3 (Morisaki, 26 Feb 2026).
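The LUT-based decode in the first bullet amounts to unpacking 4-bit indices and reading a 16-entry table. The sketch below models that data path in software under simplifying assumptions: two indices are packed per byte with the first in the high nibble, scales are kept as ordinary floats rather than a UE4M3 encoding, and all names are hypothetical.

```python
import numpy as np

def pack_nibbles(idx):
    """Pack pairs of 4-bit indices into bytes (first index in the high nibble)."""
    idx = np.asarray(idx, dtype=np.uint8)
    assert idx.size % 2 == 0 and int(idx.max()) < 16
    return (idx[0::2] << 4) | idx[1::2]

def unpack_and_decode(packed, lut, scales, block_size):
    """Unpack nibbles, look each up in the 16-entry LUT, and apply block scales."""
    hi = packed >> 4
    lo = packed & 0x0F
    idx = np.stack([hi, lo], axis=1).reshape(-1)  # restore original order
    return (lut[idx].reshape(-1, block_size) * scales[:, None]).reshape(-1)

rng = np.random.default_rng(0)
idx = rng.integers(0, 16, size=128, dtype=np.uint8)
lut = np.linspace(-1.0, 1.0, 16)     # placeholder table, not the BOF4 codebook
scales = np.array([0.5, 2.0])        # one scale per block of 64
packed = pack_nibbles(idx)
decoded = unpack_and_decode(packed, lut, scales, block_size=64)
```

Because decoding is a table read followed by one multiply, replacing the table contents changes the number format without touching the surrounding logic.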
5. Optimization Algorithms and Codebook Computation
BOF4 centroids are derived via either theoretical closed-form or Monte Carlo (Lloyd-Max) optimization:
- Initialization: Uniform grid over $[-1, 1]$ with fixed levels $-1$, $0$, and $+1$.
- Assignment: Map each normalized sample $x$ to its closest centroid.
- Update: Adjust centroids by weighted mean (MSE) or weighted median (MAE), where each sample is weighted by a power of its block maximum.
- Iteration: Repeat assignment/update until convergence (≈10–20 iterations) (Blumenberg et al., 10 May 2025, Yoshida, 2023).
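The four steps above can be sketched as a weighted Lloyd-Max loop for the MSE case. Several details are assumptions made for illustration: the function name, the pinned levels $-1$, $0$, $+1$, synthetic Gaussian weights in place of real model weights, and sample weights equal to the squared block maximum.

```python
import numpy as np

def lloyd_max_codebook(x, weights, n_levels=16, fixed=(-1.0, 0.0, 1.0), n_iter=20):
    """Weighted Lloyd-Max (MSE) on normalized samples with some levels pinned."""
    levels = np.linspace(-1.0, 1.0, n_levels)
    # Pin each requested value by snapping the nearest initial level onto it.
    pinned = np.zeros(n_levels, dtype=bool)
    for v in fixed:
        k = int(np.abs(levels - v).argmin())
        levels[k] = v
        pinned[k] = True
    for _ in range(n_iter):
        # Assignment: nearest level for every sample.
        assign = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        # Update: weighted mean of the samples in each non-pinned cell.
        for k in range(n_levels):
            in_cell = assign == k
            if pinned[k] or not in_cell.any():
                continue
            levels[k] = np.average(x[in_cell], weights=weights[in_cell])
    return levels

rng = np.random.default_rng(0)
w = rng.standard_normal((2000, 64))              # synthetic stand-in for weights
absmax = np.abs(w).max(axis=1, keepdims=True)
x = (w / absmax).reshape(-1)
# Assumed weighting: squared block maximum, broadcast to every sample in the block.
weights = np.broadcast_to(absmax ** 2, w.shape).reshape(-1)
codebook = lloyd_max_codebook(x, weights)
```

For the MAE variant, the update line would use a weighted median instead of `np.average`. Each updated level stays inside its own cell, so the levels remain sorted without an explicit sort.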
For DialectFP4 selection (Jang et al., 2 Jan 2025), a two-stage lookup combines block-level dynamic range narrowing and distribution-aware tie-breaking using efficient bit logic.
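The goal of per-block format selection can be illustrated with a brute-force search that picks, for each block, the candidate level set with the lowest MSE. This is not the two-stage bit-logic lookup of the cited work, and the three candidate formats below are invented placeholders rather than the DialectFP4 formatbook.

```python
import numpy as np

def select_format_per_block(x, formatbook):
    """Return, per block, the index of the lowest-MSE format and that MSE.

    x: (n_blocks, block_size) normalized values
    formatbook: (n_formats, n_levels) candidate reconstruction levels
    """
    diffs = np.abs(x[None, :, :, None] - formatbook[:, None, None, :])
    nearest_err = diffs.min(axis=-1)          # (n_formats, n_blocks, block_size)
    mse = (nearest_err ** 2).mean(axis=-1)    # (n_formats, n_blocks)
    return mse.argmin(axis=0), mse.min(axis=0)

# Hypothetical formatbook: a uniform grid and two grids warped toward zero.
grid = np.linspace(-1.0, 1.0, 16)
formatbook = np.stack([np.sign(grid) * np.abs(grid) ** p for p in (1.0, 1.5, 2.0)])

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))
x = w / np.abs(w).max(axis=1, keepdims=True)
choice, best_mse = select_format_per_block(x, formatbook)
```

Storing `choice` costs a few extra bits per block, which is where the fractional effective bits-per-weight of such schemes comes from.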
The choice of error metric (MAE vs MSE) and normalization convention (absmax vs signed-absmax) determines which centroids are optimal for the intended objective and target distribution. Empirical validation confirms that the closed-form update and the Monte Carlo estimate converge to the same centroids in the large-sample limit.
6. Empirical Results and Comparative Performance
Across LLMs (LLaMA3.1 8B, Qwen-2.5 7B), BOF4 and its enhancements deliver:
- Substantially lower MAE/MSE than both NF4 and AF4, with BOF4-S+OPQ yielding a 10–20% improvement at typical block sizes ($I = 64$).
- Improved perplexity retention at large block sizes: BOF4-S+OPQ preserves PPL < 10 up to $I = 512$, while NF4 and AF4 degrade for $I > 128$ (Blumenberg et al., 10 May 2025).
- In SOP hardware (Killian, 14 May 2026), BOF4 at 4.5 bpw nearly matches or improves on the weight MSE of conventional FP8 E4M3 (8.0 bpw).
- BOF4 as a per-block selection within BlockDialect improves zero-shot accuracy and lowers perplexity by 3.5–9.7 points over MXFP4, narrowing the gap to FP16 at just 4.2–4.3 effective bits (Jang et al., 2 Jan 2025).
- High hardware throughput and energy savings are achievable: on-chip MAC units for BOF4 (INT or FP4) use under 250 µm² and 135 µW at 0.5 GHz (45 nm), yielding area and power savings over FP6 and INT8 (Jang et al., 2 Jan 2025).
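The bits-per-weight (bpw) figures quoted above combine the 4-bit index with the amortized cost of per-block metadata. The helper below is a hypothetical back-of-the-envelope calculation; the scale widths and block sizes in the examples are assumptions chosen to show the arithmetic, not configurations confirmed by the cited papers.

```python
def bits_per_weight(index_bits, scale_bits, block_size):
    """Effective storage per weight: index bits plus the amortized block scale."""
    return index_bits + scale_bits / block_size

# An 8-bit scale shared by 16 weights adds 0.5 bits per weight.
bpw_small_block = bits_per_weight(4, 8, 16)    # 4.5
# A 16-bit scale shared by 64 weights adds 0.25 bits per weight.
bpw_large_block = bits_per_weight(4, 16, 64)   # 4.25
```

Larger blocks amortize the scale over more weights, which is the storage side of the block-size trade-off; the accuracy side is the growing mismatch between a fixed codebook and the normalized distribution.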
7. Theoretical Justification, Limitations, and Applicability
BOF4’s optimality is block-relative and assumes i.i.d. or near-Gaussian distributions post-normalization; non-Gaussian or outlier-heavy cases necessitate variants (BOF4-S, OPQ). The format is best leveraged when hardware can exploit block-wise scaling or mixed-format arithmetic, as in modern accelerator or energy-efficient inference pipelines.
Block sizes are fundamentally constrained by the trade-off between error variance and scale quantization accuracy: for 4-bit mantissas, $I = 64$ minimizes the REBAC ratio, with empirical model validation confirming this prediction (Soloveychik et al., 2022). For tasks where block scaling is infeasible or dynamic range must be guaranteed, "scale-free" hardware-optimized BOF4 instantiations (e.g., AF8) become the preferred option (Morisaki, 26 Feb 2026).
BOF4 codebooks are fixed after training but can be adapted or paired with other LUTs (e.g., Split87, SH4) in per-layer quantization for further gains (Killian, 14 May 2026).
BOF4 and its variants form a rigorously optimized, hardware-aligned, and empirically validated toolkit for low-bit neural quantization, achieving substantial compression and compute efficiency with virtually no loss in model fidelity when deployed at optimal block size and with proper handling of distributional tail events (Yoshida, 2023, Blumenberg et al., 10 May 2025, Soloveychik et al., 2022, Jang et al., 2 Jan 2025, Morisaki, 26 Feb 2026, Killian, 14 May 2026).