BOF4: Block-wise Optimal Float
- Block-wise Optimal Float (BOF4) is a 4-bit block quantization method that minimizes reconstruction error through optimized codebooks and block-level normalization.
- It leverages statistical modeling and algorithms like Lloyd’s to determine optimal quantization levels, achieving near-optimal inner-product performance with a standard block size of 64.
- BOF4 incorporates robust outlier handling and efficient encoding/decoding strategies to maintain high accuracy in large-scale deep neural networks and language models while reducing memory and compute overhead.
Block-wise Optimal Float (BOF4) is a class of block-quantized floating-point representations in which the quantization scheme, codebook, and block size are optimized for high information fidelity at low bit-width, particularly targeting efficient deployment and inference in large-scale deep neural networks and LLMs. The BOF4 family achieves near-optimal quantization error—especially for inner-product evaluations—by adapting both the quantization levels (codebook) and normalization strategy to the true tail behavior and intra-block value distribution, typically using 4 bits per element. It supersedes earlier schemes such as “normal float 4-bit” (NF4) by explicitly minimizing reconstruction error under the block-wise distributional constraints, and in recent implementations also supports empirically robust outlier handling. Multiple independent lines of work, both theoretical and applied, converge on related formulations for BOF4, with consensus around using a block size of 64 and specialized centroids and thresholds for LLM and DNN quantization.
1. Formal Definition and Core Construction
BOF4 arises from the general block floating-point (BFP) framework, where a vector or tensor is partitioned into contiguous blocks of size $B$ (different notations are used across the literature). Each block encodes:
- A shared scaling factor (an exponent or the block's maximum/absmax), written $s_b$ below.
- Per-element quantized mantissas with $m$ bits (typically $m = 4$), representing signed integers in $\{-2^{m-1}, \dots, 2^{m-1}-1\}$ or, equivalently, indices into a $2^m$-entry codebook.
The canonical BOF4 quantization procedure is as follows (a minimal sketch is given after the list):
- For block $b$: compute the scale $s_b = \max_{i \in b} |x_i|$ (or a variant such as the signed absolute maximum).
- Normalize: $u_i = x_i / s_b$.
- Quantize: map each $u_i$ to the nearest codebook value among the $2^4 = 16$ levels $c_0, \dots, c_{15}$, i.e., $q_i = \arg\min_k |u_i - c_k|$.
- Dequantize: $\hat{x}_i = s_b \, c_{q_i}$.
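A minimal NumPy sketch of this procedure follows; the uniform 16-level codebook and the function names are illustrative placeholders, not the optimized BOF4 codebook or any reference implementation.

```python
import numpy as np

# Placeholder 16-level codebook (uniform on [-1, 1]); an actual BOF4 codebook
# would come from the statistical fit described in Section 2.
CODEBOOK = np.linspace(-1.0, 1.0, 16)

def quantize_block(x, codebook=CODEBOOK):
    """Quantize one block: returns (shared absmax scale, 4-bit code indices)."""
    s = np.max(np.abs(x))
    u = x / s if s > 0 else np.zeros_like(x)          # normalized coordinates
    idx = np.argmin(np.abs(u[:, None] - codebook[None, :]), axis=1)
    return s, idx.astype(np.uint8)

def dequantize_block(s, idx, codebook=CODEBOOK):
    """Reconstruct x_hat = s * c[q_i]."""
    return s * codebook[idx]

# Example on one block of 64 Gaussian weights.
rng = np.random.default_rng(0)
x = rng.normal(size=64).astype(np.float32)
s, idx = quantize_block(x)
x_hat = dequantize_block(s, idx)
print("block MSE:", float(np.mean((x - x_hat) ** 2)))
```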
BOF4 differs from related formats (e.g., BFP, SBFP, NF4) by tailoring the codebook, threshold layout, and block size for minimum end-to-end quantization error according to rigorous statistical modeling of the normalized coordinates $u_i$, as opposed to using fixed-point or quantile designs (Soloveychik et al., 2022, Blumenberg et al., 10 May 2025).
2. Theoretical Optimization and Statistical Modeling
The error-minimization underlying BOF4 is grounded in both asymptotic analysis and empirical blockwise loss minimization:
- Given blockwise i.i.d. Gaussian inputs $x_i \sim \mathcal{N}(0, \sigma^2)$, the distribution of the normalized coordinate $u_i = x_i / s_b$ in a given block of size $B$ is a mixture: a point mass at $\pm 1$ for the blockwise extreme(s), plus a continuous density concentrated around zero for typical coordinates.
- Specific MSE- and MAE-minimizing quantization levels are computed via weighted EM (Lloyd's algorithm), or by closed-form integration when possible:
$$c_k = \frac{\int_{I_k} u \, p_B(u) \, \mathrm{d}u}{\int_{I_k} p_B(u) \, \mathrm{d}u},$$
where $I_k$ denotes the $k$-th quantization interval and $p_B$ the density of the normalized coordinates induced by the block-maximum distribution (Blumenberg et al., 10 May 2025).
- For MSE, this reduces to centroids as weighted means; for MAE, to weighted medians.
Empirically, the data-driven and theoretical codebook generation methods agree to high numerical accuracy, supporting the general applicability of the EM-derived codebook to a wide range of weight/activation statistics; a minimal fitting sketch is given below.
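The following is a minimal sketch of such a data-driven fit, assuming Gaussian blocks of size 64 normalized by their absmax and an MSE (weighted-mean) Lloyd update; the sample count, initialization, and function name are illustrative assumptions rather than the cited procedure.

```python
import numpy as np

def fit_codebook(u, n_levels=16, n_iter=50):
    """Lloyd-style fit: alternate nearest-level assignment and centroid updates."""
    c = np.quantile(u, np.linspace(0.0, 1.0, n_levels))   # quantile initialization
    for _ in range(n_iter):
        idx = np.argmin(np.abs(u[:, None] - c[None, :]), axis=1)
        for k in range(n_levels):
            members = u[idx == k]
            if members.size:
                c[k] = members.mean()   # MSE centroid; use np.median for an MAE fit
        c.sort()
    return c

# Empirical normalized coordinates: Gaussian blocks of size 64, absmax-normalized.
rng = np.random.default_rng(0)
blocks = rng.normal(size=(5_000, 64))
u = (blocks / np.max(np.abs(blocks), axis=1, keepdims=True)).ravel()
print(fit_codebook(u))
```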
3. Block Size Selection and Error Bounds
BOF4’s effectiveness depends on precise tuning of block size for a given bit width. Statistical and empirical analysis has shown:
- For $m = 4$ (4-bit mantissas), the optimal block size minimizing the inner-product error variance is $B = 64$ [(Soloveychik et al., 2022), Proposition 4, Figure 1].
- The variance of the inner-product quantization error is bounded and concentrates sub-Gaussianly; numerical evaluation of the associated integrals matches Monte Carlo simulations across synthetic and real DNN weight distributions.
- The "Relative Error Bound Accuracy Comparison" (REBAC), defined as the error variance ratio relative to the ideal SBFP reference, reaches its minimum at $B = 64$ for both synthetic and real-world neural weights.
| Block Size | REBAC | Description |
|---|---|---|
| 32 | higher | increased error due to underutilization |
| 64 | minimum | optimal trade-off, defines BOF4 |
| 128+ | increases | error grows due to block max effect |
This formalizes BOF4 as 4-bit BFP with $B = 64$.
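As a rough illustration of how such error estimates are obtained, the Monte Carlo sketch below measures the inner-product quantization error variance at a few block sizes, using a plain uniform 4-bit codebook rather than the optimized BOF4 codebook; it does not reproduce the REBAC normalization against the SBFP reference, and the dimension and trial count are arbitrary choices.

```python
import numpy as np

def bfp4_dequant(x, block_size, n_levels=16):
    """Absmax-normalized blockwise 4-bit quantize-dequantize (uniform codebook)."""
    codebook = np.linspace(-1.0, 1.0, n_levels)
    xb = x.reshape(-1, block_size)
    s = np.max(np.abs(xb), axis=1, keepdims=True)
    u = xb / s
    idx = np.argmin(np.abs(u[..., None] - codebook), axis=-1)
    return (s * codebook[idx]).ravel()

def inner_product_error_var(block_size, dim=4096, n_trials=200, seed=0):
    """Monte Carlo variance of the quantization-induced inner-product error."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        a, b = rng.normal(size=dim), rng.normal(size=dim)
        errs.append(a @ b - bfp4_dequant(a, block_size) @ bfp4_dequant(b, block_size))
    return float(np.var(errs))

for B in (32, 64, 128):
    print(B, inner_product_error_var(B))
```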
4. Codebook Adaptation, Variations, and Comparison to Existing Schemes
While earlier block quantization formats such as NF4 or AF4 use either fixed codebooks or codebooks derived from marginal distributions (e.g., quantiles of a standard Gaussian rescaled to $[-1, 1]$), BOF4 explicitly adapts its codebook levels to the empirical, block-size-dependent distribution of the normalized coordinates:
- NF4 is not information-theoretically optimal across all block sizes; at large block sizes its outer codepoints are placed at normalized values that are seldom encountered, since the coordinates concentrate near zero (Yoshida, 2023).
- BOF4 solves for codebooks that minimize the actual reconstruction error (L1 or L2) given the empirical distribution of normalized coordinates for each block size $B$.
- For small block sizes ($32, 64$), BOF4 and NF4 perform similarly in mean absolute error and downstream LLM perplexity.
- For large block sizes ($128$ and above), BOF4's adaptation to this central concentration yields 10–20% lower mean absolute error and consistently lower perplexity in LLMs (Yoshida, 2023, Blumenberg et al., 10 May 2025).
- For block size 64 (the default), BOF4 achieves marginal but consistent gains in perplexity relative to NF4/AF4 (Table 1 in (Blumenberg et al., 10 May 2025)).
BOF4-S is a variant in which blocks are normalized by the signed absolute maximum (the normalization factor takes the sign of the block's extreme value, so that the extreme always maps to $+1$), improving codepoint allocation for the resulting unimodal distribution and further lowering error; a short sketch follows.
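Below is a minimal sketch of this signed-absmax normalization (NumPy, illustrative names); dividing by the signed value of the block's absolute maximum maps the extreme element to $+1$.

```python
import numpy as np

def signed_absmax_scale(block):
    """Return the block's extreme value with its sign (BOF4-S normalization factor)."""
    return block[np.argmax(np.abs(block))]

block = np.array([0.2, -1.5, 0.7, 0.1])
s = signed_absmax_scale(block)   # -1.5
u = block / s                    # extreme element maps to +1.0
print(s, u)
```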
5. Outlier Robustness and Mixed-Precision Augmentation
Blockwise quantization is sensitive to outlier values: a single high-magnitude element in a block can force the scale factor upward, compressing other elements toward zero.
BOF4 implementations incorporate several strategies for outlier handling:
- Channel (row) permutation (“K-sort”): rearrange tensor rows or channels by norm to collect outlier-heavy blocks together, minimizing intra-block dynamic range and reducing quantization-induced error for non-outlier elements (Trukhanov et al., 29 Mar 2024).
- Compile-time application only; no inference overhead.
- Outlier-Preserving Quantization (OPQ): Detect and store outlier weights at full 16-bit precision, while quantizing non-outliers with BOF4–S. Outliers are defined per block as weights exceeding a specified quantile of the absolute max distribution (Blumenberg et al., 10 May 2025).
In both approaches, empirical evaluation shows a significant reduction in mean squared quantization error and perplexity, with near-FP16 accuracy (within 0.3–1.5%) and close to a 2× memory saving over 8-bit quantization.
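A minimal sketch of the OPQ split follows, under the assumption that the outlier threshold is a quantile of the per-block absmax values; the quantile, synthetic data, and function name are illustrative, and the dense remainder would subsequently be quantized with BOF4-S.

```python
import numpy as np

def opq_split(w, block_size=64, quantile=0.99):
    """Return (outlier mask, FP16 outlier values, zero-filled dense remainder)."""
    absmax = np.max(np.abs(w.reshape(-1, block_size)), axis=1)
    thresh = np.quantile(absmax, quantile)     # quantile of the block-absmax distribution
    mask = np.abs(w) > thresh
    outliers = w[mask].astype(np.float16)      # stored sparsely at 16-bit precision
    dense = np.where(mask, 0.0, w)             # remainder goes through BOF4-S
    return mask, outliers, dense

rng = np.random.default_rng(0)
w = rng.standard_t(df=4, size=64 * 256)        # heavy-tailed synthetic weights
mask, outliers, dense = opq_split(w)
print(f"{outliers.size} of {w.size} weights kept in FP16")
```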
6. Implementation and Deployment
The BOF4 family is designed for practical hardware and software integration:
- Offline: Codebook and threshold computation is fast (milliseconds) for each block size, using closed-form updates or Lloyd's algorithm.
- Runtime encoding: Matching codebooks and threshold tables are looked up per block; quantization reduces to normalization, a binary search, and table lookup.
- Decoding: Each block requires its scale and 4-bit code array; reconstruction is $\hat{x}_i = s_b \, c_{q_i}$, efficiently implemented as a multiply-accumulate kernel.
- Memory and compute efficiency: For $B = 64$, $8$ shared exponent bits per block, and $m = 4$ mantissa bits per element, the cost is $4 + 8/64 = 4.125$ bits/element, a nearly $2\times$ saving over 8-bit integer and nearly $4\times$ over FP16 (Trukhanov et al., 29 Mar 2024).
- Hardware compatibility: Supports integer MAC units and fast barrel shifters; block-level layout is cache friendly; inference speed is indistinguishable from fixed-point/NF4 (Yoshida, 2023, Trukhanov et al., 29 Mar 2024).
The main deployment requirement is to ship a small lookup table of 16 floats and 15 thresholds per chosen block size.
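The sketch below illustrates this encode/decode path, assuming nearest-codepoint assignment via the 15 midpoint thresholds of any sorted 16-level codebook (the uniform codebook here is only a placeholder).

```python
import numpy as np

def encode_block(x, codebook):
    """Normalize, then binary-search the 15 decision thresholds for 4-bit codes."""
    thresholds = (codebook[:-1] + codebook[1:]) / 2
    s = np.max(np.abs(x))
    u = x / s if s > 0 else np.zeros_like(x)
    idx = np.searchsorted(thresholds, u)       # binary search -> indices in [0, 15]
    return s, idx.astype(np.uint8)

def decode_block(s, idx, codebook):
    """Table lookup followed by a single scale multiply."""
    return s * codebook[idx]

codebook = np.linspace(-1.0, 1.0, 16)          # placeholder sorted codebook
rng = np.random.default_rng(0)
x = rng.normal(size=64)
s, idx = encode_block(x, codebook)
print("max abs error:", float(np.max(np.abs(x - decode_block(s, idx, codebook)))))
```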
7. Empirical Performance on LLMs
BOF4 and its variants have been evaluated on large-scale LLMs (Llama-3.1 8B, Qwen-2.5 7B, Mistral 7B):
| Model | NF4 | AF4 | BOF4 (MSE) | BOF4-S (MSE) | BOF4-S (MSE)+OPQ |
|---|---|---|---|---|---|
| Llama-3.1 8B PPL | 8.53 | 8.51 | 8.51 | 8.46 | 8.43 |
| Qwen-2.5 7B PPL | 9.89 | 9.91 | 9.94 | 9.88 | 9.83 |
| Mistral 7B PPL | 8.90 | 8.90 | 8.89 | 8.88 | 8.87 |
BOF4-S (with MSE optimization) matches or outperforms the baselines, and OPQ provides further improvement, especially at large block sizes. These findings hold across both synthetic and natural network distributions.
Block-wise Optimal Float (BOF4) and its recent extensions comprise the state-of-the-art in 4-bit block-wise quantization for deep learning weights and activations, balancing memory, computational efficiency, and minimal degradation in numerical accuracy, with variants delivering empirical results competitive with much higher precision floating-point baselines (Soloveychik et al., 2022, Yoshida, 2023, Trukhanov et al., 29 Mar 2024, Blumenberg et al., 10 May 2025).