BOF4: Block-wise Optimal Float

Updated 13 November 2025
  • Block-wise Optimal Float (BOF4) is a 4-bit block quantization method that minimizes reconstruction error through optimized codebooks and block-level normalization.
  • It leverages statistical modeling and algorithms like Lloyd’s to determine optimal quantization levels, achieving near-optimal inner-product performance with a standard block size of 64.
  • BOF4 incorporates robust outlier handling and efficient encoding/decoding strategies to maintain high accuracy in large-scale deep neural networks and language models while reducing memory and compute overhead.

Block-wise Optimal Float (BOF4) is a class of block-quantized floating-point representations in which the quantization scheme, codebook, and block size are optimized for high information fidelity at low bit-width, particularly targeting efficient deployment and inference in large-scale deep neural networks and LLMs. The BOF4 family achieves near-optimal quantization error—especially for inner-product evaluations—by adapting both the quantization levels (codebook) and normalization strategy to the true tail behavior and intra-block value distribution, typically using 4 bits per element. It supersedes earlier schemes such as “normal float 4-bit” (NF4) by explicitly minimizing reconstruction error under the block-wise distributional constraints, and in recent implementations also supports empirically robust outlier handling. Multiple independent lines of work, both theoretical and applied, converge on related formulations for BOF4, with consensus around using a block size of 64 and specialized centroids and thresholds for LLM and DNN quantization.

1. Formal Definition and Core Construction

BOF4 arises from the general block floating-point (BFP) framework, where a vector or tensor is partitioned into contiguous blocks of size $n$ (denoted $B$, $I$, or $N$ in different notations). Each block encodes:

  • A shared scaling factor (exponent or per-block maximum/absmax, denoted $S$, $m_b$, or $M$).
  • Per-element quantized mantissas $M_i$ with $p$ bits (typically $p = 4$), representing signed integers in $\{-(2^{p-1}-1), \dots, 2^{p-1}-1\}$.

The canonical BOF4 quantization procedure is as follows:

  1. For block $b$: compute $m_b = \max_i |w_{b,i}|$ (or a variant such as the signed absolute maximum).
  2. Normalize: $x_{b,i} = w_{b,i}/m_b \in [-1, 1]$.
  3. Quantize: map $x_{b,i}$ to the nearest codebook value $\hat x(\ell)$ among $L = 16$ levels via $\ell^* = \arg\min_\ell |x_{b,i} - \hat x(\ell)|$.
  4. Dequantize: $\hat w_{b,i} = m_b \cdot \hat x(\ell^*)$.
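
A minimal NumPy sketch of steps 1–4 is given below, assuming the tensor length is a multiple of the block size and taking the 16-level codebook as a NumPy array argument; the function names and array layout are illustrative, not taken from the cited implementations.

```python
import numpy as np

def bof4_quantize(weights, codebook, block_size=64):
    """Block-wise 4-bit quantization against a given 16-level codebook in [-1, 1]."""
    w = weights.reshape(-1, block_size)              # step 1: partition into blocks
    m = np.max(np.abs(w), axis=1, keepdims=True)     # per-block absmax m_b
    m = np.where(m == 0.0, 1.0, m)                   # guard against all-zero blocks
    x = w / m                                        # step 2: normalize to [-1, 1]
    # step 3: nearest-codeword search over the L = 16 levels (4-bit codes)
    codes = np.argmin(np.abs(x[..., None] - codebook[None, None, :]), axis=-1)
    return codes.astype(np.uint8), m

def bof4_dequantize(codes, scales, codebook):
    """Step 4: reconstruct w_hat_{b,i} = m_b * x_hat(l*)."""
    return scales * codebook[codes]
```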

BOF4 differs from related formats (e.g., BFP, SBFP, NF4) by tailoring the codebook, threshold layout, and block size for minimum end-to-end quantization error according to rigorous statistical modeling of $x_{b,i}$, as opposed to using fixed-point or quantile designs (Soloveychik et al., 2022, Blumenberg et al., 10 May 2025).

2. Theoretical Optimization and Statistical Modeling

The error-minimization underlying BOF4 is grounded in both asymptotic analysis and empirical blockwise loss minimization:

  • Given blockwise i.i.d. Gaussian inputs $X_i \sim N(0, \sigma^2)$, the distribution of the normalized coordinate $X = w/M$ in a given block of size $n$ is a mixture: a point mass at $|X| = 1$ for the blockwise extreme(s), plus a continuous density concentrated around zero for typical coordinates.
  • Specific MSE- and MAE-minimizing quantization levels $\{\hat x(\ell)\}_{\ell=1}^{16}$ are computed via weighted EM (Lloyd's algorithm), or by closed-form integration when possible:

$$
\hat x(\ell) = \frac{\int_0^\infty m^2 \, \mathbb{E}[X \mid M = m,\; X \in \mathcal{R}_\ell] \, p_M(m) \, [F_X(\xi(\ell)\mid M=m) - F_X(\xi(\ell-1)\mid M=m)] \, dm}{\int_0^\infty m^2 \, p_M(m) \, [F_X(\xi(\ell)\mid M=m) - F_X(\xi(\ell-1)\mid M=m)] \, dm}
$$

where $\mathcal{R}_\ell$ denotes the quantization interval and $p_M(m)$ the block-maximum distribution (Blumenberg et al., 10 May 2025).

  • For MSE, this reduces to centroids as weighted means; for MAE, to weighted medians.

Empirically, the data-driven and theoretical codebook generation methods agree to high numerical accuracy (discrepancies below $-60$ dB), supporting the general applicability of the EM-derived codebook to a wide range of weight/activation statistics.
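
A data-driven sketch of this fit under the MSE criterion is shown below, using weighted Lloyd iterations on normalized block coordinates with the $m^2$ sample weighting suggested by the centroid expression above; the function name, initialization, and iteration count are illustrative choices, not the cited procedure verbatim.

```python
import numpy as np

def fit_codebook_mse(weights, block_size=64, n_levels=16, n_iter=50, seed=0):
    """Weighted Lloyd fit of 16 quantization levels to normalized coordinates x = w / m_b.

    Each sample is weighted by m_b^2, mirroring the m^2 factor in the centroid
    integral (MSE criterion); swapping the weighted mean for a weighted median
    would target MAE instead.
    """
    w = weights.reshape(-1, block_size)
    m = np.max(np.abs(w), axis=1, keepdims=True)
    m = np.where(m == 0.0, 1.0, m)
    x = (w / m).ravel()
    wt = np.broadcast_to(m**2, w.shape).ravel()

    rng = np.random.default_rng(seed)
    levels = np.sort(rng.choice(x, size=n_levels, replace=False))  # initialize from data
    for _ in range(n_iter):
        assign = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
        for l in range(n_levels):
            sel = assign == l
            if sel.any():
                levels[l] = np.average(x[sel], weights=wt[sel])    # weighted mean = MSE centroid
        levels = np.sort(levels)
    return levels
```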

3. Block Size Selection and Error Bounds

BOF4’s effectiveness depends on precise tuning of block size $n$ for a given bit width. Statistical and empirical analysis has shown:

  • For $p = 4$ (4-bit mantissas), the optimal block size to minimize inner-product error variance is $n = 64$ [(Soloveychik et al., 2022), Proposition 4, Figure 1].
  • The variance of the quantization error for inner products, $\mathrm{Var}(\Delta E)$, is bounded and concentrates sub-Gaussianly; numerical evaluation of the associated integrals matches Monte Carlo simulations across synthetic and real DNN weight distributions.
  • The "Relative Error Bound Accuracy Comparison" (REBAC, denoted $\rho_{\text{var}}$), defined as the error-variance ratio relative to the ideal SBFP reference, reaches its minimum at $n = 64$ in both synthetic and real-world neural weights.

| Block size $n$ | REBAC $\rho_{\text{var}}(n)$ | Description |
|---|---|---|
| 32 | higher | increased error due to underutilization |
| 64 | minimum | optimal trade-off; defines BOF4 |
| 128+ | increasing | error grows due to the block-max effect |

This formalizes BOF4 as 4-bit BFP with $n = 64$.
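
Claims of this kind can be checked numerically; the harness below is a generic Monte Carlo estimate of the inner-product error variance at a chosen block size, reusing the quantize/dequantize sketch from Section 1. It estimates $\mathrm{Var}(\Delta E)$ for i.i.d. Gaussian inputs and does not by itself compute the REBAC ratio, which additionally normalizes against the SBFP reference.

```python
import numpy as np

def inner_product_error_variance(codebook, dim=4096, block_size=64,
                                 trials=2000, seed=0):
    """Monte Carlo estimate of Var(ΔE), where ΔE = w·a - w_hat·a, for i.i.d.
    Gaussian weights w and activations a quantized block-wise against `codebook`."""
    rng = np.random.default_rng(seed)
    errs = np.empty(trials)
    for t in range(trials):
        w = rng.standard_normal(dim)
        a = rng.standard_normal(dim)
        codes, scales = bof4_quantize(w, codebook, block_size)
        w_hat = bof4_dequantize(codes, scales, codebook).ravel()
        errs[t] = w @ a - w_hat @ a
    return errs.var()
```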

4. Codebook Adaptation, Variations, and Comparison to Existing Schemes

While earlier block quantization formats such as NF4 or AF4 use either fixed codebooks or codebooks derived from marginal distributions (e.g., Gaussian quantiles on $[-1, 1]$), BOF4 explicitly adapts codebook levels to the empirical and block-size-dependent distribution of normalized coordinates:

  • NF4 is not information-theoretically optimal across all block sizes; at large $n$ it devotes substantial codebook mass to values that are seldom encountered ($\pm 1$) (Yoshida, 2023).
  • BOF4 solves for codebooks that minimize the actual reconstruction error (L1 or L2) given the empirical distribution of $X = w/M$ for each block size $B$.
    • For small $B$ (32, 64), BOF4 and NF4 perform similarly in mean absolute error and downstream LLM perplexity.
    • For large block sizes ($B \gg 64$), BOF4's adaptation to the central concentration yields 10–20% lower mean absolute error and consistently lower PPL in LLMs (Yoshida, 2023, Blumenberg et al., 10 May 2025).
    • For block size 64 (the default), BOF4 achieves marginal but consistent gains in perplexity relative to NF4/AF4 (Table 1 in Blumenberg et al., 10 May 2025).

BOF4–S is a variant where blocks are normalized to the signed absolute maximum (making the normalization factor positive or negative according to the sign of the true block max), improving the codepoint allocation for unimodal behavior and further lowering error.
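
A small sketch of the signed-absmax scale used by BOF4–S follows; the helper name and array layout are illustrative. With this convention the block extreme always normalizes to exactly $+1$.

```python
import numpy as np

def signed_absmax_scales(weights, block_size=64):
    """Per-block scale for BOF4-S: the element of largest magnitude with its
    sign retained, so that w_{b,i*} / m_b = +1 for the block extreme."""
    w = weights.reshape(-1, block_size)
    idx = np.argmax(np.abs(w), axis=1)                 # position of the block extreme
    return w[np.arange(w.shape[0]), idx][:, None]      # shape (num_blocks, 1), signed
```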

5. Outlier Robustness and Mixed-Precision Augmentation

Blockwise quantization is sensitive to outlier values: a single high-magnitude element in a block can force the scale factor $m_b$ upward, compressing the other elements toward zero.

BOF4 implementations incorporate several strategies for outlier handling:

  • Channel (row) permutation (“K-sort”): rearrange tensor rows or channels by norm to collect outlier-heavy blocks together, minimizing intra-block dynamic range and reducing quantization-induced error for non-outlier elements (Trukhanov et al., 29 Mar 2024).
    • Compile-time application only; no inference overhead.
  • Outlier-Preserving Quantization (OPQ): Detect and store outlier weights at full 16-bit precision, while quantizing non-outliers with BOF4–S. Outliers are defined per block as weights exceeding a specified quantile of the absolute max distribution (Blumenberg et al., 10 May 2025).
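
A sketch of the OPQ split is given below under a simplified, tensor-level quantile threshold; the cited work defines outliers per block relative to the absolute-maximum distribution, so the quantile value and helper name here are illustrative assumptions.

```python
import numpy as np

def opq_split(weights, outlier_quantile=0.999):
    """Separate high-magnitude outliers (kept in FP16) from inlier weights
    (to be quantized with BOF4-S). Returns the inlier tensor, the outlier
    mask, and the preserved FP16 outlier values."""
    thresh = np.quantile(np.abs(weights), outlier_quantile)
    outlier_mask = np.abs(weights) > thresh
    outliers_fp16 = weights[outlier_mask].astype(np.float16)  # stored sparsely alongside the codes
    inliers = np.where(outlier_mask, 0.0, weights)             # handed to the BOF4-S quantizer
    return inliers, outlier_mask, outliers_fp16
```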

In both approaches, empirical evaluation shows a significant reduction in mean squared quantization error and perplexity, with near-FP16 accuracy (within 0.3–1.5%) and nearly $2\times$ memory savings over 8-bit quantization.

6. Implementation and Deployment

The BOF4 family is designed for practical hardware and software integration:

  • Offline: Codebook and threshold computation is trivial (milliseconds), per block size, using closed-form updates or Lloyd’s.
  • Runtime encoding: Matching codebooks and threshold tables are looked up per block; quantization reduces to normalization, a binary search, and table lookup.
  • Decoding: Each block requires its scale $m_b$ and 4-bit code array; reconstruction is $m_b \cdot \hat x(\ell)$, efficiently implemented as a multiply–accumulate kernel.
  • Memory and compute efficiency: For $n = 64$, $b = 8$ exponent bits, and $m = 3$ mantissa bits, the cost is 4.125 bits/element, roughly $2\times$ savings over 8-bit integer and $4\times$ over FP16 (Trukhanov et al., 29 Mar 2024).
  • Hardware compatibility: Supports integer MAC units and fast barrel shifters; block-level layout is cache friendly; inference speed is indistinguishable from fixed-point/NF4 (Yoshida, 2023, Trukhanov et al., 29 Mar 2024).

The main deployment requirement is to ship a small lookup table of 16 floats and 15 thresholds per chosen block size.
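
A sketch of the per-block encode/decode path under these assumptions follows, with the 15 thresholds taken to be the decision boundaries between adjacent sorted codebook levels; the function names and the FP16-scale accounting in the trailing comment are illustrative, not from the cited kernels.

```python
import numpy as np

def encode_block(block, thresholds):
    """Normalize to the block absmax, then binary-search the 15 decision
    thresholds to obtain 4-bit level indices."""
    m = np.max(np.abs(block))
    x = block / m if m != 0 else block
    codes = np.searchsorted(thresholds, x).astype(np.uint8)   # index in 0..15
    return codes, m

def decode_block(codes, scale, codebook):
    """Table lookup plus a single multiply: w_hat = m_b * x_hat(code)."""
    return scale * codebook[codes]

# Storage accounting: 4 bits/element plus the shared per-block scale.
# With n = 64 and an 8-bit shared exponent (the BFP variant cited above):
#   4 + 8/64 = 4.125 bits/element.
# If the per-block absmax is instead stored in FP16: 4 + 16/64 = 4.25 bits/element.
```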

7. Empirical Performance on LLMs

BOF4 and its variants have been evaluated on large-scale LLMs (Llama-3.1 8B, Qwen-2.5 7B, Mistral 7B):

| Model (PPL) | NF4 | AF4 | BOF4 (MSE) | BOF4-S (MSE) | BOF4-S (MSE) + OPQ |
|---|---|---|---|---|---|
| Llama-3.1 8B | 8.53 | 8.51 | 8.51 | 8.46 | 8.43 |
| Qwen-2.5 7B | 9.89 | 9.91 | 9.94 | 9.88 | 9.83 |
| Mistral 7B | 8.90 | 8.90 | 8.89 | 8.88 | 8.87 |

BOF4–S (with MSE optimization) dominates or matches the baselines, and OPQ provides further improvement, especially for large $n$. These findings hold across both synthetic and natural network distributions.


Block-wise Optimal Float (BOF4) and its recent extensions represent the state of the art in 4-bit block-wise quantization of deep learning weights and activations, balancing memory footprint and computational efficiency against numerical accuracy, with variants delivering empirical results competitive with much higher-precision floating-point baselines (Soloveychik et al., 2022, Yoshida, 2023, Trukhanov et al., 29 Mar 2024, Blumenberg et al., 10 May 2025).
