AbnormalFloat-4 (AF4) Quantization for LLMs
- AF4 is a 4-bit block-wise quantization method that compresses LLM weights by mimicking floating-point codebooks with a fixed 16-entry design.
- It employs block-wise absolute-maximum normalization and a nearest-neighbor mapping strategy to minimize the mean absolute error of normalized weights.
- AF4 improves MAE over NF4 for small to medium block sizes, with small and model-dependent changes in language model perplexity, though it is limited in MSE and outlier handling compared to BOF4.
AbnormalFloat-4 (AF4) is a 4-bit, block-wise quantization method designed to compress network weights in LLMs by mimicking floating-point codebooks at low precision. Unlike typical IEEE floating-point formats, AF4 operates as a scalar quantizer leveraging block-wise absolute-maximum normalization, with specific codebook construction to minimize the mean absolute error (MAE) of normalized weights. AF4 was introduced by Yoshida and is extensively described and analyzed in Blumenberg et al., "Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations" (Blumenberg et al., 10 May 2025).
1. Quantization Procedure and Codebook Construction
AF4 divides a flattened tensor of weights into contiguous non-overlapping blocks of size $B$, yielding a set of blocks $\{\mathbf{w}_b\}$. For each block $\mathbf{w}_b$:
- Normalization: The block's quantization constant is $c_b = \max_i |w_{b,i}|$, and weights are normalized as $x_{b,i} = w_{b,i} / c_b$, ensuring $x_{b,i} \in [-1, 1]$.
- Codebook: The quantizer uses a codebook of 16 reconstruction levels, $\{\hat{x}_1, \dots, \hat{x}_{16}\}$, with three constraints: $\hat{x}_1 = -1$, one interior level $\hat{x}_k = 0$, and $\hat{x}_{16} = 1$. The remaining 13 levels are optimized to minimize $\mathbb{E}_{x \sim p}\,|x - Q(x)|$, with $p$ the PDF of block-normalized weights, and $Q$ the piecewise constant mapping encoded via nearest neighbor assignment.
- Quantization: Each normalized weight $x_{b,i}$ is mapped to the closest codebook entry: $q_{b,i} = \arg\min_j |x_{b,i} - \hat{x}_j|$. Encoding stores $q_{b,i}$ (4 bits) and the block scale $c_b$ (typically bfloat16).
- Dequantization: The reconstructed value is $\hat{w}_{b,i} = c_b\,\hat{x}_{q_{b,i}}$.
While a uniform-step rounding interpretation is possible as a rough approximation, AF4’s codebook is non-uniform, so exact quantization uses nearest-neighbor lookup.
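The nearest-neighbor rule can be sketched in NumPy as below. Because the codebook is sorted, the same mapping can also be expressed with decision thresholds at the midpoints between adjacent levels. The `CODEBOOK` array is a placeholder that only satisfies the three constraints (it contains -1, 0 and +1); it is not the published AF4 codebook, and the function names are chosen here for illustration.

```python
import numpy as np

# PLACEHOLDER codebook: 16 sorted levels containing -1, 0 and +1 exactly.
# These are NOT the published AF4 values; they only illustrate the lookup.
CODEBOOK = np.concatenate([
    -np.linspace(1.0, 0.0, 7, endpoint=False),  # 7 negative levels: -1 ... -1/7
    [0.0],                                      # exact zero
    np.linspace(0.0, 1.0, 9)[1:],               # 8 positive levels: 1/8 ... 1
])

def nearest_index_argmin(x, codebook):
    """Brute-force nearest neighbor: argmin_j |x - codebook[j]|."""
    x = np.asarray(x, dtype=np.float64)
    return np.abs(x[..., None] - codebook).argmin(axis=-1)

def nearest_index_thresholds(x, codebook):
    """Same mapping via thresholds at midpoints between sorted levels."""
    thresholds = (codebook[:-1] + codebook[1:]) / 2.0
    return np.searchsorted(thresholds, np.asarray(x, dtype=np.float64))

x = np.array([-1.0, -0.5, 0.01, 0.3, 1.0])
indices = nearest_index_thresholds(x, CODEBOOK)  # 4-bit codes in 0..15
reconstructed = CODEBOOK[indices]                # nearest level per input
```

The threshold form needs only 15 comparisons' worth of binary search per weight instead of 16 distance evaluations, which is one reason a fixed, sorted codebook is convenient in practice.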
2. Theoretical Formulation and Error Metric
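AF4’s codebook optimization is governed by minimization of the MAE on normalized weights $x$, under the implicit prior that the unnormalized weights are i.i.d. zero-mean Gaussian, $w \sim \mathcal{N}(0, \sigma^2)$. The density $p(x)$ of post-absmax normalized weights incorporates both a continuous part and point masses at the block extrema $x = \pm 1$: $p(x) = \frac{1}{2B}\,[\delta(x+1) + \delta(x-1)] + \frac{B-1}{B}\,p_{\mathrm{c}}(x)$, where $B$ is the block size and the continuous part $p_{\mathrm{c}}$ is derived by integrating over block-wise max statistics.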
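The point masses at the block extrema can be checked empirically. The following Monte Carlo sketch assumes i.i.d. standard Gaussian weights and plain absmax normalization; the function name and the sample sizes are illustrative choices. Exactly one weight per block attains the absolute maximum (ties have probability zero), so a fraction of about $1/B$ of all normalized weights should sit at $\pm 1$.

```python
import numpy as np

def sample_normalized_weights(num_blocks, block_size, seed=0):
    """Draw Gaussian blocks and normalize each by its absolute maximum."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((num_blocks, block_size))
    c = np.abs(w).max(axis=1, keepdims=True)  # block-wise absmax constant
    return w / c

block_size = 64
x = sample_normalized_weights(num_blocks=10_000, block_size=block_size)

# Share of normalized weights sitting exactly at -1 or +1 (point masses).
mass_at_extrema = np.mean(np.abs(x) == 1.0)
# By symmetry of the Gaussian, the mass splits roughly evenly between +1 and -1.
share_positive = np.mean(x == 1.0) / mass_at_extrema
```

A histogram of the remaining values approximates the continuous part of the density, which is what the 13 free codebook levels are fitted to.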
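In each cell $R_j = \{x : Q(x) = \hat{x}_j\}$, the MAE-optimal codebook entry $\hat{x}_j$ is the median of the normalized weights falling in that cell. A weighted median that uses the block maxima as weights would instead correspond to minimizing MAE in the original weight space, which AF4 does not do. The optimization only considers normalized weights $x$; the error in the original weight space $w$ is therefore suboptimal, because AF4’s codebook does not minimize MAE or MSE w.r.t. $w$: since $|w - \hat{w}| = c\,|x - \hat{x}|$ for block constant $c$, a weight-space objective must weight each term by $c$, and AF4 omits this weighting.
In distinction, the BOF4 family explicitly minimizes error metrics on the original weights $w$, including both MAE and mean squared error (MSE), by incorporating the block constant $c$ into the loss functions used for codebook construction (Blumenberg et al., 10 May 2025).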
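The difference between the two objectives can be illustrated for a single cell. The sketch below is not taken from either paper: `weighted_median` is a helper written for this example, and the cell contents are synthetic. With unit weights it returns a plain median (the normalized-space MAE minimizer); with the block constants as weights it returns a minimizer of the weight-space MAE $\sum_i c_i\,|x_i - \hat{x}|$.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

# Synthetic cell: normalized weights x and the absmax constant c of the
# block each one came from (hypothetical values for illustration only).
rng = np.random.default_rng(0)
cell_x = rng.uniform(0.2, 0.4, size=101)
cell_c = rng.uniform(0.01, 1.0, size=101)

# Normalized-space level: every sample counts equally.
level_normalized = weighted_median(cell_x, np.ones_like(cell_x))
# Weight-space level: samples from blocks with large c count more.
level_weight_space = weighted_median(cell_x, cell_c)

def weight_space_mae(level):
    """Sum of c * |x - level|, i.e. absolute error measured on w."""
    return np.sum(cell_c * np.abs(cell_x - level))
```

Evaluating `weight_space_mae` at both levels shows that the plain median can never do better than the weighted one on the weight-space objective, which is the gap the BOF4 formulation targets.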
3. Block Organization, Storage, and Decoding
Blocks are typically formed by flattening each model weight matrix in row-major order and splitting into runs of size $B$. This process yields $\lceil N / B \rceil$ blocks for a tensor with $N$ weights. Each block stores:
- A single quantization constant $c_b$ (bfloat16).
- An array of 4-bit indices (for each weight in the block).
Upon decoding, the original weight approximation for each entry is obtained by multiplying the block's stored $c_b$ by the corresponding codebook value indexed by the quantized code.
In practice, the same codebook $\{\hat{x}_j\}$ is used for all blocks; only the scale $c_b$ and quantized indices change block by block.
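The storage layout implies a simple cost model: 4 bits per index plus one 16-bit scale amortized over the block. The sketch below computes that overhead and shows one possible way to pack two 4-bit indices into a byte. The packing order (first index in the high nibble) is an assumption made for this example; the source does not specify a byte layout.

```python
import numpy as np

def bits_per_weight(block_size, scale_bits=16, index_bits=4):
    """Average storage per weight: index bits plus the amortized block scale."""
    return index_bits + scale_bits / block_size

def pack_nibbles(indices):
    """Pack an even-length array of 4-bit codes (0..15) into bytes.

    Assumed layout: even positions go to the high nibble, odd to the low.
    """
    idx = np.asarray(indices, dtype=np.uint8)
    return (idx[0::2] << 4) | idx[1::2]

def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.array([0, 15, 7, 8, 3, 12], dtype=np.uint8)
packed = pack_nibbles(codes)          # 3 bytes for 6 codes
restored = unpack_nibbles(packed)

overhead_64 = bits_per_weight(64)     # 4 + 16/64 = 4.25 bits per weight
overhead_32 = bits_per_weight(32)     # 4 + 16/32 = 4.5 bits per weight
```

Smaller blocks track local weight magnitudes more closely but pay a larger per-weight share of the scale, which is the trade-off behind the block-size comparisons in the next section.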
4. Empirical Behavior and Comparison to NF4 and BOF4
Quantization Error: AF4 reduces MAE compared to NormalFloat-4 (NF4), particularly for small to medium block sizes. However, since AF4's codebook is not optimized for MSE, its MSE exceeds that of BOF4 for larger block sizes. BOF4 and BOF4-S further reduce both MAE and MSE compared to AF4.
LLM Perplexity: Reported perplexities on standard benchmarks (lower is better):
- Llama-3.1 8B: NF4 8.53, AF4 8.51, BOF4-S (MSE) 8.46, BOF4-S+OPQ 8.43.
- Qwen-2.5 7B: NF4 9.89, AF4 9.91, BOF4-S+OPQ 9.83.
- Mistral-7B: NF4 8.90, AF4 8.90, BOF4-S+OPQ 8.87.
Failure Modes: AF4, like NF4, is susceptible to a collapse of non-outlier weights in blocks with extreme outliers due to absmax scaling. In such cases, most weights are forced into a narrow interval near zero, violating the codebook's design assumptions and resulting in degraded perplexity. The outlier-preserving quantization (OPQ) method addresses this by storing outlier weights in 16 bits, preserving the block's normalized weight distribution for better codebook applicability and improved downstream performance.
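The collapse can be reproduced on a toy block. The sketch below uses synthetic Gaussian weights with one injected extreme value; the magnitudes are arbitrary. The outlier rule at the end (flag weights larger than 10 times the block's median magnitude) is a hypothetical stand-in used only to show the effect of excluding outliers before scaling; it is not the detection criterion of OPQ.

```python
import numpy as np

rng = np.random.default_rng(0)
block = rng.standard_normal(64) * 0.02   # typical small weights (synthetic)
with_outlier = block.copy()
with_outlier[0] = 2.0                    # one extreme weight (arbitrary size)

def absmax_normalize(b):
    """Plain absmax normalization as used by AF4 and NF4."""
    return b / np.abs(b).max()

# Mean magnitude of the normalized non-outlier weights.
spread_clean = np.abs(absmax_normalize(block)[1:]).mean()
spread_collapsed = np.abs(absmax_normalize(with_outlier)[1:]).mean()

# HYPOTHETICAL outlier rule, not OPQ's actual criterion.
magnitudes = np.abs(with_outlier)
is_outlier = magnitudes > 10.0 * np.median(magnitudes)
inliers = with_outlier[~is_outlier]      # outliers would be kept in 16 bits
spread_restored = np.abs(inliers / np.abs(inliers).max()).mean()
```

With the outlier present, the remaining normalized weights crowd into a small neighborhood of zero and only a few central codebook levels are ever used; once the outlier is set aside, the normalized spread returns to roughly what the codebook was designed for.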
5. Pseudocode for Quantization and Reconstruction
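A concise, block-wise implementation of AF4 quantization and dequantization proceeds in four steps: split the flattened weights into blocks, compute each block's absmax scale, map each normalized weight to the index of its nearest codebook entry, and, on decoding, multiply the looked-up codebook value by the block scale.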
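The procedure from Sections 1 and 3 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the `CODEBOOK` values are placeholders that only respect the constraints (-1, 0 and +1 present), the function names are invented for this example, scales are kept in float32 because NumPy has no native bfloat16, and the tail of the tensor is zero-padded to a full block.

```python
import numpy as np

# PLACEHOLDER codebook with -1, 0 and +1; NOT the published AF4 values.
CODEBOOK = np.concatenate([
    -np.linspace(1.0, 0.0, 7, endpoint=False),
    [0.0],
    np.linspace(0.0, 1.0, 9)[1:],
])

def af4_style_quantize(weights, block_size=64, codebook=CODEBOOK):
    """Return (4-bit indices per block, per-block scales, original size)."""
    flat = np.asarray(weights, dtype=np.float32).ravel()
    pad = (-flat.size) % block_size                 # zero-pad the last block
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(scales == 0, 1.0, scales)       # avoid 0/0 on empty blocks
    normalized = blocks / safe                      # values in [-1, 1]
    # Nearest-neighbor lookup against the shared codebook.
    indices = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return indices.astype(np.uint8), scales.astype(np.float32), flat.size

def af4_style_dequantize(indices, scales, num_weights, codebook=CODEBOOK):
    """Reconstruct weights as block scale times the indexed codebook level."""
    return (codebook[indices] * scales).ravel()[:num_weights]

rng = np.random.default_rng(0)
w = (rng.standard_normal(1000) * 0.02).astype(np.float32)
idx, scl, n = af4_style_quantize(w)
w_hat = af4_style_dequantize(idx, scl, n)
mean_abs_error = np.mean(np.abs(w - w_hat))
```

Swapping in a different 16-entry codebook changes only the `CODEBOOK` array; the block scaling and lookup stay the same, which is what makes NF4, AF4 and BOF4-style codebooks directly comparable.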
6. Strengths, Limitations, and Subsequent Developments
AF4 is designed as a 4-bit block-wise quantizer that:
- Strictly enforces exact representation for $-1$, $0$, and $+1$.
- Minimizes the MAE of normalized weights under a Gaussian prior.
- Shares a fixed codebook across all blocks.
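Advantages: Lower MAE than NF4 for small to medium block sizes, with perplexity that is on par with or marginally better than NF4 depending on the model (8.51 vs. 8.53 on Llama-3.1 8B, but 9.91 vs. 9.89 on Qwen-2.5 7B); simplicity and a fixed codebook facilitate efficient implementation.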
Limitations: Suboptimal in MSE and accuracy when block-wise outlier weights collapse most entries toward zero. Not robust to rare extreme weights due to reliance on plain absmax normalization and codebook mismatches.
Developments Beyond AF4: The BOF4 family, as analyzed by Blumenberg et al., generalizes AF4 by minimizing error metrics on the original weights $w$ directly and by incorporating improved scaling and outlier handling, empirically outperforming AF4 in both quantization error and language modeling perplexity (Blumenberg et al., 10 May 2025). A plausible implication is that further gains in low-bit quantization depend on minimizing the error metric end to end (as in BOF4) and on outlier-aware strategies (like OPQ).