NF4: 4-bit NormalFloat in Neural Quantization
- NF4 is a 4-bit quantization format that uses a 16-level, normal quantile-based codebook to efficiently represent neural network weights.
- It employs blockwise absmax normalization to scale weights and minimize quantization error while reducing memory overhead.
- Ongoing research highlights NF4's practical performance and motivates adaptive variants like AF4–B and BOF4 for further error reduction.
A 4-bit NormalFloat (NF4) is a low-bit quantization format specifically constructed for efficient neural-network weight representation, particularly in LLMs. NF4 employs a 16-level codebook based on the quantiles of a standard normal distribution and utilizes a blockwise absolute-maximum (absmax) normalization strategy. The format was introduced in the context of QLoRA, enabling 4-bit quantized finetuning of models up to 65B parameters on a single GPU, and has since become foundational in block-quantization pipelines for both inference and finetuning. Ongoing research critiques its theoretical optimality and offers data-driven variants that further minimize quantization error.
1. NF4 Encoding, Codebook Construction, and Blockwise Quantization
NF4 represents each real-valued parameter $w_i$ in a contiguous block of size $B$ (typically $B = 64$) using 4 bits, mapping it to one of 16 codepoints $c_j$ shared across all blocks. The standard quantization pipeline comprises the following steps (a code sketch follows the list):
- Blockwise scale computation: $s = \max_i |w_i|$ over the block.
- Normalization: $\tilde{w}_i = w_i / s \in [-1, 1]$.
- Assignment: $q_i = \arg\min_j |\tilde{w}_i - c_j|$, the index of the nearest codepoint.
- Storage: a 4-bit index $q_i$ per entry, plus one higher-precision scale $s$ per block.
- Dequantization: $\hat{w}_i = s \cdot c_{q_i}$.
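A minimal NumPy sketch of this pipeline, assuming the flattened weight tensor's length is a multiple of the block size and that `codebook` is a sorted array of the 16 NF4 codepoints (e.g., the constants shipped with bitsandbytes):

```python
import numpy as np

def quantize_blockwise(weights, codebook, block_size=64):
    """Blockwise absmax quantization: 4-bit indices plus one scale per block."""
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)     # s = max_i |w_i|
    scales = np.where(scales == 0, 1.0, scales)       # guard all-zero blocks
    normed = w / scales                               # w~_i in [-1, 1]
    idx = np.abs(normed[..., None] - codebook).argmin(axis=-1)  # nearest codepoint
    return idx.astype(np.uint8), scales

def dequantize_blockwise(idx, scales, codebook):
    """Dequantization: w_hat_i = s * c[q_i]."""
    return codebook[idx] * scales
```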
NF4 codebook centroids are precomputed by partitioning the standard normal distribution into 16 equiprobable regions. The endpoints are fixed at $-1$ and $+1$ to ensure exact representation of the block maxima, and one codepoint is reserved for exact zero. The remaining codepoints are calculated as averages of adjacent quantiles of equally spaced cumulative probabilities, then renormalized to $[-1, 1]$ (Dettmers et al., 2023).
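A sketch of this quantile construction follows, assuming SciPy for the normal quantile function; the offset constant mirrors the QLoRA reference description, and production libraries may hard-code slightly different values:

```python
import numpy as np
from scipy.stats import norm

def nf4_codebook(offset=0.9677083):
    """Quantile-based 16-level codebook: 8 positive levels, 7 negative levels,
    and an exact zero, renormalized so the endpoints are exactly -1 and +1."""
    pos = norm.ppf(np.linspace(offset, 0.5, 9))[:-1]    # 8 positive quantiles
    neg = -norm.ppf(np.linspace(offset, 0.5, 8))[:-1]   # 7 negative quantiles
    code = np.sort(np.concatenate([neg, [0.0], pos]))
    return code / np.abs(code).max()                    # renormalize to [-1, 1]
```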
NF4 itself stores only a 4-bit index into this codebook per weight. For contrast, the IEEE-like FP4 (E2M1) format discussed in Section 4 splits its 4 bits into explicit fields:
| Bits | Description | Representative Values |
|---|---|---|
| 1 | Sign | + / − |
| 2 | Exponent (with bias) | -1, 0, 1, 2 |
| 1 | Mantissa | 1.0, 1.5 (or subnormal 0) |
In this layout, the codeword with sign = 0, exp = 0, and mant = 0 encodes exact zero; NF4 likewise dedicates one of its 16 codebook entries to an exact zero (Blumenberg et al., 10 May 2025).
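For illustration, the values representable under such a 1-2-1 bit split can be enumerated as below; the bias and subnormal handling are assumptions chosen to match the table above, and actual FP4 conventions vary by implementation:

```python
def fp4_e2m1_values(bias=1):
    """Enumerate the values of a 4-bit sign/exponent/mantissa layout (1-2-1 split)."""
    values = set()
    for sign in (+1, -1):
        for exp in range(4):            # 2 exponent bits
            for mant in range(2):       # 1 mantissa bit
                if exp == 0:            # subnormal: no implicit leading 1
                    mag = mant * 0.5 * 2 ** (1 - bias)
                else:
                    mag = (1 + 0.5 * mant) * 2 ** (exp - bias)
                values.add(sign * mag)
    return sorted(values)

print(fp4_e2m1_values())  # includes exact 0 (sign=0, exp=0, mant=0) and values up to +/-6 with bias=1
```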
2. Information-Theoretic Motivation and Block Size Dependence
NF4 was motivated by the argument that, for i.i.d. Gaussian weights, partitioning the real line into equal-probability intervals (using normal quantiles) would maximize entropy and minimize MSE. However, absmax block normalization changes the effective distribution of values to be quantized; normalized weights are no longer i.i.d. Gaussian but are instead increasingly concentrated around zero as the block size $B$ increases. This means the original "information-theoretic optimality" argument for the NF4 codebook does not precisely hold outside the scalar case (Yoshida, 2023).
An alternative is the $L_1$-optimal "AF4–B" codebook, which computes codepoints by minimizing the expected absolute reconstruction error for the actual distribution of absmax-normalized weights within a block of size $B$. At small block sizes (e.g., $B = 64$), both codebooks perform nearly identically, while for very large $B$ the $L_1$-optimized variant is measurably superior (Yoshida, 2023); a Monte-Carlo sketch of this construction follows the table below.
| Block Size | Optimality of NF4 | Advantage of AF4–B |
|---|---|---|
| Small (e.g., 64) | Matches AF4–B | None |
| Very large | Suboptimal | AF4–B yields lower error |
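The following Monte-Carlo sketch illustrates how such a block-size-adaptive, absolute-error-minimizing codebook can be derived empirically. It is an illustration of the idea behind AF4–B under assumed i.i.d. N(0,1) weights and a Lloyd-style alternation with medians, not the published codepoints:

```python
import numpy as np

def l1_optimal_codebook(block_size, n_blocks=2000, n_iters=50, seed=0):
    """Monte-Carlo sketch of a block-size-adaptive, L1-optimal 16-level codebook.
    Assumes i.i.d. N(0,1) weights and absmax normalization per block; reduce
    n_blocks for very large block sizes to limit memory use."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_blocks, block_size))
    x = (w / np.abs(w).max(axis=1, keepdims=True)).ravel()  # normalized samples
    code = np.linspace(-1.0, 1.0, 16)                       # initial codebook
    for _ in range(n_iters):
        assign = np.abs(x[:, None] - code).argmin(axis=1)   # nearest-codepoint cells
        for j in range(1, 15):                              # keep +/-1 fixed (absmax exactness)
            cell = x[assign == j]
            if cell.size:
                code[j] = np.median(cell)                   # median minimizes L1 error per cell
        code = np.sort(code)
    return code
```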
3. Quantization Error Behavior and Empirical Observations
NF4 does not satisfy the Lloyd–Max centroid conditions for MSE-minimal quantization. Table 2 of (Blumenberg et al., 10 May 2025) demonstrates that, for blockwise absmax normalization of weights sampled from a standard normal distribution, NF4's MAE and MSE exceed those of both the absolute-error-optimized AF4 and the MSE-optimized "BOF4" codebooks, with BOF4 achieving 12% lower MSE.
Empirical language modeling results (on LLaMA-7B/13B and others) at typical block sizes (e.g., 64) indicate negligible differences in perplexity between NF4 and the optimized alternatives. However, as block size increases, the quantization-error advantage of block-optimal designs (e.g., BOF4, AF4–B) translates into slightly better perplexity (Yoshida, 2023, Blumenberg et al., 10 May 2025).
| Scheme | MAE | MSE | WikiText-2 Perplexity Degradation |
|---|---|---|---|
| NF4 | Baseline | Baseline | Baseline |
| BOF4-S | Lower than NF4 | Lower than NF4 | 0.07 |
A plausible implication is that for ultra-large blocks (thousands of weights per block), switching to a block-size-adaptive codebook is desirable for minimizing quantization artifacts.
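Such error comparisons are straightforward to reproduce on synthetic weights. The sketch below estimates MAE and MSE for any 16-level codebook under blockwise absmax quantization, assuming i.i.d. N(0,1) weights and reusing the earlier codebook sketches for illustration:

```python
import numpy as np

def blockwise_quant_error(codebook, block_size=64, n_blocks=4000, seed=1):
    """Estimate MAE and MSE of a 16-level codebook under blockwise absmax
    quantization of i.i.d. N(0,1) weights (the synthetic setting discussed above)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_blocks, block_size))
    s = np.abs(w).max(axis=1, keepdims=True)
    idx = np.abs((w / s)[..., None] - codebook).argmin(axis=-1)
    err = codebook[idx] * s - w                      # reconstruction error
    return np.abs(err).mean(), (err ** 2).mean()

# Example usage with the earlier sketches:
# mae_nf4, mse_nf4 = blockwise_quant_error(nf4_codebook())
# mae_alt, mse_alt = blockwise_quant_error(l1_optimal_codebook(block_size=64))
```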
4. Comparisons to Alternative 4-bit Schemes
Compared to standard Int4 and low-bit IEEE-like FP4 variants, NF4 offers:
- Superior near-zero quantization resolution due to nonuniform bin placements.
- Exact representation of block maxima/minima and zero.
- Narrower representable range: tails are clipped more aggressively than in int-based or float-based schemes.
Empirical results show that NF4 achieves lower perplexity than Int4 and FP4 on benchmark datasets, both under static quantization and after adapter-based finetuning on benchmarks such as GLUE and MMLU, recovering full 16-bit finetuning performance (Dettmers et al., 2023). This underpins its widespread adoption in QLoRA and related LLM pipelines.
| Format | Value Range (per block) | Zero-shot PPL (Pile-CC) | Key Drawback |
|---|---|---|---|
| Int4 | 16 uniformly spaced levels | 34.34 | Coarse near zero |
| FP4 (E2M1) | Nonuniform sign/exponent/mantissa levels | 31.07 | Coarse quantization |
| FP4 (E3M0) | Power-of-two magnitudes only | 29.48 | No fractional bits |
| NF4 | 16 normal-quantile codebook values in [-1, 1] | 27.41 | Aggressive tail clipping |
5. Double Quantization, Outlier Handling, and Implementation
NF4 in QLoRA further reduces memory via "double quantization": the per-block scales are themselves quantized to FP8 in secondary blocks of, e.g., 256 scales per block, reducing scale storage from 0.5 bits/weight (FP32 scales at block size 64) to roughly 0.127 bits/weight (Dettmers et al., 2023). All main MAC operations are performed in higher precision (BF16/FP16) after dequantization.
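The memory arithmetic behind double quantization can be made explicit with a small sketch; the constants follow the accounting reported in the QLoRA paper:

```python
def scale_bits_per_weight(block_size=64, double_quant=True, secondary_block=256):
    """Per-weight storage cost of the block scales: FP32 scales cost 32/64 = 0.5
    bits/weight at block size 64; with double quantization (FP8 scales plus one
    FP32 secondary scale per 256 primary scales) the cost drops to
    8/64 + 32/(64*256) ~= 0.127 bits/weight."""
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * secondary_block)

print(scale_bits_per_weight(double_quant=False))  # 0.5
print(round(scale_bits_per_weight(), 3))          # 0.127
```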
BOF4-S and OPQ (Blumenberg et al., 10 May 2025) build directly on NF4 principles. BOF4 uses Lloyd/EM-style algorithms to optimize the codebook for minimal end-to-end quantization error, and signed-absmax normalization (BOF4-S) further reduces error by fixing only the positive endpoint. OPQ (outlier-preserving quantization) stores rare outliers unquantized (e.g., 1% in BF16), mitigating the bias that extreme values induce on the block scale. This combination further lowers perplexity degradation at negligible memory cost.
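A minimal sketch of the outlier-splitting step, assuming a simple global top-k magnitude threshold (the published OPQ selection procedure may differ in detail):

```python
import numpy as np

def split_outliers(weights, outlier_frac=0.01):
    """Keep the largest-magnitude fraction of weights in high precision and
    quantize the rest blockwise as usual."""
    w = weights.ravel()
    k = max(1, int(outlier_frac * w.size))
    threshold = np.partition(np.abs(w), -k)[-k]       # k-th largest magnitude
    outlier_mask = np.abs(w) >= threshold
    inliers = np.where(outlier_mask, 0.0, w)          # to be NF4/BOF4-quantized
    outliers = w[outlier_mask]                        # stored unquantized (e.g., BF16)
    return inliers, outliers, outlier_mask
```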
6. Deployment Guidelines and Practical Recommendations
NF4 is recommended at small to moderate block sizes (64 is the default in QLoRA and related toolchains), balancing implementation simplicity, hardware compatibility, and negligible loss relative to theoretically optimal alternatives. For quantizing very large blocks, block-tailored codebooks (AF4–B, BOF4) are preferred (Yoshida, 2023, Blumenberg et al., 10 May 2025).
Best practices for high-fidelity quantization in LLMs:
- Use block size 64 for weights; store scales in FP8 with double quantization.
- If aiming for minimal quantization error, apply BOF4-S (MSE-optimized) and enable OPQ with a small outlier fraction (e.g., 1% of weights kept in BF16).
- Avoid codebooks constructed to equalize per-bin probabilities (maximum-entropy designs); this empirically degrades performance (Yoshida, 2023).
- Use standard NF4 for compatibility and performance in standard QLoRA pipelines unless specifically targeting ultra-large blocks; a minimal configuration sketch follows this list.
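One common way to apply these defaults, assuming the Hugging Face transformers + bitsandbytes stack (the model identifier is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 weights with double-quantized scales and BF16 compute, per the list above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 codebook (alternative: "fp4")
    bnb_4bit_use_double_quant=True,         # quantize the block scales as well
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```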
7. Limitations, Ongoing Work, and Theoretical Considerations
NF4 is not information-theoretically optimal for blockwise quantization due to the non-i.i.d. nature of normalized weights post block scaling. Scalar quantization theory (Lloyd–Max) prescribes codebooks adapted to the empirical distribution of normalized weights, motivating recent research into data-driven and block-size-adaptive codebooks. Outlier-aware and mixed-precision quantization schemes (such as OPQ) exploit neural weight distributions to robustly quantize even when extreme values are present.
These developments reflect a trend toward increasingly sophisticated 4-bit quantization strategies, optimizing for a variety of criteria (MSE, MAE, outlier preservation) and deployment constraints (memory, compute, calibration-free operation) in the context of large-scale neural networks (Dettmers et al., 2023, Yoshida, 2023, Blumenberg et al., 10 May 2025).
Principal references:
- "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- "NF4 Isn't Information Theoretically Optimal (and that's Good)" (Yoshida, 2023)
- "Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations" (Blumenberg et al., 10 May 2025)