4-bit NormalFloat (NF4) Quantization

Updated 18 August 2025
  • 4-bit NormalFloat (NF4) quantization is a low-precision representation method that allocates 16 code levels based on a normal distribution to reduce quantization error.
  • The approach employs block-wise absmax normalization to rescale weights, enabling efficient inference, training, and finetuning in large language and vision models.
  • Empirical benchmarks show NF4 achieves near-baseline accuracy with speedups up to 8.1x, balancing model size, computational efficiency, and robustness.

A 4-bit NormalFloat (NF4) quantization scheme is a low-precision quantization approach for neural networks, designed to enable highly efficient deployment and training of deep models by leveraging floating-point–like representations tailored to the statistical properties of network parameters. In contrast to standard fixed-point or integer quantization that uniformly divides the value space, NF4 allocates its representational capacity according to the empirical distribution of weights—typically the normal distribution—to minimize quantization error. This approach is now foundational to many modern techniques for compressing and accelerating large models, especially in LLMs, vision architectures, and diffusion models.

1. NF4 Principles and Mathematical Construction

The defining property of NF4 is its construction: a 4-bit numeric type whose quantization bins are placed to be “information-theoretically optimal” for normally distributed values. Rather than using uniform spacing (as in INT4) or fixed exponent/mantissa partitioning (as in FP4), NF4 places its 16 codebook levels so that each bin under a standard normal distribution $N(0,1)$ contains equal probability mass (Dettmers et al., 2023).

Mathematically, if $Q(\cdot)$ denotes the quantile function (inverse cumulative distribution function) of $N(0,1)$, the bin representative $q_i$ (for $i = 0, \ldots, 15$) is defined as

$$q_i = \frac{1}{2}\left[Q\!\left(\frac{i}{17}\right) + Q\!\left(\frac{i+1}{17}\right)\right],$$

where the division by $17$ gives 17 boundaries for 16 bins. The codebook is then typically rescaled to fit the normalized range $[-1, 1]$, and constructed to include an exact zero representation for sparsity or masking.
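
For concreteness, the sketch below builds such a codebook with NumPy and SciPy. The symmetric boundary grid and the `tail_mass` parameter are simplifications introduced here to keep the tail quantiles finite; the released QLoRA codebook uses an asymmetric offset and constructs the two halves separately so that an exact zero is among the 16 levels.

```python
import numpy as np
from scipy.stats import norm

def nf4_levels(tail_mass: float = 0.005) -> np.ndarray:
    """Quantile-midpoint construction of a 16-level NF4-style codebook.

    The exact tail probabilities 0 and 1 map to -inf/+inf under the normal
    quantile function Q, so this sketch pulls the boundary grid inward by
    `tail_mass` (an assumption of the illustration; the released QLoRA
    codebook uses an asymmetric offset and builds the two halves separately
    so that an exact zero is among the 16 levels).
    """
    # 17 boundary probabilities for the 16 bins.
    p = np.linspace(tail_mass, 1.0 - tail_mass, 17)
    boundaries = norm.ppf(p)                            # Q(p) at each boundary
    levels = 0.5 * (boundaries[:-1] + boundaries[1:])   # q_i = midpoint of adjacent quantiles
    return levels / np.abs(levels).max()                # rescale into [-1, 1]

print(np.round(nf4_levels(), 4))
```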

2. Block-wise Quantization: Absmax Normalization and NF4

In practice, quantization is applied block-wise (e.g., per row or per group of 64 or 128 values) to both limit quantization error caused by outliers and improve hardware efficiency (Dettmers et al., 2022, Yoshida, 2023, Blumenberg et al., 10 May 2025). Within each block, weights are normalized by the block's maximum absolute value (“absmax”):

$$M = \max_j |w_j|; \qquad x_j = \frac{w_j}{M}.$$

NF4 quantizes each $x_j$ to the nearest codebook value $q_k$, and reconstructs $w_j \approx M \cdot q_k$ during dequantization.
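
A minimal NumPy sketch of this block-wise absmax procedure, assuming a flat weight vector whose length is a multiple of the block size, is given below; the uint8 index storage and the absence of 4-bit packing and double quantization are simplifications of the illustration rather than the layout used by production kernels.

```python
import numpy as np

def quantize_blockwise(w: np.ndarray, codebook: np.ndarray, block_size: int = 64):
    """Block-wise absmax quantization of a flat weight vector against a fixed codebook.

    Sketch only: assumes w.size is a multiple of block_size; production kernels
    additionally pack two 4-bit indices per byte and apply double quantization
    to the per-block scales.
    """
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # per-block absmax M
    scales = np.where(scales == 0, 1.0, scales)          # guard all-zero blocks
    x = blocks / scales                                   # x_j = w_j / M, in [-1, 1]
    # Nearest codebook entry q_k for every normalized value x_j.
    idx = np.abs(x[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, scales

def dequantize_blockwise(idx: np.ndarray, scales: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct w_j ≈ M · q_k from stored indices and per-block scales."""
    return (codebook[idx] * scales).ravel()
```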

While originally considered information-theoretically optimal under the assumption of i.i.d. normal weights (Dettmers et al., 2023), further analysis demonstrated that the normalization procedure induces a block size–dependent distribution of $x_j$, and that the Gaussian-quantile codebook is not strictly optimal for all block sizes and statistical regimes (Yoshida, 2023, Blumenberg et al., 10 May 2025).

Alternatives to NF4: BOF4, AF4, and Outlier Handling

Recent techniques such as block-wise optimal float (BOF4) (Blumenberg et al., 10 May 2025) use expectation–maximization to minimize the true quantization error ($\mathrm{MSE}$ or $\mathrm{MAE}$) of the original weights rather than only the normalized ones. BOF4 and its signed-absmax variant (BOF4-S) may yield lower reconstruction error and perplexity on some language modeling benchmarks. Similarly, AF4 optimizes the expected $L_1$ loss for the true post-normalization distribution, further addressing mismatches between assumed and actual data distributions.
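
For intuition, the sketch below shows a generic Lloyd-Max (EM-style) refinement of a one-dimensional codebook under an MSE objective. It is not the BOF4 algorithm itself, which measures error against the original pre-normalization weights and models the absmax-induced distribution, but it illustrates the alternating assignment/update loop that such methods build on.

```python
import numpy as np

def fit_codebook_mse(samples: np.ndarray, init_codebook: np.ndarray, iters: int = 50) -> np.ndarray:
    """Generic Lloyd-Max (EM-style) refinement of a 1-D codebook under MSE.

    Alternates nearest-level assignment with centroid updates on a sample of
    weights. This is a textbook procedure, not the BOF4 algorithm itself.
    """
    codebook = np.sort(init_codebook.astype(np.float64))
    for _ in range(iters):
        # E-step: assign each sample to its nearest codebook level.
        assign = np.abs(samples[:, None] - codebook).argmin(axis=1)
        # M-step: move each level to the mean of the samples assigned to it.
        for k in range(codebook.size):
            members = samples[assign == k]
            if members.size:
                codebook[k] = members.mean()
    return np.sort(codebook)
```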

An additional refinement, outlier-preserving quantization (OPQ), stores outlier weights in high precision and applies NF4 or BOF4 only to the majority, thus mitigating distortion of the scale caused by heavy-tailed weights (Blumenberg et al., 10 May 2025).
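
The idea can be sketched as a simple preprocessing step. The global magnitude-quantile threshold used below is an assumption of this illustration rather than the selection rule in the cited work, but it conveys how extreme weights are routed around the 4-bit path so they no longer inflate the per-block scale.

```python
import numpy as np

def split_outliers(w: np.ndarray, q: float = 0.999):
    """Sketch of outlier-preserving preprocessing before 4-bit quantization.

    Weights whose magnitude exceeds a global quantile threshold (the threshold
    rule is an assumption of this sketch) are stored separately in higher
    precision; only the remaining bulk values are passed to the NF4/BOF4
    block-wise quantizer, so extreme values no longer set the absmax scale.
    """
    threshold = np.quantile(np.abs(w), q)
    outlier_mask = np.abs(w) > threshold
    outlier_idx = np.flatnonzero(outlier_mask)           # positions kept in high precision
    outlier_vals = w[outlier_mask].astype(np.float16)    # sparse high-precision store
    bulk = np.where(outlier_mask, 0.0, w)                # dense tensor sent to the 4-bit path
    return bulk, outlier_idx, outlier_vals
```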

3. Applications: Inference, Finetuning, and Training

NF4 and its relatives are widely used in both post-training and quantization-aware (or fully quantized) training regimes:

  • LLM Inference and Finetuning: QLoRA, a memory-efficient finetuning framework, leverages NF4 to quantize the large frozen backbone model to 4 bits, enabling recovery of full 16-bit finetuning performance via Low-Rank Adapters (LoRA) even on multi-billion–parameter models. This is achieved with a codebook designed for normal weights and double quantization of the scale parameters (Dettmers et al., 2023); a configuration sketch follows this list.
  • Block-wise Quantization Scaling Laws: Large-scale experiments show that, for fixed total model bits, 4-bit quantization is nearly universally optimal—higher bitwidths waste capacity, and lower bitwidths introduce excessive error except with custom techniques (Dettmers et al., 2022).
  • Quantized Training: In both forward and backward passes, 4-bit quantization is viable when paired with stochastic rounding and strategies such as logarithmic unbiased quantization (LUQ) for unbiased gradient estimation (Chmiel et al., 2021); a stochastic-rounding sketch also follows this list. Fully quantized training with block-wise FP4 (e.g., E2M1 with a shared higher-precision scale, as in NVFP4) enables end-to-end training of LLMs close to bfloat16 baseline performance, with theoretical thresholds delineating when quantized training “stalls” due to accumulated quantization noise (Chmiel et al., 25 May 2025).
  • Diffusion and Vision Models: Extensions of NF4 and related FP4 quantization also demonstrate effective performance in highly sensitive architectures such as diffusion models, provided weight rounding is learned and block scaling/bias are optimized (Chen et al., 13 Aug 2024, Li et al., 7 Nov 2024).
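
As a practical illustration of the QLoRA-style setup referenced in the first bullet, the snippet below loads a causal language model with NF4 weights and double quantization through the Hugging Face transformers/bitsandbytes integration; the checkpoint name is a placeholder and flag defaults may vary across library versions. LoRA adapters (e.g., via the peft library) would then be attached on top of the frozen 4-bit backbone.

```python
# Placeholder checkpoint name; flag defaults may differ across library versions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 codebook
    bnb_4bit_use_double_quant=True,        # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls run in bf16 after dequantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model id
    quantization_config=bnb_config,
)
```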
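To illustrate the unbiased-rounding ingredient mentioned in the quantized-training bullet, the following sketch applies stochastic rounding onto a uniform grid. It is a generic illustration rather than LUQ itself, which additionally uses logarithmic levels for gradients.

```python
import numpy as np

def stochastic_round_to_grid(x: np.ndarray, step: float, rng=None) -> np.ndarray:
    """Stochastic rounding onto a uniform grid with spacing `step`.

    Each value rounds up with probability equal to its fractional position
    between the two neighboring grid points, so the rounding error is zero
    in expectation. The uniform grid is a simplification of this sketch;
    LUQ itself uses logarithmic levels for gradients.
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = x / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                       # fractional distance above the lower grid point
    round_up = rng.random(size=x.shape) < prob_up  # Bernoulli draw per element
    return (lower + round_up) * step
```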

4. Hardware and Efficiency Considerations

NF4’s design is intrinsically hardware friendly:

  • The fixed codebook can be implemented as a compact lookup table per block.
  • Absmax scaling is well-suited to modern hardware with dedicated 4-bit support.
  • Group- or block-wise organization (with block sizes 16–128) minimizes the impact of outliers and enables fast, parallel kernel operations (Dettmers et al., 2022, Dettmers et al., 2023, Blumenberg et al., 10 May 2025).
  • Decompression-free custom arithmetic units for NF4 or SDR-based quantization (as in QRazor) can directly perform low-bit computations on compressed data, reducing area and power (Lee et al., 23 Jan 2025).

In practice, inference speedup and energy savings are realized due to reduced memory bandwidth and the quadratic/linear scaling of multiplier logic with bit precision (Abdolrashidi et al., 2021). On some tasks and hardware, speedups of up to $8.1\times$ have been reported for 4-bit quantization schemes compared to full-precision inference (Zhou et al., 2023).

5. Accuracy, Trade-offs, and Limitations

Empirical results consistently show that 4-bit NF4 quantization delivers near-baseline performance in standard LLM and vision settings, with minimal loss under typical workloads. QLoRA, for instance, matches 16-bit finetuning baselines on text generation; block-wise NF4 and its optimized variants are consistently favored by model-size–to–accuracy scaling laws (Dettmers et al., 2022, Dettmers et al., 2023, Blumenberg et al., 10 May 2025). Integration with LoRA or SVD-based hybrid quantization increases robustness to outliers and adverse task conditions.

However, recent systematic evaluations reveal that in certain long-context or multilingual settings, aggressive 4-bit quantization may incur accuracy drops of up to 59%, with nontrivial model and task dependence (Mekala et al., 26 May 2025). Limiting factors include:

  • Sensitivity to block size and distribution: As block size increases, the effective probability mass in codebook endpoints shrinks, distorting the quantization error profile (Yoshida, 2023, Blumenberg et al., 10 May 2025).
  • Reduced robustness on long-context or non-English tasks.
  • Additional design and tuning complexity for models with severe outlier statistics or non-normal parameter distributions.

Table: Representative Accuracy Degradation for NF4/4-bit Quantization

Scenario | Reported Degradation | Source
ResNet-50 (ImageNet, QAT, NF4/4-bit) | ≤ 0.32% | (Chmiel et al., 2021)
LLM zero-shot (scaling laws, 4-bit) | < 1–2% | (Dettmers et al., 2022)
QLoRA finetuning, LLM | ≈ 0% (matches full precision) | (Dettmers et al., 2023)
Long-context retrieval (LLM, 4-bit) | up to 59% | (Mekala et al., 26 May 2025)

6. Alternatives and Emerging Developments

While NF4 is well-established, recent developments investigate adaptive, learned, or hybrid quantization:

  • The any4 method learns per-row quantizer reproduction values from weight/activation statistics, leading to better perplexity and accuracy compared to fixed-codebook NF4, especially on language modeling tasks (Elhoushi et al., 7 Jul 2025).
  • BOF4 and BOF4-S explicitly minimize true reconstruction error; outlier-preserving quantization handles heavy-tailed distributions with selective high-precision storage (Blumenberg et al., 10 May 2025).
  • Innovations in calibration, such as single-sample E[|x|] estimation (Elhoushi et al., 7 Jul 2025), and plug-in LayerNorm adjustments that align quantized activation distributions (Li et al., 2023) further enhance the accuracy and practical usability of low-bit quantization.

Continued research is extending these methods to robust, ultra-low-bitwidth quantization (<4 bits), efficient training (sub-8-bit integer or FP4 training), and variable-precision hybrid approaches that maximize speedup and minimize accuracy loss under task- and hardware-specific constraints.

7. Summary and Research Outlook

4-bit NormalFloat (NF4) quantization provides a powerful, data-adaptive approach to low-precision neural network deployment, with a well-motivated information-theoretic foundation and substantial empirical support for its effectiveness in both inference and training. While some limitations arise in certain long-context or non-English benchmarks, the majority of research indicates that NF4-based quantization methods strike a near-optimal trade-off between model size, efficiency, and accuracy for a wide range of architectures and deployment environments. Current state-of-the-art methods refine NF4 through blockwise optimization, learned codebooks, outlier handling, and bespoke hardware alignment, while NF4 itself remains an essential baseline for both large and compact models in contemporary deep learning practice.
