
4-Bit Model Quantization

Updated 21 December 2025
  • 4-bit quantization is a process that maps neural network weights, activations, and gradients into 4-bit representations, balancing compression and accuracy.
  • Advanced techniques like per-channel symmetric quantization, FP4, and non-uniform companding enable efficient deployment in LLMs, CNNs, and diffusion models.
  • State-of-the-art methods achieve up to 3.7× compute gains and 8× model compression while typically keeping the accuracy drop below 1–2%.

4-bit quantization is the mapping of neural network weights, activations, and (in training scenarios) gradients into representations with only 4 bits per value, to optimize memory footprint and computational efficiency with minimal loss in inference or training accuracy. This precision is now regarded as a universal “sweet spot” for deep networks across domains—enabling at least 8× model compression and up to 3.7× compute throughput gains versus full precision—while typically incurring <1–2% accuracy drop even for the largest models. Research has converged around both per-channel symmetric quantization and advanced non-uniform codebook, floating-point, and companding methods, enabling state-of-the-art performance in LLMs, convolutional nets, diffusion models, RNNs, and transformers.

1. Quantization Algorithms and Numeric Formats

The core task is to map floating-point values $x \in \mathbb{R}$ to quantized codes $q \in \{-8, \ldots, 7\}$ (for signed integers) or to quantized floating-point codes, according to various schemes:

  • Uniform/Linear Integer Quantization: For a 4-bit signed integer, with the scale $s$ chosen per tensor/channel/group,

$$q = \text{clip}\left(\left\lfloor \frac{x}{s} \right\rceil,\, -8,\, 7\right), \quad \hat{x} = s \cdot q$$

Per-channel, per-group, or per-tensor $s$ is standard (Abdolrashidi et al., 2021, Zhang et al., 14 Jun 2024, Lee et al., 23 Jan 2025); a minimal NumPy sketch of this format (and of FP4 below) follows this list.

  • Floating-Point (FP4) Quantization: 4 bits are split among sign/exponent/mantissa, e.g., E2M1 (2 exponent, 1 mantissa), allowing

$$Q(x) = \text{sign}(x)\cdot 2^{\,p(x)-b}\left(1+\tilde{m}(x)/2^{m}\right)$$

providing enhanced dynamic range for heavy-tailed distributions, as is critical in diffusion models (Chen et al., 13 Aug 2024, Zhao et al., 27 May 2025).

  • Non-uniform and Companding Quantization: Learnable companding functions $f_\Theta$ are optimized along with network weights to assign quantization bins non-uniformly, adapting the density of levels to the importance of different value ranges (Yamamoto, 2021).
  • Block and Clustered Quantization: Inputs are partitioned into blocks or clusters, each getting its own optimal codebook or scale, minimizing mean squared error. With small blocks ($L_b = 8$–$32$), overhead is negligible and effective bits are 4.25–4.5 (Elangovan et al., 7 Feb 2025, Dettmers et al., 2022).
  • Balancing and Entropy Maximization: Uniform quantization wastes bins when distributions are highly peaked. Balanced quantization first equalizes the cumulative distribution (by percentiles or recursive mean split), then applies a uniform quantizer, restoring the effective bitwidth to the theoretical maximum and boosting accuracy (Zhou et al., 2017).
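
As a concrete reference for the integer and FP4 formats above, the following NumPy sketch implements per-group symmetric INT4 quantize/dequantize and round-to-nearest onto a scaled E2M1 grid. The group size, absmax-based scaling, and per-tensor FP4 scale are illustrative choices rather than the setup of any cited paper.

```python
import numpy as np

# Minimal sketches of the two numeric formats above; group size, absmax
# scaling, and the per-tensor FP4 scale are illustrative assumptions.

def quantize_int4_per_group(w: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit integer quantization with one scale per group.

    Returns codes in [-8, 7] plus per-group scales: q = clip(round(x/s), -8, 7).
    """
    flat = w.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # absmax -> level 7
    scale = np.where(scale == 0.0, 1.0, scale)              # guard all-zero groups
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4_per_group(q, scale, shape):
    """x_hat = s * q, reshaped back to the original tensor shape."""
    return (q.astype(np.float32) * scale).reshape(shape)

# Nominal positive magnitudes representable by an FP4 E2M1 code
# (1 sign, 2 exponent, 1 mantissa bit): subnormal 0.5, then 1, 1.5, 2, 3, 4, 6.
_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_e2m1(x: np.ndarray) -> np.ndarray:
    """Round-to-nearest onto the scaled E2M1 grid (per-tensor scale)."""
    scale = np.abs(x).max() / _E2M1[-1]                     # map absmax -> 6
    grid = np.concatenate([-_E2M1[::-1], _E2M1]) * scale    # 15 distinct levels
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)

    q, s = quantize_int4_per_group(w)
    w_int4 = dequantize_int4_per_group(q, s, w.shape)
    w_fp4 = quantize_fp4_e2m1(w)

    print("INT4 per-group MSE:", float(np.mean((w - w_int4) ** 2)))
    print("FP4 E2M1 MSE:      ", float(np.mean((w - w_fp4) ** 2)))
```

With one fp16 scale per group of 32 values, storage works out to 4 + 16/32 = 4.5 effective bits per value, consistent with the 4.25–4.5 range quoted above.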

2. Static vs. Dynamic and Post-Training Quantization Procedures

Techniques differ in when and how calibration and quantization are performed, i.e., whether scaling parameters are fixed statically from calibration data or computed dynamically at runtime, and whether quantization is applied post-training or during training.

3. Theoretical Motivation and Empirical Scaling Laws

Analysis across model families and architectures has established:

  • 4 bits as the optimal trade-off: For a fixed total number of model bits ($b \cdot N$), inference scaling laws show that $b = 4$ almost universally maximizes accuracy for LLMs. Accuracy collapses below 4 bits, even at constant total bits, due to excessive quantization noise (Dettmers et al., 2022):

$$P(N, b) \approx \alpha \log_{10} N + C(b)$$

with $C(b)$ optimized for $b = 4$ across all tested scales.
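
As a worked comparison (treating the fitted form as exact), fix the total bit budget $B = b \cdot N$: a 4-bit model can then hold twice as many parameters as an 8-bit one, and

$$P(2N,\,4) - P(N,\,8) = \alpha\log_{10}(2N) - \alpha\log_{10}(N) + C(4) - C(8) = \alpha\log_{10}2 + C(4) - C(8),$$

so the 4-bit model wins whenever the precision offset $C(8) - C(4)$ is smaller than the $\alpha\log_{10}2 \approx 0.30\,\alpha$ gained by doubling $N$, which is the regime the cited fits report across tested scales.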

4. Dealing with Outliers and Distributional Pathologies

Modern LLMs and diffusion models exhibit highly non-Gaussian, heavy-tailed, or anomalous activation and weight distributions, driving several technical advances:

  • Activation Outlier Smoothing: Preprocessing (e.g., Hadamard transforms in BitNet v2) or adaptive smoothing on outlier channels reduces dynamic range and supports accurate INT4/FP4 activation quantization (Wang et al., 25 Apr 2025, Zhang et al., 14 Jun 2024); a small numerical sketch of the rotation idea follows this list.
  • Coarse-to-Fine Preprocessing and Outlier Handling: CBQ applies quartile/IQR-based detection (coarse) and fine clustering/pruning of outliers before quantization (Ding et al., 2023). Similar two-stage schemes (e.g., QuaRTZ, QRazor) preserve both outlier precision and LSBs for fine details, critical in diffusion models and transformers (Kim et al., 30 Sep 2025, Lee et al., 23 Jan 2025).
  • Hessian-based Adaptive and Second-Order Compensation: For weight quantization, compensation methods (GPTQ style) or low-rank rounding parameterization (CBQ, AdaRound) minimize loss increase due to quantized weights, especially in outlier-dominated blocks (Zhang et al., 14 Jun 2024, Ding et al., 2023).
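
The rotation idea behind Hadamard-based smoothing can be seen in a few lines of NumPy: multiplying an activation vector by an orthonormal Hadamard matrix spreads the energy of a few outlier channels across all channels, shrinking the absmax and hence the quantization step. The sketch below is illustrative only, using a Sylvester-construction Hadamard matrix and plain per-tensor absmax INT4 rounding; it is not the actual BitNet v2 pipeline.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix by Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Per-tensor symmetric INT4 quantize-dequantize (absmax scaling)."""
    s = np.abs(x).max() / 7.0
    return np.clip(np.round(x / s), -8, 7) * s

rng = np.random.default_rng(0)
d = 128
x = rng.normal(size=d)
x[:4] *= 50.0                        # a few outlier channels dominate the range

H = hadamard(d)
x_rot = H @ x                        # rotation spreads outlier energy evenly

err_plain = np.mean((x - int4_roundtrip(x)) ** 2)
err_rot = np.mean((x - H.T @ int4_roundtrip(x_rot)) ** 2)  # quantize, rotate back
print(f"INT4 MSE without rotation: {err_plain:.4f}")
print(f"INT4 MSE with rotation:    {err_rot:.4f}")         # typically much lower
```

In deployed kernels the inverse rotation is typically folded into adjacent weights; it is applied explicitly here only to measure the round-trip error against the original activations.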

5. Advanced Non-uniform, Floating-Point, and Companded Methods

Integer quantization approaches are challenged by the need for dynamic range and handling non-uniform value distributions:

  • Block Clustered and Codebook Methods: BCQ/LO-BCQ assigns dedicated learned codebooks per block, enabling extremely tight mean squared error with minimal (<1%) accuracy loss in W4A4 LLM inference (Elangovan et al., 7 Feb 2025). These methods also yield low-overhead static schemes suitable for deployment; a generic per-block codebook sketch follows this list.
  • Floating-Point 4-Bit Quantization (FP4): For diffusion models, FP4 with per-tensor format selection (E2M1, E1M2, E3M0) and “rounding learning” (minimizing layerwise output error with learned rounding parameters) achieves FID/perplexity nearly matching FP32, vastly outperforming INT4 in high-fidelity settings (Chen et al., 13 Aug 2024, Zhao et al., 27 May 2025).
  • Non-uniform and Companded Quantization: LCQ and logarithmic/quantile quantization adapt quantization bins non-uniformly to value density. When combined with error-feedback retraining, these approaches close the gap to full precision in translation and image recognition (Yamamoto, 2021, Aji et al., 2019).
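
To make the per-block codebook idea concrete, the sketch below fits a 16-level scalar codebook to each block with Lloyd (1-D k-means) iterations and compares reconstruction error against a plain uniform grid. The block size, initialization, and iteration count are illustrative assumptions; this is a generic codebook fit, not the actual BCQ/LO-BCQ or LCQ procedure.

```python
import numpy as np

def lloyd_codebook(x: np.ndarray, bits: int = 4, iters: int = 20) -> np.ndarray:
    """Fit a 2**bits-level scalar codebook to x with Lloyd (1-D k-means) steps.

    Starting from a uniform grid, each iteration reassigns values to their
    nearest level and moves every level to the mean of its cluster, which
    monotonically lowers the mean squared reconstruction error.
    """
    levels = np.linspace(x.min(), x.max(), 2 ** bits)
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(levels.size):
            members = x[idx == k]
            if members.size:
                levels[k] = members.mean()
    return levels

def assign_to_codebook(x: np.ndarray, levels: np.ndarray):
    """Return 4-bit codes (indices into the codebook) and the reconstruction."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), levels[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.laplace(scale=0.02, size=8192)      # heavy-tailed, weight-like values
    block_size = 1024                           # illustrative block size

    mse_codebook, mse_uniform = [], []
    for blk in w.reshape(-1, block_size):
        levels = lloyd_codebook(blk)            # per-block learned codebook
        _, blk_hat = assign_to_codebook(blk, levels)
        mse_codebook.append(np.mean((blk - blk_hat) ** 2))

        grid = np.linspace(blk.min(), blk.max(), 16)  # plain uniform 16 levels
        _, blk_uni = assign_to_codebook(blk, grid)
        mse_uniform.append(np.mean((blk - blk_uni) ** 2))

    print("uniform-grid MSE:  ", float(np.mean(mse_uniform)))
    print("Lloyd-codebook MSE:", float(np.mean(mse_codebook)))  # never higher
```

Because the Lloyd iterations start from the uniform grid and only ever lower the assignment error, the learned codebook's MSE is never worse than the uniform baseline on the same block.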

6. Integration into Models and Practical Implementation

The deployment of 4-bit quantization is now highly optimized.

7. Hardware, Complexity, and Implementation Trade-offs

Designs are increasingly hardware-aware.


In summary, 4-bit quantization, employing integer, floating-point, block, and non-uniform schemes, enables efficient compression and acceleration with near full-precision accuracy, provided quantizer selection is tailored to distributional structure and outlier control. Research demonstrates the universality of 4 bits as the optimal practical precision for inference-time scaling, and a convergence of best practices around entropy maximization, distribution-aware calibration, and scalable low-bit arithmetic (Dettmers et al., 2022, Elangovan et al., 7 Feb 2025, Lee et al., 23 Jan 2025, Ding et al., 2023, Zhang et al., 14 Jun 2024, Abdolrashidi et al., 2021, Zhou et al., 2017).
