4-Bit Model Quantization
- 4-bit quantization is a process that maps neural network weights, activations, and gradients into 4-bit representations, balancing compression and accuracy.
- Advanced techniques like per-channel symmetric quantization, FP4, and non-uniform companding enable efficient deployment in LLMs, CNNs, and diffusion models.
- State-of-the-art methods achieve up to 3.7× compute gains and 8× model compression while maintaining an accuracy drop typically below 1–2%.
4-bit quantization is the mapping of neural network weights, activations, and (in training scenarios) gradients into representations with only 4 bits per value, to optimize memory footprint and computational efficiency with minimal loss in inference or training accuracy. This precision is now regarded as a universal “sweet spot” for deep networks across domains—enabling at least 8× model compression and up to 3.7× compute throughput gains versus full precision—while typically incurring <1–2% accuracy drop even for the largest models. Research has converged around both per-channel symmetric quantization and advanced non-uniform codebook, floating-point, and companding methods, enabling state-of-the-art performance in LLMs, convolutional nets, diffusion models, RNNs, and transformers.
1. Quantization Algorithms and Numeric Formats
The core task is to map floating-point values either to signed-integer codes or to low-bit floating-point codes, according to various schemes:
- Uniform/Linear Integer Quantization: For a 4-bit signed integer, e.g., $q = \mathrm{clip}(\mathrm{round}(x/s), -8, 7)$, with the scale $s$ chosen per tensor, per channel, or per group; per-channel and per-group scaling are standard (Abdolrashidi et al., 2021, Zhang et al., 14 Jun 2024, Lee et al., 23 Jan 2025). A minimal sketch appears after this list.
- Floating-Point (FP4) Quantization: The 4 bits are split among sign, exponent, and mantissa, e.g., E2M1 (2 exponent bits, 1 mantissa bit), representing normal values of the form $(-1)^s\,2^{e-\mathrm{bias}}(1 + m/2)$ and providing enhanced dynamic range for heavy-tailed distributions, which is critical in diffusion models (Chen et al., 13 Aug 2024, Zhao et al., 27 May 2025).
- Non-uniform and Companding Quantization: Learnable companding functions are optimized along with network weights to assign quantization bins non-uniformly, adapting density of levels to the importance of different value ranges (Yamamoto, 2021).
- Block and Clustered Quantization: Inputs are partitioned into blocks or clusters, each with its own optimal codebook or scale, minimizing mean squared error. With small blocks (on the order of tens of values), the metadata overhead is small and the effective bit rate is 4.25–4.5 bits per value (Elangovan et al., 7 Feb 2025, Dettmers et al., 2022).
- Balancing and Entropy Maximization: Uniform quantization wastes bins when distributions are highly peaked. Balanced quantization first equalizes the cumulative distribution (by percentiles or recursive mean split), then applies a uniform quantizer, restoring the effective bitwidth to the theoretical maximum and boosting accuracy (Zhou et al., 2017).
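To make the uniform per-channel scheme above concrete, below is a minimal NumPy sketch of symmetric per-channel 4-bit quantization and dequantization; the function names are illustrative rather than taken from any cited implementation, and codes are stored one per int8 for readability instead of being bit-packed.

```python
import numpy as np

def quantize_int4_per_channel(w: np.ndarray):
    """Symmetric per-output-channel 4-bit quantization of a 2-D weight matrix."""
    qmax = 7  # symmetric signed 4-bit range [-7, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # one scale per row (channel)
    scale = np.where(scale == 0, 1.0, scale)               # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int4_per_channel(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate floating-point weight matrix."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 16).astype(np.float32)
    q, s = quantize_int4_per_channel(w)
    print("max abs error:", np.abs(w - dequantize_int4_per_channel(q, s)).max())
```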
2. Static vs. Dynamic and Post-Training Quantization Procedures
Techniques differ in when and how calibration and quantization are performed:
- Static per-channel quantization is now common for transformers/LLMs: weight and activation scales are computed once per-channel or per-head on calibration data, enabling fully static integer-only inference (Wang et al., 7 Mar 2025, Lee et al., 23 Jan 2025).
- Dynamic or per-token quantization, recalibrating quantization parameters online, remains more accurate for highly dynamic data but is expensive for long-form generation. Migration (e.g., MergeQuant's Quantization Step Migration) absorbs dynamic quant/dequant costs into adjacent modules (e.g., RMSNorm) and relies entirely on static INT4 kernels (Wang et al., 7 Mar 2025).
- Post-Training Quantization (PTQ) approaches such as QuaRTZ use calibration sets and multi-stage quantization (e.g., 8-bit min–max followed by leading-zero suppression for 4-bit) to compress models without fine-tuning (Kim et al., 30 Sep 2025).
- Quantization-Aware Training (QAT): Training is performed in a simulated quantized regime, typically via the straight-through estimator for quantization gradients (Abdolrashidi et al., 2021, Elangovan et al., 7 Feb 2025, Ding et al., 2022, Fasoli et al., 2021).
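As a concrete illustration of the QAT item above, here is a minimal PyTorch sketch of fake quantization with a straight-through estimator; the function name and the fixed per-tensor scale are illustrative assumptions, not the procedure of any specific cited work.

```python
import torch

def fake_quant_int4_ste(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Simulate symmetric 4-bit quantization in the forward pass.

    The (x_q - x).detach() trick makes the backward pass treat round/clamp
    as identity (straight-through estimator), so gradients still reach x.
    """
    qmax = 7
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (x_q - x).detach()

# Weights stay in full precision; the forward pass sees their 4-bit values.
w = torch.randn(64, 64, requires_grad=True)
scale = w.detach().abs().max() / 7
loss = fake_quant_int4_ste(w, scale).sum()
loss.backward()
print(w.grad.shape)  # gradients flow despite the non-differentiable rounding
```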
3. Theoretical Motivation and Empirical Scaling Laws
Analysis across model families and architectures has established:
- 4 bits as the optimal trade-off: For a fixed total bit budget (number of parameters × bits per parameter), inference scaling laws show that 4-bit precision almost universally maximizes accuracy for LLMs; below 4 bits, accuracy collapses even at constant total bits due to excessive quantization noise, and the optimum lies at roughly 4 bits across all tested model scales (Dettmers et al., 2022).
- Block size impacts effective precision: Small blocks/local codebooks are necessary at 4 bits to contain outliers and achieve optimal perplexity (Dettmers et al., 2022, Elangovan et al., 7 Feb 2025); a worked example of the resulting effective bit rate follows this list.
- Entropy maximization via balanced bins: Ensuring each quantization level is equally populated recovers the full potential of 4 bits and correlates strongly with test accuracy (Zhou et al., 2017).
- Empirical accuracy drops: In classification, the typical loss is within 1–2% for ResNets and MobileNet-V2 (Abdolrashidi et al., 2021, Yamamoto, 2021). For LLMs, state-of-the-art PTQ schemes keep the drop in MMLU/LM-harness/zero-shot accuracy within roughly 1–2% at W4A4 (Elangovan et al., 7 Feb 2025, Lee et al., 23 Jan 2025, Ding et al., 2023).
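As a worked example of how block size sets effective precision, assume (for illustration only) that each block of $B$ values stores one 8-bit scale in addition to its 4-bit codes; the cited schemes may store different per-block metadata.

```latex
\[
  b_{\mathrm{eff}} \;=\; 4 + \frac{8}{B},
  \qquad B = 32 \;\Rightarrow\; b_{\mathrm{eff}} = 4.25,
  \qquad B = 16 \;\Rightarrow\; b_{\mathrm{eff}} = 4.5 .
\]
```

Under this assumption, block sizes of a few dozen values reproduce the 4.25–4.5 effective bits per value quoted for small-block schemes above.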
4. Dealing with Outliers and Distributional Pathologies
Modern LLMs and diffusion models exhibit highly non-Gaussian, heavy-tailed, or anomalous activation and weight distributions, driving several technical advances:
- Activation Outlier Smoothing: Preprocessing (e.g., Hadamard transforms in BitNet v2) or adaptive smoothing of outlier channels reduces dynamic range and supports accurate INT4/FP4 activation quantization (Wang et al., 25 Apr 2025, Zhang et al., 14 Jun 2024); a toy sketch of the rotation idea follows this list.
- Coarse-to-Fine Preprocessing and Outlier Handling: CBQ applies quartile/IQR-based detection (coarse) and fine clustering/pruning of outliers before quantization (Ding et al., 2023). Similar two-stage schemes (e.g., QuaRTZ, QRazor) preserve both outlier precision and LSBs for fine details, critical in diffusion models and transformers (Kim et al., 30 Sep 2025, Lee et al., 23 Jan 2025).
- Hessian-based Adaptive and Second-Order Compensation: For weight quantization, compensation methods (GPTQ style) or low-rank rounding parameterization (CBQ, AdaRound) minimize loss increase due to quantized weights, especially in outlier-dominated blocks (Zhang et al., 14 Jun 2024, Ding et al., 2023).
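To illustrate the rotation-based outlier smoothing referenced above (e.g., Hadamard preprocessing), the sketch below applies an orthonormal Hadamard rotation to activations containing one outlier channel and reports the change in dynamic range; the function name and the toy setup are assumptions for illustration, not code from the cited methods.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_hadamard(x: np.ndarray) -> np.ndarray:
    """Apply an orthonormal Hadamard rotation across the channel dimension.

    Because H / sqrt(d) is orthogonal, the rotation can be folded into the
    adjacent weight matrix so the matmul result is unchanged, while the
    rotated activations spread outlier energy across all channels.
    """
    d = x.shape[-1]                       # channel count; must be a power of two here
    h = hadamard(d).astype(np.float32) / np.sqrt(d)
    return x @ h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(128, 64)).astype(np.float32)
    x[:, 3] *= 50.0                       # inject one heavy outlier channel
    x_rot = rotate_hadamard(x)
    print("max |x| before rotation:", np.abs(x).max())
    print("max |x| after  rotation:", np.abs(x_rot).max())
    # The reduced dynamic range makes a shared 4-bit scale far less lossy.
```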
5. Advanced Non-uniform, Floating-Point, and Companded Methods
Integer quantization approaches are challenged by the need for dynamic range and handling non-uniform value distributions:
- Block Clustered and Codebook Methods: BCQ/LO-BCQ assign a dedicated learned codebook to each block, achieving very low mean squared error with minimal (<1%) accuracy loss in W4A4 LLM inference (Elangovan et al., 7 Feb 2025). These methods also yield low-overhead static schemes suitable for deployment.
- Floating-Point 4-Bit Quantization (FP4): For diffusion models, FP4 with per-tensor format selection (E2M1, E1M2, E3M0) and “rounding learning” (minimizing layerwise output error with learned rounding parameters) achieves FID/perplexity nearly matching FP32, vastly outperforming INT4 in high-fidelity settings (Chen et al., 13 Aug 2024, Zhao et al., 27 May 2025); a toy E2M1 quantizer is sketched after this list.
- Non-uniform and Companded Quantization: LCQ and logarithmic/quantile quantization adapt quantization bins non-uniformly to value density. When combined with error-feedback retraining, these approaches close the gap to full precision in translation and image recognition (Yamamoto, 2021, Aji et al., 2019).
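To make the FP4 discussion concrete, the sketch below quantizes a tensor onto a per-tensor-scaled E2M1 grid by nearest-neighbor rounding; the listed E2M1 magnitudes follow one common convention (no inf/NaN encodings), and the function name is an illustrative assumption rather than an API from the cited papers.

```python
import numpy as np

# Positive representable E2M1 magnitudes under one common convention
# (1 sign, 2 exponent, 1 mantissa bit; no inf/NaN encodings):
E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_GRID = np.concatenate([-E2M1_POS[1:][::-1], E2M1_POS])   # 15 distinct values

def quantize_fp4_e2m1(x: np.ndarray):
    """Round-to-nearest quantization onto a per-tensor-scaled E2M1 grid."""
    max_abs = np.abs(x).max()
    scale = max_abs / E2M1_POS[-1] if max_abs > 0 else 1.0
    grid = E2M1_GRID * scale
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)          # nearest grid point
    return grid[idx], scale

if __name__ == "__main__":
    x = np.random.randn(1000).astype(np.float32) * 2.5
    x_q, s = quantize_fp4_e2m1(x)
    print("RMS error:", np.sqrt(np.mean((x - x_q) ** 2)))
```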
6. Integration into Models and Practical Implementation
The deployment of 4-bit quantization is now highly optimized:
- Transformers & LLMs: MergeQuant, QRazor, BCQ, CBQ, and QQQ provide state-of-the-art PTQ and fast static schemes with per-channel calibration, codebook migration, and dedicated INT4 or FP4 GEMM kernels, consistently yielding 1.8–2.7× throughput gains and quantizing Llama-2/3 with <1–2% accuracy loss (Wang et al., 7 Mar 2025, Lee et al., 23 Jan 2025, Elangovan et al., 7 Feb 2025, Ding et al., 2023, Zhang et al., 14 Jun 2024).
- CNNs & Mobile Architectures: Methods such as adaptive shift+scale, bias correction, per-channel bit allocation, and learnable companders are critical for retaining state-of-the-art accuracy on ImageNet and COCO in ResNet, MobileNet-V2, EfficientNet, and YOLO pipelines (Yamamoto, 2021, Chin et al., 2021, Abdolrashidi et al., 2021, Banner et al., 2018).
- Speech and Sequence Models: LSTM and Conformer ASR architectures show that 4-bit QAT with per-layer quantizer selection (statistical, adaptive, learnable clipping), selective mixed-precision for sensitive layers, and integer-native operators support lossless performance and up to 5–7× compression (Fasoli et al., 2021, Fasoli et al., 2022, Ding et al., 2022).
- Diffusion Models: QuaRTZ, FP4/PTQ with rounding learning, and mixup-sign FP4 quantization empirically outperform INT4 schemes by large FID and sFID margins, critical for high-fidelity image synthesis (Chen et al., 13 Aug 2024, Kim et al., 30 Sep 2025, Zhao et al., 27 May 2025).
7. Hardware, Complexity, and Implementation Trade-offs
Designs are increasingly hardware-aware:
- Arithmetic and GEMM Kernels: INT4 and FP4 matrix multiply kernels are now available on modern GPUs/TPUs, with custom fused dequant, per-channel scale support, and block-wise 'salient bit' operations (QRazor, QuaRTZ) (Lee et al., 23 Jan 2025, Kim et al., 30 Sep 2025, Zhang et al., 14 Jun 2024).
- Efficiency Gains: INT4 accelerators yield 2–3.7× speedups over FP16, 8× memory reduction, 60% area/power reduction vs. INT8, and additional gains from quantized KV caches and per-layer block strategies (Zhang et al., 14 Jun 2024, Lee et al., 23 Jan 2025); a toy bit-packing sketch after this list shows where the 4-bit memory footprint comes from.
- Algorithmic Complexity: Modern balanced and clustered quantizers incur negligible training-time overhead (a few percent or less). All preprocessing can be performed offline, and the GEMM operators run at native integer/floating-point speed with fused scaling (Zhou et al., 2017, Elangovan et al., 7 Feb 2025, Wang et al., 7 Mar 2025, Lee et al., 23 Jan 2025, Chen et al., 13 Aug 2024).
- Inference Deployment: Static schemes (QRazor, MergeQuant, LO-BCQ) remove reliance on per-token dynamic quantization and batch normalization, enabling batched inference and real-time LLM and diffusion generation (Wang et al., 7 Mar 2025, Lee et al., 23 Jan 2025, Elangovan et al., 7 Feb 2025).
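As a final deployment-level illustration of where the 4-bit memory savings come from, the sketch below packs two signed 4-bit codes into each byte and unpacks them again, a toy version of the storage layout that fused INT4 GEMM kernels decode on the fly; the function names are illustrative.

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack pairs of signed 4-bit codes (range [-8, 7]) into uint8 bytes."""
    assert q.size % 2 == 0
    u = (q.astype(np.int16) & 0xF).astype(np.uint8)       # two's-complement nibbles
    lo, hi = u[0::2], u[1::2]
    return lo | (hi << 4)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover signed 4-bit codes from packed bytes."""
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return np.where(q >= 8, q - 16, q).astype(np.int8)    # sign-extend the nibbles

if __name__ == "__main__":
    q = np.random.randint(-8, 8, size=1024, dtype=np.int8)
    packed = pack_int4(q)
    assert np.array_equal(unpack_int4(packed), q)
    print(f"{q.nbytes} bytes unpacked -> {packed.nbytes} bytes packed")
```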
In summary, 4-bit quantization—employing integer, floating-point, block, and non-uniform schemes—enables efficient compression and acceleration with near full-precision accuracy, provided quantizer selection is tailored to distributional structure and outlier control. Research demonstrates the universality of 4 bits as the optimal practical precision for inference-time scaling, and a convergence of best practices around entropy, distribution-aware calibration, and scalable low-bit arithmetic (Dettmers et al., 2022, Elangovan et al., 7 Feb 2025, Lee et al., 23 Jan 2025, Ding et al., 2023, Zhang et al., 14 Jun 2024, Abdolrashidi et al., 2021, Zhou et al., 2017).