INT4 Weight-Only Quantization
- INT4 weight-only quantization is a technique that converts full-precision weights into 4-bit integers while retaining higher-precision activations, facilitating significant model compression.
- It leverages symmetric and asymmetric quantization methods along with optimization algorithms like LAPQ, COMQ, and GPTQ to minimize quantization noise.
- Practical implementations demonstrate near-baseline accuracy with reduced memory footprint and accelerated inference in tasks involving large language and vision models.
INT4 weight-only quantization refers to the post-training reduction of neural network weight tensors to 4-bit integer representations, with activations retained in higher-precision formats (typically FP16/BF16). This approach enables aggressive model compression and bandwidth reduction, facilitating deployment and inference acceleration, particularly in LLMs and vision architectures. Recent research demonstrates that, when combined with advanced calibration and compensation schemes, INT4 weight-only quantization can preserve nearly all baseline accuracy across a wide range of architectures, while incurring minimal computational and integration overhead.
1. Quantizer Formulation and Mathematical Framework
Weight-only quantization maps each weight tensor of a neural network to a grid of signed 4-bit integer values using a uniform quantizer. Let $w$ denote a full-precision (FP32/FP16) weight and $\hat{w}$ its quantized (dequantized) representation.
Commonly, symmetric, per-channel or per-group approaches are used:
- Symmetric quantization (per-channel): $\hat{w}_{c,i} = s_c \cdot \mathrm{clip}\big(\mathrm{round}(w_{c,i}/s_c),\,-8,\,7\big)$, with $s_c = \max_i |w_{c,i}| / 7$. Here, $s_c$ is the per-channel scale and $c$ indexes an output channel (Zhang et al., 2023).
- Asymmetric quantization (per-row, unsigned): $\hat{w}_{r,i} = s_r \cdot \mathrm{clip}\big(\mathrm{round}((w_{r,i} - z_r)/s_r),\,0,\,15\big) + z_r$, with $s_r = (\max_i w_{r,i} - \min_i w_{r,i}) / 15$ and zero-point $z_r = \min_i w_{r,i}$.
Many practical systems use per-group scales, where blocks of weights (e.g., 64–256 elements) share a single scale parameter, keeping metadata and kernel overhead low while limiting accuracy loss (Kim et al., 2023, Kurtic et al., 2024).
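As a concrete illustration, the following is a minimal NumPy sketch of symmetric per-group INT4 quantization and dequantization following the formulation above; the group size and function names are illustrative and not taken from any cited implementation.

```python
import numpy as np

def quantize_int4_per_group(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group INT4 quantization of a 2-D weight matrix.

    Each row is split into contiguous groups of `group_size` weights;
    every group shares one floating-point scale. Codes lie in [-8, 7].
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must be divisible by group_size"
    groups = w.reshape(rows, cols // group_size, group_size)

    # Scale so the largest magnitude in each group maps to +/-7.
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-8)                      # avoid division by zero
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes.reshape(rows, cols), scales.squeeze(-1)

def dequantize_int4_per_group(codes: np.ndarray, scales: np.ndarray,
                              group_size: int = 128) -> np.ndarray:
    """Reconstruct approximate FP32 weights from INT4 codes and per-group scales."""
    rows, cols = codes.shape
    groups = codes.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (groups * scales[..., None]).reshape(rows, cols)

# Quantization error stays small relative to the weight magnitudes.
w = np.random.randn(16, 256).astype(np.float32)
q, s = quantize_int4_per_group(w)
print("max abs error:", np.abs(w - dequantize_int4_per_group(q, s)).max())
```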
2. Optimization Algorithms and Calibration Procedures
The reduction in representational bandwidth at INT4 introduces significant quantization noise. Addressing this requires loss-minimizing parameter search rather than naive range-based quantization.
1. Powell's Joint Optimization (LAPQ):
The quantizer step sizes $\{s_\ell\}$ are optimized by directly minimizing the post-quantization task loss, $\min_{\{s_\ell\}} \mathcal{L}(\{s_\ell\})$, where the loss is averaged over a small calibration set. LAPQ constructs a trajectory of $\ell_p$-norm-optimal scales, fits a quadratic around it, then applies joint Powell's method for global minimization (Nahshan et al., 2019).
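A highly simplified sketch of the idea, not the authors' implementation: per-layer step sizes are refined jointly with SciPy's derivative-free Powell optimizer, where `run_model` stands for a user-supplied routine that evaluates the calibration loss with quantize-dequantized weights (LAPQ's quadratic-fit initialization is omitted).

```python
import numpy as np
from scipy.optimize import minimize

def quantize_with_scale(w: np.ndarray, s: float) -> np.ndarray:
    """Symmetric INT4 quantize-dequantize of a tensor with a single step size s."""
    s = max(abs(float(s)), 1e-8)
    return s * np.clip(np.round(w / s), -8, 7)

def calibration_loss(scales, layers, run_model):
    """Quantize every layer with its candidate step size and return the task
    loss on a small calibration batch (run_model is user-supplied)."""
    q_layers = [quantize_with_scale(w, s) for w, s in zip(layers, scales)]
    return run_model(q_layers)

def lapq_style_search(layers, run_model):
    # Seed with plain max-abs scales, then refine all layers' step sizes
    # jointly with Powell's derivative-free method, as LAPQ does after its
    # quadratic-fit initialization.
    s0 = np.array([np.abs(w).max() / 7.0 for w in layers])
    result = minimize(calibration_loss, s0, args=(layers, run_model),
                      method="Powell")
    return result.x
```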
2. Error-Minimizing Coordinate Descent (COMQ):
For each column $w$ of a layer's weight matrix, coordinate-wise updates alternately optimize the 4-bit codes $q \in \{-8,\dots,7\}$ and the scaling factor $s$ in closed form, minimizing the layer reconstruction error $\lVert Xw - X(s\,q)\rVert_2^2$ over the calibration activations $X$.
Greedy coordinate update order on large-magnitude weights accelerates convergence (Zhang et al., 2024).
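In the same spirit, the sketch below alternates a closed-form scale update with re-rounding of the codes for a single column; the greedy update ordering and exact per-coordinate code updates of COMQ are not reproduced here.

```python
import numpy as np

def comq_style_column(x: np.ndarray, w: np.ndarray, iters: int = 5):
    """Quantize one weight column w (shape [d]) to INT4 codes q and a scale s
    by alternately reducing ||x @ w - x @ (s * q)||^2 over calibration
    activations x (shape [n, d])."""
    max_abs = np.abs(w).max()
    s = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / s), -8, 7)
    target = x @ w

    for _ in range(iters):
        # Closed-form scale update: least-squares fit of s for fixed codes q.
        y = x @ q
        denom = float(y @ y)
        if denom > 1e-12:
            s_new = float(target @ y) / denom
            if abs(s_new) > 1e-12:
                s = s_new
        # Code update with the scale fixed; plain re-rounding is a
        # simplification of COMQ's greedy per-coordinate updates.
        q = np.clip(np.round(w / s), -8, 7)
    return q.astype(np.int8), s
```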
3. Second-Order, Hessian-Aware Quantization (GPTQ):
Weights are quantized in groups (e.g., G=128) by blockwise MSE minimization of the layer output, using second-order (Hessian) information of the layer reconstruction error computed over a small calibration set (Kurtic et al., 2024, Yao et al., 2023).
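A compact sketch of the error-compensation mechanism underlying GPTQ-style quantization is given below; it assumes a fixed per-row symmetric scale and omits grouping, blocking, and activation ordering, so it should be read as an illustration rather than the published algorithm.

```python
import numpy as np

def gptq_style_quantize(W: np.ndarray, X: np.ndarray, damp: float = 0.01):
    """Simplified GPTQ-style INT4 quantization of W (shape [out, in]) with
    calibration activations X (shape [n, in]). Columns are quantized one at
    a time and the induced error is propagated to the remaining columns via
    the Cholesky factor of the inverse Hessian H = X^T X."""
    W = W.astype(np.float64).copy()
    n_out, n_in = W.shape

    H = X.T.astype(np.float64) @ X.astype(np.float64)
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)       # damping for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T           # upper factor: H^-1 = U^T U

    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / 7.0, 1e-12)
    Q = np.zeros_like(W, dtype=np.int8)

    for j in range(n_in):
        q = np.clip(np.round(W[:, j:j + 1] / scale), -8, 7)
        Q[:, j] = q[:, 0]
        err = (W[:, j:j + 1] - q * scale) / U[j, j]
        W[:, j + 1:] -= err @ U[j:j + 1, j + 1:]         # compensate later columns
        W[:, j:j + 1] = q * scale
    return Q, scale
```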
4. Fine-Grained Adaptive Quantization (FineQuant):
Columns are split into blocks (typically B=64), adaptively refining group granularity when dynamic ranges shrink excessively (threshold α≈0.8–0.9) (Kim et al., 2023).
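The adaptive-granularity idea can be sketched as follows: a block is split into finer groups only when the finer groups would noticeably tighten the dynamic range. The recursive splitting rule and the role of α below are modeled on the description above, not on released code.

```python
import numpy as np

def adaptive_groups(w: np.ndarray, block: int = 64, alpha: float = 0.85,
                    min_block: int = 16):
    """Return (start, end) group boundaries for a 1-D weight slice. A block is
    split in half whenever the halves' average max-abs range falls below
    alpha times the parent range, i.e. finer scales would noticeably tighten
    the quantization grid."""
    groups = []

    def visit(lo, hi):
        parent_range = np.abs(w[lo:hi]).max()
        if hi - lo <= min_block or parent_range == 0:
            groups.append((lo, hi))
            return
        mid = (lo + hi) // 2
        child_range = 0.5 * (np.abs(w[lo:mid]).max() + np.abs(w[mid:hi]).max())
        if child_range < alpha * parent_range:
            visit(lo, mid)
            visit(mid, hi)
        else:
            groups.append((lo, hi))

    for start in range(0, len(w), block):
        visit(start, min(start + block, len(w)))
    return groups
```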
3. Loss Landscape and Non-Separability at Low Bitwidth
Empirical and theoretical analyses show that INT4 quantization noise induces pronounced layer-to-layer coupling, rendering the loss landscape non-separable. A second-order Taylor expansion of the loss in the quantization perturbation $\epsilon$ quantifies this: $\Delta\mathcal{L} \approx g^\top \epsilon + \tfrac{1}{2}\,\epsilon^\top H\,\epsilon$, where $g$ is the gradient and $H$ the Hessian of the loss with respect to the weights. For INT8, the quadratic coupling term is negligible. At INT4, high curvature and sharp valleys necessitate multi-layer (joint) optimization (Nahshan et al., 2019). Curvature measurements on ResNet-18 confirm this regime shift between 4-bit and 2-bit quantization, justifying the more sophisticated search methods employed in modern PTQ frameworks.
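The coupling can be made concrete with a toy experiment: on a small synthetic two-layer network, compare the loss increase from quantizing both layers jointly with the sum of the per-layer increases; the gap corresponds to the cross term of the quadratic expansion and tends to grow as the bit-width drops. The network and data here are synthetic placeholders, not taken from the cited study.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantize-dequantize at the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    return s * np.clip(np.round(w / s), -qmax - 1, qmax)

# Tiny two-layer network on synthetic regression data (FP loss is exactly zero,
# so each measured loss equals the loss *increase* due to quantization).
X = rng.standard_normal((512, 32))
W1, W2 = rng.standard_normal((32, 64)), rng.standard_normal((64, 1))
y = np.tanh(X @ W1) @ W2

def loss(w1, w2):
    return float(np.mean((np.tanh(X @ w1) @ w2 - y) ** 2))

for bits in (8, 4):
    d1 = loss(quant(W1, bits), W2)                 # layer 1 quantized alone
    d2 = loss(W1, quant(W2, bits))                 # layer 2 quantized alone
    d12 = loss(quant(W1, bits), quant(W2, bits))   # both layers quantized
    print(f"{bits}-bit: joint increase {d12:.4f} vs separable sum {d1 + d2:.4f}")
```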
4. Practical Algorithms, Hardware Implementation, and Inference
A variety of pipelines have demonstrated practical near-lossless INT4 weight-only quantization:
- Per-group quantization: Per-column or blockwise scales (B=64 or 128) stored as FP16, int4 codes packed at 2 values/byte (Kim et al., 2023).
- Fused GPU kernels: on-the-fly dequantization feeding tensor-core GEMM, so FP16/BF16 activations are multiplied against int4 weights without ever materializing a full-precision weight copy (Kim et al., 2023); see the sketch after this list.
- No retraining: Approaches such as FineQuant and COMQ function without fine-tuning or even calibration data, though minimal data can further improve accuracy (Kim et al., 2023, Zhang et al., 2024).
- Hardware support: Modern GPU inference stacks (e.g., on NVIDIA H100) execute int4-weight, fp16-activation GEMMs efficiently via fused dequantize-then-tensor-core kernels. Memory savings are 4× vs. FP16 for weights, with up to 3.65× throughput improvement reported (Kim et al., 2023, Kurtic et al., 2024).
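Conceptually, a fused weight-only kernel packs two int4 codes per byte, rescales per group, and multiplies against FP activations in one pass; the NumPy sketch below emulates those steps end to end (real kernels do this per tile in registers, and all function names here are illustrative).

```python
import numpy as np

def pack_int4(codes: np.ndarray) -> np.ndarray:
    """Pack signed int4 codes in [-8, 7] two per byte (low nibble first);
    the last dimension must have even length."""
    u = (codes.astype(np.int8) & 0xF).astype(np.uint8)       # two's-complement nibbles
    return (u[..., 0::2] | (u[..., 1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover signed int4 codes."""
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    codes = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    codes[..., 0::2], codes[..., 1::2] = lo, hi
    return np.where(codes >= 8, codes - 16, codes)           # sign-extend nibbles

def int4_weight_matmul(x, packed_w, scales, group_size=128):
    """Emulate a fused weight-only INT4 GEMM: unpack, rescale per group, matmul.
    x: [batch, in] FP activations, packed_w: [out, in//2] uint8,
    scales: [out, in//group_size] per-group scales."""
    codes = unpack_int4(packed_w).astype(np.float32)         # [out, in]
    out_dim, in_dim = codes.shape
    w = (codes.reshape(out_dim, in_dim // group_size, group_size)
         * scales[..., None].astype(np.float32)).reshape(out_dim, in_dim)
    return x @ w.T
```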
Typical end-to-end steps:
- Extract static weights from pretrained model.
- Determine scaling factor(s) (max-abs or MSE-optimal).
- Quantize to int4 via rounding/clipping.
- Store packed codes.
- Inference via fused kernel and optional bias correction.
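For the optional bias-correction step, a standard post-training recipe is to fold the mean output error introduced by weight quantization into the layer bias; the sketch below follows that generic recipe rather than any specific cited pipeline.

```python
import numpy as np

def bias_correction(x_calib: np.ndarray, w_fp: np.ndarray,
                    w_deq: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Fold the mean output error caused by weight quantization into the bias.
    x_calib: [n, in] calibration activations, w_fp / w_deq: [out, in]
    original and dequantized weights, bias: [out]."""
    err = x_calib @ (w_fp - w_deq).T        # [n, out] per-sample output error
    return bias + err.mean(axis=0)          # corrected bias
```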
5. Empirical Results and Model Recovery
Recent empirical studies present a consistent pattern: INT4 weight-only quantization can achieve accuracy recovery ≈98–99% relative to full precision (FP16/BF16) across a range of model scales and tasks.
Representative empirical data:
| Model / Task | FP16 baseline | INT4 result | Recovery / Notes |
|---|---|---|---|
| Llama-3.1-8B (V1) | — | — | 98.7% accuracy recovery (GPTQ) |
| Llama-3.1-70B (V1) | — | — | 99.5% accuracy recovery (GPTQ) |
| OPT-30B (WikiText PPL) | 10.70 | 10.78 (GPTQ) | Class I |
| OPT-175B (PPL) | 9.08 | 9.84 (B=64) | 26% of FP16 size, <0.8 PPL gap |
| ResNet-18 (Top-1 %) | 69.7 | 60.3 (LAPQ) | — |
| ViT-B/16 (Top-1 %) | 84.53 | 83.86 (COMQ) | Δ = −0.67% |
Adding low-rank error compensation (LoRC, rank 4–8) can entirely close the accuracy gap for many tasks, at a parameter overhead below 0.5% and negligible runtime cost (Yao et al., 2023, Wu et al., 2023).
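At a high level, low-rank compensation can be sketched as a truncated SVD of the weight-space quantization residual; the rank and function names below are illustrative.

```python
import numpy as np

def low_rank_compensation(w_fp: np.ndarray, w_deq: np.ndarray, rank: int = 8):
    """Approximate the quantization residual E = w_fp - w_deq by a rank-r
    factorization U @ V, so inference uses w_deq + U @ V instead of w_deq."""
    err = w_fp - w_deq
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    U = u[:, :rank] * s[:rank]              # [out, r]
    V = vt[:rank, :]                        # [r, in]
    return U, V

# Usage: y = x @ (w_deq + U @ V).T, or fused as x @ w_deq.T + (x @ V.T) @ U.T
```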
6. Format Comparisons, Mixed Quantizers, and Guidelines
INT4 vs FP4:
INT4 and FP4 formats show complementary performance. INT4 is preferred for uniform, small-range weights, while FP4 is better for layers with outlier-heavy distributions. Mixture-of-Formats approaches (MoFQ), which select INT4 or FP4 per layer based on MSE, match or slightly exceed the best single-format baselines with negligible overhead (Zhang et al., 2023, Wu et al., 2023).
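Per-layer format selection can be sketched as follows: quantize the layer's weights with both an INT4 grid and a simple FP4 (E2M1) grid, and keep whichever yields lower weight MSE. The FP4 value set and max-abs scaling below are assumptions for illustration, not the exact recipe of the cited work.

```python
import numpy as np

INT4_GRID = np.arange(-8, 8, dtype=np.float32)
_FP4_POS = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float32)   # E2M1 magnitudes
FP4_GRID = np.concatenate([-_FP4_POS[::-1], _FP4_POS])

def snap_to_grid(w: np.ndarray, grid: np.ndarray, max_code: float) -> np.ndarray:
    """Scale w so its max magnitude maps to the grid's largest code, then
    round every weight to the nearest grid point (returns dequantized values)."""
    s = np.abs(w).max() / max_code
    s = s if s > 0 else 1.0
    idx = np.abs(w[..., None] / s - grid).argmin(axis=-1)
    return grid[idx] * s

def choose_format(w: np.ndarray) -> str:
    """Pick INT4 or FP4 for one layer based on weight-space MSE."""
    mse_int4 = np.mean((w - snap_to_grid(w, INT4_GRID, 7.0)) ** 2)
    mse_fp4 = np.mean((w - snap_to_grid(w, FP4_GRID, 6.0)) ** 2)
    return "int4" if mse_int4 <= mse_fp4 else "fp4"
```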
Block size and granularity:
Per-column INT4 may be sufficient for most matrices, but “catastrophic” collapse can occur in rare cases. Adaptive block splitting (α≈0.8–0.9) and block size B=64 effectively prevent these failures (Kim et al., 2023).
Deployment:
- INT4 weight-only is the top choice for synchronous, latency-sensitive workloads (chat APIs, per-query decode), providing up to 2–7× better cost-efficiency than FP8/INT8 in these regimes (Kurtic et al., 2024).
- For large-batch asynchronous tasks, 8-bit (INT8 or FP8) may outperform INT4, depending on hardware.
- Calibration: 256–512 high-quality tokens suffice for GPTQ-based pipelines.
- Kernel support: Ensure inference engines provide optimized INT4 support for best performance.
7. Best Practices and Future Directions
Best Practices:
- Use per-group (blockwise, B∼64–256) symmetric quantization for most LLMs.
- Apply loss-aware or Hessian-aware optimization (GPTQ/LAPQ) when accuracy must be maximized.
- Monitor for catastrophic failures with per-column-only INT4, especially on matrices with large outliers. Adapt block granularity as needed.
- Consider mixture quantization (MoFQ) if both INT4 and FP4 are natively supported on deployment hardware.
- Employ low-rank correction if ultra-high fidelity is needed and a tiny parameter overhead (<0.5%) is tolerable.
Research Directions:
- Further improving loss-minimizing quantization in the non-separable regime characteristic of INT4.
- Enhanced hardware support for fine-grained INT4 kernels and mixed-format inference.
- Unification of quantization strategies across both dense and MoE architectures.
- Exploration of dynamic quantization (e.g., at runtime based on input distribution) for further gains.
References: Nahshan et al. (2019); Zhang et al. (2023); Kim et al. (2023); Yao et al. (2023); Wu et al. (2023); Kurtic et al. (2024); Zhang et al. (2024).