
INT4 Weight-Only Quantization

Updated 2 April 2026
  • INT4 weight-only quantization is a technique that converts full-precision weights into 4-bit integers while retaining higher-precision activations, facilitating significant model compression.
  • It leverages symmetric and asymmetric quantization methods along with optimization algorithms like LAPQ, COMQ, and GPTQ to minimize quantization noise.
  • Practical implementations demonstrate near-baseline accuracy with reduced memory footprint and accelerated inference in tasks involving large language and vision models.

INT4 weight-only quantization refers to the post-training reduction of neural network weight tensors to 4-bit integer representations, with activations retained in higher-precision formats (typically FP16/BF16). This approach enables aggressive model compression and bandwidth reduction, facilitating deployment and inference acceleration, particularly in LLMs and vision architectures. Recent research demonstrates that, when combined with advanced calibration and compensation schemes, INT4 weight-only quantization can preserve nearly all baseline accuracy across a wide range of architectures, while incurring minimal computational and integration overhead.

1. Quantizer Formulation and Mathematical Framework

Weight-only quantization maps each weight tensor W of a neural network to a grid of signed 4-bit integer values using a uniform quantizer. Let w denote a full-precision (FP32/FP16) weight, and q its quantized representation.

Symmetric per-channel (or per-group) and asymmetric per-row schemes are both common:

  • Symmetric quantization (per-channel):

\Delta_c = \frac{\max_{i} |W_{c,i}|}{2^{3}-1}

q = \text{clip}\left(\text{round}\left(\frac{w}{\Delta_c}\right), -7, 7 \right)

w' = q \, \Delta_c

Here, W \in \mathbb{R}^{C \times K}, and c indexes an output channel (Zhang et al., 2023).

  • Asymmetric quantization (per-row, unsigned):

S_i = \frac{w_{\max}^{(i)} - w_{\min}^{(i)}}{15}, \quad Z_i = w_{\min}^{(i)}

q_{ij} = \text{clip}\left(\text{round}\left(\frac{w_{ij} - Z_i}{S_i}\right), 0, 15\right)

w'_{ij} = q_{ij} S_i + Z_i

(Yao et al., 2023).
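These two quantizers can be sketched directly in NumPy (a minimal illustration; production kernels operate on packed codes with FP16 scales, and the shapes below are toy):

```python
import numpy as np

def quant_symmetric_per_channel(W):
    # Delta_c = max_i |W[c, i]| / (2^3 - 1): one scale per output channel
    delta = np.abs(W).max(axis=1, keepdims=True) / 7.0
    delta = np.where(delta == 0, 1.0, delta)          # guard all-zero channels
    q = np.clip(np.round(W / delta), -7, 7).astype(np.int8)
    return q, delta

def dequant_symmetric(q, delta):
    return q.astype(np.float32) * delta               # w' = q * Delta_c

def quant_asymmetric_per_row(W):
    # S_i = (w_max - w_min) / 15, Z_i = w_min: unsigned 4-bit codes
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    S = np.where(w_max > w_min, (w_max - w_min) / 15.0, 1.0)
    q = np.clip(np.round((W - w_min) / S), 0, 15).astype(np.uint8)
    return q, S, w_min

def dequant_asymmetric(q, S, Z):
    return q.astype(np.float32) * S + Z               # w' = q * S + Z

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16)).astype(np.float32)
q, delta = quant_symmetric_per_channel(W)
W_hat = dequant_symmetric(q, delta)                   # error <= Delta_c / 2
```

Because rounding error is at most 0.5 code units, each reconstructed weight deviates from the original by at most half the channel's (or row's) scale.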

Many practical systems use per-group scales, where blocks of weights (e.g., 64–256 elements) share a single scale parameter, balancing accuracy against hardware efficiency (Kim et al., 2023, Kurtic et al., 2024).
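A per-group variant can be written by reshaping each row into fixed-size groups (a sketch assuming the row length divides evenly by the group size; real systems pad):

```python
import numpy as np

def quant_symmetric_per_group(W, G=64):
    # Each block of G consecutive weights in a row shares one scale;
    # real kernels store these scales in FP16 alongside packed codes.
    C, K = W.shape
    assert K % G == 0, "pad K to a multiple of G in practice"
    Wg = W.reshape(C, K // G, G)
    delta = np.abs(Wg).max(axis=-1, keepdims=True) / 7.0
    delta = np.where(delta == 0, 1.0, delta)
    q = np.clip(np.round(Wg / delta), -7, 7).astype(np.int8)
    return q.reshape(C, K), delta                     # delta: (C, K // G, 1)

def dequant_per_group(q, delta, G=64):
    C, K = q.shape
    return (q.reshape(C, K // G, G).astype(np.float32) * delta).reshape(C, K)
```

Smaller groups localize the effect of outliers: reconstruction error is bounded by each group's own Δ/2 rather than the whole channel's.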

2. Optimization Algorithms and Calibration Procedures

The reduction in representational bandwidth at INT4 introduces significant quantization noise. Addressing this requires loss-minimizing parameter search rather than naive range-based quantization.

1. Powell's Joint Optimization (LAPQ):

The quantizer step sizes \{\Delta_\ell\} across layers are optimized by directly minimizing the post-quantization loss, \min_{\{\Delta_\ell\}} \mathcal{L}(\{\Delta_\ell\}), where \mathcal{L} is averaged over a small calibration set. LAPQ constructs a trajectory of L_p-norm-optimal scales, fits a quadratic, then applies joint Powell's method for global minimization (Nahshan et al., 2019).
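The loss-aware idea can be illustrated with a toy single-layer search: instead of the naive max-abs scale, candidate scales are scored on a calibration batch and the loss minimizer is kept (a grid search stand-in for LAPQ's quadratic fit and Powell refinement; all tensors below are synthetic):

```python
import numpy as np

# Toy calibration data: one linear layer, a small batch of inputs
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
X = rng.standard_normal((64, 32)).astype(np.float32)

def fake_quant(W, delta):
    # Simulated symmetric INT4 quantize-dequantize with a single scale
    return np.clip(np.round(W / delta), -7, 7) * delta

def calib_loss(delta):
    # Layer-output MSE over the calibration batch (a stand-in for the
    # task loss that LAPQ actually minimizes)
    return float(((W @ X - fake_quant(W, delta) @ X) ** 2).mean())

# Evaluate candidates around the naive max-abs scale and keep the best
delta_naive = float(np.abs(W).max()) / 7.0
candidates = np.concatenate([[delta_naive],
                             delta_naive * np.linspace(0.4, 1.2, 33)])
delta_star = float(candidates[np.argmin([calib_loss(d) for d in candidates])])
```

The loss-optimal scale is typically smaller than the max-abs scale: clipping a few outliers buys finer resolution for the bulk of the weights.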

2. Error-Minimizing Coordinate Descent (COMQ):

For each column (per-channel) of W, coordinate-wise updates alternately optimize the 4-bit codes q and the scaling factor s in closed form:

s = \frac{\langle w, q \rangle}{\|q\|_2^2}, \quad q_i = \text{clip}\left(\text{round}\left(\frac{w_i}{s}\right), -7, 7\right)

Greedy coordinate update order on large-magnitude weights accelerates convergence (Zhang et al., 2024).
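A NumPy sketch of the alternating closed-form updates for a single column (simplified: it minimizes weight-space error with a fixed update order, whereas COMQ minimizes layer-output error and visits coordinates in a greedy large-magnitude-first order):

```python
import numpy as np

def comq_column(w, iters=20):
    # Alternate: optimal codes given the scale, then the least-squares
    # scale given the codes. Each step is non-increasing in error.
    s = max(float(np.abs(w).max()) / 7.0, 1e-12)      # max-abs init
    q = np.clip(np.round(w / s), -7, 7)
    for _ in range(iters):
        q = np.clip(np.round(w / s), -7, 7)           # q_i = clip(round(w_i/s))
        denom = float(q @ q)
        if denom == 0.0:
            break
        s = float(w @ q) / denom                      # s = <w, q> / ||q||^2
    return q, s
```

Since both sub-steps are exact minimizers of ||w − s·q||², the alternation can never do worse than the naive max-abs quantizer it starts from.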

3. Second-Order, Hessian-Aware Quantization (GPTQ):

Weights are grouped (e.g., G=128) and quantized with blockwise MSE minimization, leveraging second-order (Hessian) information of the layerwise reconstruction objective, computed over a small calibration set (Kurtic et al., 2024, Yao et al., 2023).

4. Fine-Grained Adaptive Quantization (FineQuant):

Columns are split into blocks (typically B=64), adaptively refining group granularity when dynamic ranges shrink excessively (threshold α≈0.8–0.9) (Kim et al., 2023).
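One way to sketch the adaptive-granularity idea (the exact FineQuant splitting criterion differs in detail; the halving rule, ratio test, and minimum block size here are assumptions for illustration):

```python
import numpy as np

def adaptive_blocks(col, B=64, alpha=0.85):
    # Start from blocks of size B; recursively halve a block when one of
    # its halves has a much smaller dynamic range than the block itself
    # (ratio below alpha), so outliers stop inflating their neighbors' scale.
    blocks = []

    def refine(seg):
        if len(seg) <= 8:                      # smallest granularity here
            blocks.append(seg)
            return
        half = len(seg) // 2
        parent = np.abs(seg).max()
        lo = np.abs(seg[:half]).max()
        hi = np.abs(seg[half:]).max()
        if parent > 0 and min(lo, hi) / parent < alpha:
            refine(seg[:half])
            refine(seg[half:])
        else:
            blocks.append(seg)

    for i in range(0, len(col), B):
        refine(col[i:i + B])
    return blocks
```

A column containing a single large outlier ends up with finer blocks around it, so the remaining weights keep a tight scale of their own.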

3. Loss Landscape and Non-Separability at Low Bitwidth

Empirical and theoretical analyses show that INT4 quantization noise induces pronounced layer-to-layer coupling, rendering the loss landscape non-separable. A second-order Taylor expansion quantifies this:

\Delta \mathcal{L} \approx \sum_{\ell} \bar{g}_\ell^{\top} \epsilon_\ell + \frac{1}{2} \sum_{\ell, m} \epsilon_\ell^{\top} \bar{H}_{\ell m}\, \epsilon_m

where \epsilon_\ell denotes the quantization perturbation of layer \ell and the cross-blocks \bar{H}_{\ell m} (\ell \neq m) capture inter-layer coupling. For INT8, the quadratic coupling term is negligible; at INT4, high curvature and sharp valleys necessitate multi-layer (joint) optimization (Nahshan et al., 2019). Gaussian curvature metrics measured at 4-bit and 2-bit on ResNet-18 confirm this regime shift, justifying the more sophisticated search methods employed in modern PTQ frameworks.

4. Practical Algorithms, Hardware Implementation, and Inference

A variety of pipelines have demonstrated practical near-lossless INT4 weight-only quantization:

  • Per-group quantization: Per-column or blockwise scales (B=64 or 128) stored as FP16, int4 codes packed at 2 values/byte (Kim et al., 2023).
  • Fused GPU kernels: On-the-fly dequantization with tensor-core GEMM—matrix multiplication accumulates activations (FP16/BF16) with int4 weights, avoiding explicit full-precision materialization (Kim et al., 2023).
  • No retraining: Approaches such as FineQuant and COMQ function without fine-tuning or even calibration data, though minimal data can further improve accuracy (Kim et al., 2023, Zhang et al., 2024).
  • Hardware support: Modern accelerators (e.g., NVIDIA H100) support INT4 × FP16 GEMMs via fused dequantization kernels. Memory savings are 4× vs. FP16 for weights, with up to 3.65× throughput improvement observed in the best cases (Kim et al., 2023, Kurtic et al., 2024).

Typical end-to-end steps:

  1. Extract static weights from the pretrained model.
  2. Determine scaling factor(s) (max-abs or MSE-optimal).
  3. Quantize to int4 via rounding/clipping.
  4. Store packed codes.
  5. Inference via fused kernel and optional bias correction.
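The steps above can be emulated end to end in NumPy, including packing two INT4 codes per byte; the "fused" dequant-GEMM is simulated here by dequantizing before the matmul, which real kernels do tile-by-tile inside the GEMM:

```python
import numpy as np

def pack_int4(q):
    # Bias signed codes [-7, 7] to unsigned [1, 15], then pack 2 per byte
    u = (q.astype(np.int16) + 8).astype(np.uint8).reshape(-1)
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed, shape):
    lo = (packed & 0x0F).astype(np.int16) - 8
    hi = (packed >> 4).astype(np.int16) - 8
    out = np.empty(packed.size * 2, dtype=np.int16)
    out[0::2], out[1::2] = lo, hi
    return out.reshape(shape)

# Steps 1-5 on a toy layer
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32)).astype(np.float32)      # 1. static weights
X = rng.standard_normal((32, 4)).astype(np.float32)
delta = np.abs(W).max(axis=1, keepdims=True) / 7.0       # 2. max-abs scales
q = np.clip(np.round(W / delta), -7, 7).astype(np.int8)  # 3. round/clip
packed = pack_int4(q)                                    # 4. 2 codes per byte
W_hat = unpack_int4(packed, W.shape).astype(np.float32) * delta
Y = W_hat @ X                                            # 5. dequant + GEMM
```

The packed buffer holds exactly half a byte per weight, which is where the 4× memory saving over FP16 comes from.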

5. Empirical Results and Model Recovery

Recent empirical studies present a consistent pattern: INT4 weight-only quantization can achieve accuracy recovery ≈98–99% relative to full precision (FP16/BF16) across a range of model scales and tasks.

Representative empirical data:

| Model / Task | FP16 | INT4 (method) | Recovery / Notes |
|---|---|---|---|
| Llama-3.1-8B (V1) | — | GPTQ | 98.7% recovery |
| Llama-3.1-70B (V1) | — | GPTQ | 99.5% recovery |
| OPT-30B (WikiText, PPL) | 10.70 | 10.78 (GPTQ) | Class I |
| OPT-175B (PPL) | 9.08 | 9.84 (GPTQ, B=64) | 26% of FP16 size, <0.8 PPL gap |
| ResNet-18 (Top-1 %) | 69.7 | 60.3 (LAPQ) | — |
| ViT-B/16 (Top-1 %) | 84.53 | 83.86 (COMQ) | Δ = −0.67% |

Adding low-rank error compensation (LoRC, rank 4–8) can entirely close the accuracy gap for many tasks, with <0.5% parameter overhead and negligible runtime increase (Yao et al., 2023, Wu et al., 2023).
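The LoRC idea reduces to a truncated SVD of the quantization error (a sketch; the rank and the toy 64×64 shapes are illustrative, so the relative parameter overhead here is much larger than for realistic layer sizes):

```python
import numpy as np

def lorc_correction(W, W_hat, rank=8):
    # Truncated SVD of the quantization error E = W - W_hat; the two
    # factors A (C x r) and B (r x K) are kept in higher precision and
    # added back at inference: W_hat + A @ B.
    U, s, Vt = np.linalg.svd(W - W_hat, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
delta = np.abs(W).max(axis=1, keepdims=True) / 7.0
W_hat = np.clip(np.round(W / delta), -7, 7) * delta      # plain INT4 dequant
A, B = lorc_correction(W, W_hat, rank=8)
err_before = float(np.linalg.norm(W - W_hat))
err_after = float(np.linalg.norm(W - (W_hat + A @ B)))   # never larger
```

By the Eckart–Young theorem, A @ B is the best rank-8 approximation of the error, so the corrected weights are never worse in Frobenius norm; the extra parameter count is r(C + K) per layer.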

6. Format Comparisons, Mixed Quantizers, and Guidelines

INT4 vs FP4:

INT4 and FP4 formats show complementary performance. INT4 is preferred for uniform, small-range weights, while FP4 is better for layers with outlier-heavy distributions. Mixture-of-Formats approaches (MoFQ)—which select INT4 or FP4 per layer based on MSE—match or slightly exceed the best single-format baselines, with negligible overhead and SOTA efficiency (Zhang et al., 2023, Wu et al., 2023).

Block size and granularity:

Per-column INT4 may be sufficient for most matrices, but “catastrophic” collapse can occur in rare cases. Adaptive block splitting (α≈0.8–0.9) and block size B=64 effectively prevent these failures (Kim et al., 2023).

Deployment:

  • INT4 weight-only is the top choice for synchronous, latency-sensitive workloads (chat APIs, per-query decode), providing up to 2–7× cost-efficiency compared to FP8/INT8 (Kurtic et al., 2024).
  • For large-batch asynchronous tasks, 8-bit (INT8 or FP8) may outperform INT4, depending on hardware.
  • Calibration: 256–512 high-quality tokens suffice for GPTQ-based pipelines.
  • Kernel support: Ensure inference engines provide optimized INT4 support for best performance.

7. Best Practices and Future Directions

Best Practices:

  • Use per-group (blockwise, B∼64–256) symmetric quantization for most LLMs.
  • Apply loss-aware or Hessian-aware optimization (GPTQ/LAPQ) when accuracy must be maximized.
  • Monitor for catastrophic failures with per-column-only INT4, especially on matrices with large outliers. Adapt block granularity as needed.
  • Consider mixture quantization (MoFQ) if both INT4 and FP4 are natively supported on deployment hardware.
  • Employ low-rank correction if ultra-high fidelity is needed and a tiny parameter overhead (<0.5%) is tolerable.

Research Directions:

  • Further improving loss-minimizing quantization in the non-separable regime characteristic of INT4.
  • Enhanced hardware support for fine-grained INT4 kernels and mixed-format inference.
  • Unification of quantization strategies across both dense and MoE architectures.
  • Exploration of dynamic quantization (e.g., at runtime based on input distribution) for further gains.

References: (Nahshan et al., 2019, Zhang et al., 2023, Yao et al., 2023, Wu et al., 2023, Kurtic et al., 2024, Zhang et al., 2024, Kim et al., 2023)
