INT4 Quantization in Deep Learning
- INT4 quantization is a low-bit representation technique that maps floating-point values to 16 discrete levels using 4-bit integers to balance efficiency and accuracy.
- Advanced post-training and quantization-aware methods calibrate scaling factors, reduce quantization error, and maintain model performance for LLMs and vision applications.
- Techniques such as group-wise scaling, outlier smoothing, and block-wise optimization enable robust inference on edge hardware with significant memory and throughput gains.
INT4 quantization refers to the representation and computation of neural network tensors (weights, activations, gradients, or optimizer states) using 4-bit signed or unsigned integer formats. This aggressive quantization regime enables substantial reductions in memory usage, throughput improvements, and hardware simplification at the cost of limited dynamic range and increased quantization error. Within the deep learning systems literature, INT4 quantization is deployed for model inference and, increasingly, for training or adaptation. Advanced post-training and quantization-aware training methodologies, architectural innovations, and calibration protocols have been developed to mitigate the quantization-induced accuracy loss, broadening the applicability of INT4 quantization to LLMs, vision models, and edge scenarios.
1. Mathematical Formulations and Core Algorithms
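The foundation of INT4 quantization is uniform linear mapping of floating-point values to the nearest element in a discrete set of 16 levels, typically $\{-8, \dots, 7\}$ (symmetric) or $\{0, \dots, 15\}$ (unsigned/asymmetric), with an associated scale $s$ and an optional zero-point $z$. The quantization function is: $q = \mathrm{clip}(\mathrm{round}(x/s),\, -8,\, 7)$, with dequantized value $\hat{x} = s\,q$. For affine quantizers, $q = \mathrm{clip}(\mathrm{round}(x/s) + z,\, 0,\, 15)$, with $\hat{x} = s\,(q - z)$.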
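The mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the max-abs choice of scale, and the min/max choice of zero-point are assumptions made for the example.

```python
import numpy as np

def quantize_symmetric(x, scale):
    """Map floats to signed INT4 codes in [-8, 7]."""
    return np.clip(np.round(x / scale), -8, 7).astype(np.int8)

def quantize_affine(x, scale, zero_point):
    """Map floats to unsigned INT4 codes in [0, 15]."""
    return np.clip(np.round(x / scale) + zero_point, 0, 15).astype(np.uint8)

def dequantize(q, scale, zero_point=0):
    """Recover approximate floats from integer codes."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, -0.26, 0.0, 0.4, 0.7], dtype=np.float32)

# Symmetric: scale chosen so the largest magnitude maps to code 7 (assumption).
s_sym = float(np.abs(x).max()) / 7.0
q_sym = quantize_symmetric(x, s_sym)
x_sym = dequantize(q_sym, s_sym)

# Affine: scale and zero-point chosen from the observed min/max (assumption).
s_aff = float(x.max() - x.min()) / 15.0
z_aff = int(np.round(-float(x.min()) / s_aff))
q_aff = quantize_affine(x, s_aff, z_aff)
x_aff = dequantize(q_aff, s_aff, z_aff)
```

Because no value is clipped in this example, the reconstruction error of each element is bounded by half a quantization step in both variants.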
Post-training quantization (PTQ) typically estimates appropriate scale and (if asymmetric) zero-point per tensor, per channel, or per block. For group-wise quantization (group size $G = 128$ is standard), the scaling factor for group $g$ is: $s_g = \max_{i \in g} |w_i| / 7$, with quantization performed independently on each group. Block-wise PTQ, as in FlexRound or PolarQuant, further reduces quantization error by coupling weight matrices and input activations over blocks, optimizing quantization parameters via calibration batches: $\min_{\theta} \lVert W X - \hat{W}(\theta)\, X \rVert_F^2$, where $\theta$ denotes the quantization parameters and $X$ are cached activations.
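To make the group-wise scheme concrete, the sketch below quantizes a weight matrix with one max-abs scale per group of 128 consecutive weights and compares the error with a single per-tensor scale. The helper names and the synthetic weights (including the injected outlier) are assumptions for illustration only.

```python
import numpy as np

def quantize_groupwise(w, group_size=128):
    """Symmetric INT4 quantization with one scale per group of consecutive weights."""
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must be divisible by the group size"
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero groups
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_groupwise(codes, scales, shape):
    """Multiply each group of codes by its scale and restore the original shape."""
    return (codes.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 512)).astype(np.float32)
w[0, 5] = 25.0  # a single outlier weight (synthetic)

codes, scales = quantize_groupwise(w, group_size=128)
w_hat = dequantize_groupwise(codes, scales, w.shape)
err_group = float(np.mean((w - w_hat) ** 2))

# Per-tensor baseline: one scale for the whole matrix.
s_tensor = float(np.abs(w).max()) / 7.0
w_tensor = np.clip(np.round(w / s_tensor), -8, 7) * s_tensor
err_tensor = float(np.mean((w - w_tensor) ** 2))
```

With a per-tensor scale the outlier stretches the step size for every weight, whereas with group-wise scales only the group containing the outlier is affected, so `err_group` comes out well below `err_tensor` on this input.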
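Quantization-aware training (QAT) integrates fake-quantize operations as differentiable (via STE) surrogates in the computation graph: $y = Q(W)\, Q(x)$, where $Q$ denotes quantization modules inserted on weights/activations, and gradients are propagated through the discretization step with the straight-through estimator (STE), which treats $\partial Q(x) / \partial x$ as 1 inside the clipping range.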
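A toy example may help show what the fake-quantize forward pass and the straight-through backward pass look like. The sketch below writes both by hand in NumPy for a linear regression problem; the fixed scale, the learning rate, the target weights, and the choice to zero the gradient outside the clipping range are all assumptions of this example rather than details taken from the cited methods.

```python
import numpy as np

def fake_quant(x, scale):
    """Forward pass: quantize to the INT4 grid and dequantize immediately."""
    return np.clip(np.round(x / scale), -8, 7) * scale

def fake_quant_grad(x, scale, grad_out):
    """STE backward pass: identity gradient inside the clipping range, zero outside."""
    inside = (x / scale >= -8) & (x / scale <= 7)
    return grad_out * inside

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
w_true = np.array([1.3, -0.7, 2.1, 0.4, -1.6, 0.9, -2.4, 0.2])  # synthetic target
y = x @ w_true

w = np.zeros(8)   # latent full-precision weights updated by the optimizer
scale = 0.5       # fixed step size (assumption; often calibrated or learned)
lr = 0.05
losses = []
for step in range(200):
    w_q = fake_quant(w, scale)                 # the model only ever sees quantized weights
    residual = x @ w_q - y
    losses.append(float(np.mean(residual ** 2)))
    grad_wq = 2.0 * x.T @ residual / len(x)    # gradient with respect to w_q
    w -= lr * fake_quant_grad(w, scale, grad_wq)  # STE: reuse it as the gradient for w
```

The latent weights drift until their quantized versions land on grid points near the target, so the loss falls even though the rounding step itself has zero gradient almost everywhere.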
Advanced schemes address the limited expressivity of INT4 using:
- Lloyd–Max quantizers tailored to weight (Gaussian) distributions, as in PolarQuant (Vicentino, 30 Mar 2026)
- Layer- or group-wise calibrated quantization, using activation-aware scaling (AWQ) or second-order error compensation (GPTQ) (Mekala et al., 26 May 2025)
- Channel- and block-wise scaling, outlier suppression, and Hadamard or other rotations to spread quantization distortion (Yi et al., 2024, Vicentino, 30 Mar 2026, 2505.20839)
- Matryoshka bit slicing (MatQuant), permitting bit-width-adaptive deployment from a single checkpoint (Nair et al., 10 Feb 2025)
2. Representative INT4 Quantization Pipelines and Adaptations
2.1 Weight-Only Quantization
Weight-only INT4 (W4A16) is widely adopted in LLM deployment because weights are less sensitive to quantization than activations. The workflow is:
- Partition each weight tensor into groups (e.g., size 128).
- For each group: estimate the scaling parameter based on group max-norm or MSE-optimal clipping (Kurtic et al., 2024).
- Map FP32/BF16 weights to 4-bit signed integers; store scales per group for dequantization during inference.
- Activations typically remain in 16-bit (or 8-bit) for resilience to input and hardware outliers.
Accuracy can be within 1–2% of full precision given a suitable group size, representative calibration data, and advanced correction steps (e.g., GPTQ's second-order error compensation) (Kurtic et al., 2024).
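The storage side of this workflow can be sketched as follows: two signed 4-bit codes are packed into each byte, and at inference time they are unpacked, rescaled, and multiplied with 16-bit activations. The nibble order, the one-group-per-row layout, and the function names are assumptions for this example; real kernels use hardware-specific layouts and fuse dequantization into the matrix multiply.

```python
import numpy as np

def pack_int4(codes):
    """Pack signed INT4 codes (values in [-8, 7]) two per byte, low nibble first."""
    flat = codes.astype(np.int8).reshape(-1)
    assert flat.size % 2 == 0, "need an even number of codes"
    nibbles = (flat & 0x0F).astype(np.uint8)      # two's-complement 4-bit pattern
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4(packed, shape):
    """Inverse of pack_int4: split nibbles and sign-extend them back to int8."""
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0x0F).astype(np.int8)
    out[1::2] = (packed >> 4).astype(np.int8)
    out = np.where(out > 7, out - 16, out).astype(np.int8)  # patterns 8..15 mean -8..-1
    return out.reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 128)).astype(np.float32)       # one group of 128 per row
scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
codes = np.clip(np.round(w / scales), -8, 7).astype(np.int8)

packed = pack_int4(codes)                               # 4 bits per weight in storage
restored = unpack_int4(packed, codes.shape)

x = rng.normal(size=(2, 128)).astype(np.float16)        # activations stay in 16-bit
w_deq = (restored.astype(np.float32) * scales).astype(np.float16)
y = x.astype(np.float32) @ w_deq.astype(np.float32).T   # accumulate in float32
```

Packing is lossless with respect to the integer codes, so the only approximation in `y` comes from the quantization step and the 16-bit casts.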
2.2 Weight-Activation INT4 (W4A4) and Activation Smoothing
Full INT4 (W4A4) quantizes both weights and activations, offering the largest speedups and footprint reductions. Key methods:
- Symmetric or asymmetric quantization with per-channel scales for weights; per-tensor scale for activations if sufficiently unskewed (Wu et al., 2023).
- Training-free activation “smoothing” deals with channel-wise and spike outliers using methods such as Rotated Runtime Smooth (Hadamard transformation + group-wise normalization) for near-lossless performance (Yi et al., 2024).
- Mixed-precision schemes (e.g., INT4 weights, FP8 activations) permit full tensor-core acceleration without the severe accuracy penalty of W4A4 in attention layers (2505.20839, Zhang et al., 2024).
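The effect of a Hadamard rotation on activation outliers can be illustrated with a small experiment. The sketch below is not the Rotated Runtime Smooth algorithm itself: it only applies an orthonormal Hadamard matrix (Sylvester construction) to activations with one synthetic outlier channel, folds the inverse rotation into the weights, and quantizes the activations per token while keeping the weights in full precision to isolate the activation effect. All names and data are assumptions for the example.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def quant_int4_per_token(x):
    """Symmetric INT4 fake quantization with one max-abs scale per row (token)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(16, d))
x[:, 3] *= 50.0                    # synthetic channel-wise outlier
w = rng.normal(size=(d, d))

h = hadamard(d)
x_rot = x @ h                      # rotate activations
w_rot = h.T @ w                    # fold the inverse rotation into the weights

ref = x @ w                        # (x h)(h^T w) equals x w because h is orthogonal
err_plain = float(np.linalg.norm(quant_int4_per_token(x) @ w - ref))
err_rot = float(np.linalg.norm(quant_int4_per_token(x_rot) @ w_rot - ref))
```

The rotation spreads the outlier channel's energy across all 64 channels, so the per-token scale shrinks and the remaining channels are no longer rounded away, which is why `err_rot` is smaller than `err_plain` here.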
2.3 Progressive, Block-wise, and Function-Preserving INT4
Block-wise PTQ extends traditional PTQ from individual tensors to blocks (layer groups) for joint optimization, reducing cumulative error. Approaches:
- Define blocks (e.g., all projections in a sub-layer).
- Minimize block-level output error on calibration samples.
- Store per-channel scales; quantization targets MSE, not just static tensor range (Lee et al., 10 Jun 2025).
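The idea of choosing scales by output error rather than by static tensor range can be sketched with a simple grid search. This is a deliberately simplified stand-in: a single linear layer plays the role of a block, and a search over clipping ratios replaces the learned rounding or parameter optimization used by methods such as FlexRound. The function names, the ratio grid, and the synthetic data are assumptions.

```python
import numpy as np

def quant_int4(w, scale):
    """Symmetric INT4 fake quantization; `scale` broadcasts over output columns."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def calibrate_scales(w, x_calib, ratios=None):
    """Pick, per output channel, the clipping ratio minimizing ||x w - x w_hat||^2."""
    if ratios is None:
        ratios = np.linspace(0.5, 1.0, 11)      # 1.0 corresponds to plain max-abs scaling
    base = np.abs(w).max(axis=0) / 7.0           # max-abs scale per output column
    target = x_calib @ w
    best_scale = base.copy()
    best_err = np.full(w.shape[1], np.inf)
    for r in ratios:
        scale = base * r
        err = ((x_calib @ quant_int4(w, scale) - target) ** 2).sum(axis=0)
        improved = err < best_err
        best_scale = np.where(improved, scale, best_scale)
        best_err = np.where(improved, err, best_err)
    return best_scale, best_err

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 16))
w[0, :] *= 8.0                                   # large weights stretch the max-abs range
x_calib = rng.normal(size=(256, 64))             # stands in for cached calibration activations

scale_maxabs = np.abs(w).max(axis=0) / 7.0
err_maxabs = ((x_calib @ quant_int4(w, scale_maxabs) - x_calib @ w) ** 2).sum(axis=0)
scale_best, err_best = calibrate_scales(w, x_calib)
```

Because the grid includes the max-abs ratio, the calibrated error per channel can never exceed the static-range baseline; any improvement comes from trading a little clipping error for a finer step size.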
Function-preserving schemes (e.g., SplitQuantV2) split layers into quantization-friendly substructures using unsupervised clustering, then quantize each sub-layer, thus recovering accuracy for CPU/NPU deployment without calibration or GPUs (Song et al., 7 Mar 2025).
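A rough sketch of the layer-splitting idea is given below. It clusters the weight values of one matrix with a small 1-D k-means, builds three masked sub-matrices whose sum equals the original weights, and quantizes each cluster with its own affine INT4 range. This is an illustration of the principle only: the initialization, the floating-point offset used in place of an integer zero-point, and the treatment of masked positions as exact zeros are assumptions, not details of SplitQuantV2.

```python
import numpy as np

def kmeans_1d(values, k=3, iters=20):
    """Tiny 1-D k-means; returns a cluster label for every value."""
    centers = np.quantile(values, np.linspace(0, 1, k + 2)[1:-1])
    labels = np.zeros(values.shape, dtype=int)
    for _ in range(iters):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels

def affine_int4(v):
    """Affine 16-level fake quantization over the min/max range of `v`."""
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    return np.clip(np.round((v - lo) / scale), 0, 15) * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))
labels = kmeans_1d(w.ravel(), k=3).reshape(w.shape)

# Function-preserving split: the masked sub-layers sum back to the original weights.
parts = [np.where(labels == j, w, 0.0) for j in range(3)]
w_sum = parts[0] + parts[1] + parts[2]

# Quantize each cluster with its own, much narrower, range.
w_split = np.zeros_like(w)
for j in range(3):
    mask = labels == j
    if mask.any():
        w_split[mask] = affine_int4(w[mask])

w_single = affine_int4(w.ravel()).reshape(w.shape)   # one range for the whole layer
err_split = float(np.mean((w - w_split) ** 2))
err_single = float(np.mean((w - w_single) ** 2))
```

Each cluster spans only a fraction of the full value range, so its 16 levels are spaced more finely, and `err_split` ends up far below `err_single` on this synthetic layer.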
2.4 Quantized Adaptation and Training
- Integration with low-rank adaptation: QA-LoRA fuses INT4 group-wise quantized weights with trainable LoRA adapters, enabling integer-only downstream fine-tuning and INT4-only inference (Xu et al., 2023).
- INT4 gradient and optimizer quantization for end-to-end training: Q-GaLore projects gradients into low-rank subspaces using INT4 projection matrices while storing weights in INT8, and maintains accuracy with stochastic rounding (Zhang et al., 2024). Adaptive gradient interval quantization focuses error bound control on large-gradient tails (Kim et al., 2024).
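The role of stochastic rounding in low-bit training can be seen in a small experiment. In the sketch below, many small gradient values share a scale with one large value; round-to-nearest maps all of the small values to zero, while stochastic rounding preserves them in expectation. The helper names and the synthetic gradient are assumptions for illustration and do not reproduce any specific method cited above.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round down or up with probability equal to the fractional part, so E[result] = x."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

def quantize_grad_int4(g, rng):
    """Symmetric INT4 fake quantization of a gradient tensor with stochastic rounding."""
    scale = np.abs(g).max() / 7.0
    return np.clip(stochastic_round(g / scale, rng), -8, 7) * scale

rng = np.random.default_rng(0)
g = np.full(100_000, 0.02)
g[0] = 1.0                       # one large entry fixes the scale at 1/7

scale = np.abs(g).max() / 7.0
nearest = np.clip(np.round(g / scale), -8, 7) * scale
stochastic = quantize_grad_int4(g, rng)

mean_nearest = float(nearest[1:].mean())        # small entries all collapse to zero
mean_stochastic = float(stochastic[1:].mean())  # close to the true value 0.02
```

Individual stochastic results are noisier than nearest rounding, but their average is unbiased, which is what prevents small updates from being silently dropped over many steps.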
2.5 Multi-format and Nested Quantization
Layer-wise selection between INT4 and the alternative FP4 format via per-layer MSE optimization (MoFQ) consistently outperforms pure INT4 PTQ by exploiting the non-uniform distribution of errors across layers (Zhang et al., 2023). Matryoshka Quantization exploits nested bit-structures to train once (INT8 checkpoint) and extract high-quality INT4 models via bit slicing at deployment (Nair et al., 10 Feb 2025).
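The bit-slicing step can be sketched as keeping the four most significant bits of each 8-bit code. The example below assumes unsigned codes, rounding before truncation, and a shared scale and offset between the two precisions; it shows only the extraction step, not the joint training that makes the sliced model accurate in MatQuant.

```python
import numpy as np

def quantize_uint8(w):
    """Affine 8-bit quantization over the min/max range; returns codes, scale, offset."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    codes = np.clip(np.round((w - lo) / scale), 0, 255).astype(np.uint8)
    return codes, scale, lo

def slice_to_int4(codes8):
    """Keep the 4 most significant bits of each 8-bit code, with rounding."""
    return np.clip(np.round(codes8.astype(np.float64) / 16.0), 0, 15).astype(np.uint8)

def dequantize_sliced(codes4, scale, lo):
    """Reuse the 8-bit scale: each 4-bit step is worth 16 of the 8-bit steps."""
    return codes4.astype(np.float64) * 16.0 * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=4096)

codes8, scale, lo = quantize_uint8(w)
codes4 = slice_to_int4(codes8)                 # extracted from the same stored tensor

w8 = codes8.astype(np.float64) * scale + lo
w4 = dequantize_sliced(codes4, scale, lo)
err8 = float(np.mean((w - w8) ** 2))
err4 = float(np.mean((w - w4) ** 2))
```

The 4-bit model reuses the 8-bit tensor and its scale, so a single checkpoint serves both precisions; the sliced version is coarser, as `err4 > err8` shows, which is the gap that nested training is meant to narrow.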
3. Empirical Accuracy, Hardware Performance, and Limitations
| INT4 Pipeline | Typical PPL/Accuracy Gap vs FP | Throughput Gain | Key Notes |
|---|---|---|---|
| W4A16-INT PTQ (GPTQ) | ≤2% main benchmarks (Kurtic et al., 2024) | 2–3x GPU, >3x CPU | Sensitive to group size, calibration set |
| W4A4 (with smoothing) | ≤0.5% encoder/decoder, ≫3% decoder-only (Wu et al., 2023, Yi et al., 2024) | 8.5x (BERT INT4) | Outlier smoothing essential; pure W4A4 unsuitable for decoder-only LMs |
| Block-wise PTQ | 1–2 pt drop (MMLU, IFEval) (Lee et al., 10 Jun 2025) | ~2x INT8 | Joint input-output MSE minimization |
| SplitQuantV2 | INT4: +11.76 pp over baseline; matches FP32 (Song et al., 7 Mar 2025) | CPU-only | Requires no calibration data |
| PolarQuant→INT4 | ΔPPL=+0.19 to FP16 (Qwen3.5-9B) (Vicentino, 30 Mar 2026) | 43 tok/s, 6.5 GB VRAM | No calibration needed; Hadamard rotation dominant |
| QA-LoRA INT4 (adapt) | Matches or exceeds QLoRA-PTQ (4+16) (Xu et al., 2023) | ≥50% faster than QLoRA | Direct INT4 fusion, group pooling |
| Long-context tasks | –1.8% (AWQ), –2.7% (GPTQ), –6.9% (BNB-nf4) (Mekala et al., 26 May 2025) | n/a | Losses increase for retrieval, languages other than English, small models |
| Mixed-precision/MatQuant | Task gaps <0.5% (Nair et al., 10 Feb 2025, Zhang et al., 2023) | As above | Supports deployment-wide trade-offs |
INT4 quantization offers 60–68% model size and cost reductions and up to 3× throughput gains on edge hardware (Hasan, 2024). Power usage drops by up to 60%. Reduction in latency and memory is significant for serving LLMs, especially in synchronous settings and on mid-tier GPUs (Kurtic et al., 2024). However, INT4 accuracy may degrade sharply in long-context or multilingual scenarios if calibration and outlier control are insufficient (Mekala et al., 26 May 2025).
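As a back-of-the-envelope check on the size figures, the weight footprint of group-wise INT4 can be computed directly. The sketch below assumes a hypothetical 7-billion-parameter model, FP16 as the baseline, a group size of 128, and one 16-bit scale per group; it counts weights only.

```python
def int4_weight_bytes(n_params, group_size=128, scale_bits=16):
    """Bytes needed for 4-bit codes plus one scale per group of weights."""
    bits_per_weight = 4 + scale_bits / group_size
    return n_params * bits_per_weight / 8

n_params = 7_000_000_000                 # hypothetical model size
fp16_bytes = n_params * 2                # 16 bits per weight
int4_bytes = int4_weight_bytes(n_params) # 4.125 bits per weight
reduction = 1 - int4_bytes / fp16_bytes
```

Under these assumptions the weights shrink from 14 GB to about 3.6 GB, a reduction of roughly 74%. This is an idealized, weights-only figure; reported end-to-end reductions can be lower, for example when some tensors are kept in higher precision.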
4. Special Considerations and Best Practices
- Always employ group-wise or block-wise scaling for weights; symmetric quantization is preferred for hardware simplicity; per-channel scaling further improves fidelity (Hasan, 2024, Kurtic et al., 2024).
- Deploy outlier-smoothing (runtime or offline) for activations in W4A4; Hadamard-based rotations (PolarQuant, RRS) or mean-subtraction (attention INT4) are highly effective (Yi et al., 2024, Vicentino, 30 Mar 2026, Zhang et al., 2024).
- Calibration data must be high quality and representative of deployment context; for long-context LLMs, calibration at long context length is essential; for multilingual deployment, include all target languages (Mekala et al., 26 May 2025, Kurtic et al., 2024).
- For mixed-precision strategies or extremely low bit-widths, use progressive PTQ with block-level calibration, followed by distillation-based QAT for robust INT2 or hybrid deployment (Lee et al., 10 Jun 2025, Nair et al., 10 Feb 2025).
- For inference hardware, INT4 matrix multiplication cores are available in a number of recent accelerator architectures; both integer (INT4) and low-bit floating-point (FP4/FP8) formats are supported on current and upcoming accelerators (Zhang et al., 2023, 2505.20839).
5. Advanced and Emerging Methodologies
- PolarQuant demonstrates near-lossless INT4 performance using block-wise Hadamard rotation and Lloyd–Max quantization, serving as an effective preprocessing step for INT4 pipelines (Vicentino, 30 Mar 2026).
- SageAttention2 leverages per-thread INT4 quantization for attention Q and K, with mean-centering and a hybrid FP8/FP32 accumulation scheme, achieving ≈3× kernel throughput versus prior state-of-the-art FlashAttention2 at negligible accuracy loss (Zhang et al., 2024).
- FireQ introduces INT4/FP8 hybrid kernels, two-stage outlier smoothing (absmean scaling, RoPE-aware normalization), and FlashAttention pipeline optimization, outperforming previous solutions for LLM inference acceleration (2505.20839).
- SplitQuantV2 and related function-preserving approaches partition layers via k-means clustering, enabling high-accuracy INT4 without GPU/large calibration demand, particularly vital for edge and NPU deployment scenarios (Song et al., 7 Mar 2025).
- Block-wise PTQ (FlexRound, OmniQuant) underpins state-of-the-art pipelines for progressive (e.g., INT4→INT2) quantization in instruction-tuned and general LLMs (Lee et al., 10 Jun 2025).
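The mean-centering step mentioned for attention can be illustrated numerically. Subtracting the per-channel mean of K over tokens shifts every score in a row by the same constant, which softmax ignores, while shrinking the range that the quantizer must cover. The sketch below is a simplified illustration: it quantizes only K, per token, with plain symmetric INT4, and uses synthetic data with a shared channel offset; it is not the SageAttention2 kernel.

```python
import numpy as np

def softmax(s):
    """Numerically stable row-wise softmax."""
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def quant_int4_rows(x):
    """Symmetric INT4 fake quantization with one max-abs scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
n, d = 32, 64
q = rng.normal(size=(n, d))
# Keys with a large per-channel offset shared by all tokens (synthetic).
k = rng.normal(size=(n, d)) + 6.0 * rng.normal(size=(1, d))
k_centered = k - k.mean(axis=0, keepdims=True)

inv_sqrt_d = 1.0 / np.sqrt(d)
p_ref = softmax(q @ k.T * inv_sqrt_d)
p_centered = softmax(q @ k_centered.T * inv_sqrt_d)   # same probabilities as p_ref

p_plain_q = softmax(q @ quant_int4_rows(k).T * inv_sqrt_d)
p_center_q = softmax(q @ quant_int4_rows(k_centered).T * inv_sqrt_d)
err_plain = float(np.abs(p_plain_q - p_ref).mean())
err_center = float(np.abs(p_center_q - p_ref).mean())
```

Centering is exact in full precision (`p_centered` matches `p_ref`), and after quantization the centered keys reproduce the attention probabilities more closely because their 16 levels are no longer spent on the shared offset.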
6. Limitations, Failure Modes, and Deployment Recommendations
- Decoder-only LLMs are highly sensitive to INT4 W4A4; hybrid precision (e.g., keeping key attention or embedding layers at higher bit-width), activation smoothing, and careful calibration are required for robust deployment (Wu et al., 2023, Yi et al., 2024).
- Long-context reasoning and multilingual tasks are particularly vulnerable to cumulative quantization error; mixed-precision fallback for critical layers and calibration with in-domain long-context data are crucial (Mekala et al., 26 May 2025).
- Not all model architectures tolerate INT4 equally; larger LLMs are more robust, while smaller models may see >3% performance degradation on complex benchmarks (Kurtic et al., 2024).
- INT4 inference is limited by input/output data movement if activations remain at higher bit-widths; end-to-end INT4 is hardware feasible but often not accuracy-optimal without aggressive outlier control.
- For compressed training (optimizer states, gradients), stochastic rounding and adaptive error control must be used to avoid optimization stagnation (Zhang et al., 2024, Kim et al., 2024).
- Progressive quantization with Matryoshka or multi-format quantization provides hardware transparency while maximizing model quality (Nair et al., 10 Feb 2025, Zhang et al., 2023).
7. Outlook and Ongoing Research
- Continued research aims to make INT4 quantization robust without retraining or large calibration (e.g., activation smoothing, function-preserving layer splitting). Practical deployment in NLP, vision, and multi-modal domains depends on tuning calibration and error-minimizing quantization parameters to target workloads.
- Extreme low-bit regimes (INT2) remain fragile, but INT4 consistently enables attractive performance–accuracy trade-offs for inference and efficient training.
- With increasingly powerful tensor cores and new data formats (FP4, FP8), future quantized systems are expected to blur the boundaries between integer and low-bit floating-point deployments, with methods such as MoFQ/MatQuant being the standard for maximally efficient, unified model serving and adaptation (Zhang et al., 2023, Nair et al., 10 Feb 2025).
For further details and algorithmic recipes, refer directly to the cited works (Vicentino, 30 Mar 2026, Song et al., 7 Mar 2025, Wu et al., 2023, Yi et al., 2024, Kurtic et al., 2024, Hasan, 2024, Nair et al., 10 Feb 2025, Lee et al., 10 Jun 2025, Kim et al., 2024, Zhang et al., 2024, 2505.20839, Zhang et al., 2024, Zhang et al., 2023, Mekala et al., 26 May 2025).