Integer Quantization Techniques
- Integer quantization is a method for encoding neural network weights and activations into low-precision integers to accelerate computation and reduce memory and power consumption.
- It employs techniques like uniform affine, symmetric, and dyadic scaling to map floating-point values into fixed-bit representations while mitigating quantization error.
- Recent advances integrate calibration, mixed-precision strategies, and integer-only nonlinear approximations to ensure high accuracy and efficiency in both inference and training.
Integer quantization refers to the encoding of neural network weights, activations, and sometimes gradients into low-precision, fixed-width signed or unsigned integer representations, permitting inference and/or training to be performed using integer arithmetic operations. Under this paradigm, each tensor element is constrained to a fixed bit-width (e.g., 8, 4, 2 bits) and mapped via a quantization function such as uniform affine, symmetric, or dyadic scaling. The main drivers for integer quantization are accelerated computation on integer math pipelines, reduced DRAM and cache traffic, and lower power and area consumption on hardware. This article surveys key principles, algorithmic pipelines, hardware integration, and recent research directions in integer quantization for neural networks.
1. Mathematical Foundations of Integer Quantization
At its core, integer quantization encodes a real tensor $x$ using a scale factor $s > 0$ and an optional zero-point $z$. Uniform affine quantization (Wu et al., 2020, Jacob et al., 2017, Kim et al., 2021) is widely adopted:
- Symmetric quantization:
$$q = \operatorname{clip}\!\left(\operatorname{round}(x/s),\ -2^{b-1},\ 2^{b-1}-1\right),$$
mapping $x$ to $q$ in $[-2^{b-1},\ 2^{b-1}-1]$.
- Affine (asymmetric) quantization:
$$q = \operatorname{clip}\!\left(\operatorname{round}(x/s) + z,\ 0,\ 2^{b}-1\right),$$
with $q$ in $[0,\ 2^{b}-1]$; the real value $0$ is exactly representable (it maps to $z$).
Quantized inference replaces FP32 computation with integer arithmetic:
- Matrix multiplication:
$$Y \approx s_W\, s_X\, (Q_W Q_X)$$
for symmetric quantization; zero-point cross-terms are added for affine variants.
- Nonlinearities (GELU, Softmax, LayerNorm): Integer quantization requires polynomial or bit-shifting approximations (Kim et al., 2021, Kim et al., 19 Nov 2025, Chang et al., 2023).
Quantization error per element is bounded by $s/2$ for rounding, plus clipping error for outliers that fall outside the representable range.
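The symmetric and affine quantizers above can be sketched in NumPy as follows (function names are illustrative, not from any cited library):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Uniform symmetric quantization: q = clip(round(x/s), -2^(b-1), 2^(b-1)-1)."""
    qmax = 2 ** (bits - 1) - 1
    s = np.max(np.abs(x)) / qmax              # scale from the max-magnitude element
    q = np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

def quantize_affine(x, bits=8):
    """Uniform affine quantization: q = clip(round(x/s) + z, 0, 2^b - 1)."""
    qmax = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    s = (hi - lo) / qmax
    z = int(np.round(-lo / s))                # zero-point: real 0.0 maps exactly to z
    q = np.clip(np.round(x / s) + z, 0, qmax).astype(np.uint8)
    return q, s, z

x = np.array([-1.0, -0.5, 0.0, 0.75, 1.5])
q_s, s = quantize_symmetric(x)
q_a, s_a, z = quantize_affine(x)
# Dequantize and check the s/2 per-element rounding-error bound
assert np.all(np.abs(q_s * s - x) <= s / 2 + 1e-9)
assert np.all(np.abs((q_a.astype(np.int32) - z) * s_a - x) <= s_a / 2 + 1e-9)
```

Note that the affine variant maps the real value 0.0 exactly to the zero-point `z`, which keeps zero-padding and ReLU outputs exact.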
2. Integer-only Inference and Training Pipelines
Inference
Integer-only inference eliminates all floating-point computation. The process includes:
- Parameter quantization:
All weights/biases statically quantized using calibration statistics; typically per-channel for weights (Wu et al., 2020, Zhang et al., 2023).
- Activation quantization:
Offline calibration gathers activation ranges via minibatch statistics or synthetic surrogate data (Kim et al., 2021).
- Integer MAC and accumulation:
INT4–8 multiplies, INT32 accumulation, followed by integer rescale with fixed-point multiply and shift (dyadic arithmetic (Yao et al., 2020, Guo et al., 2021, Kim et al., 19 Nov 2025)).
- Nonlinearity rewrite:
All polynomial/log-shift approximations (GELU, Softmax, LayerNorm, Swish) are evaluated in integer (Kim et al., 2021, Kim et al., 19 Nov 2025, Chang et al., 2023).
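The INT8-multiply, INT32-accumulate, dyadic-rescale step of the pipeline above can be illustrated as follows (a minimal sketch; the `shift_bits` value and round-half-up convention are assumptions, and production kernels typically use per-channel multipliers):

```python
import numpy as np

def dyadic_rescale(acc, s_w, s_x, s_y, shift_bits=24):
    """Requantize an INT32 accumulator to INT8 using only an integer
    multiply and a right shift (dyadic arithmetic): the real multiplier
    M = s_w*s_x/s_y is approximated as m / 2^shift_bits."""
    M = (s_w * s_x) / s_y
    m = int(round(M * (1 << shift_bits)))          # fixed-point multiplier
    out = (acc.astype(np.int64) * m + (1 << (shift_bits - 1))) >> shift_bits
    return np.clip(out, -128, 127).astype(np.int8)

# INT8 operands, INT32 MAC accumulation, integer-only requantization.
rng = np.random.default_rng(0)
W = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
X = rng.integers(-128, 128, size=(8, 3), dtype=np.int8)
acc = W.astype(np.int32) @ X.astype(np.int32)      # INT32 accumulator
y = dyadic_rescale(acc, s_w=0.02, s_x=0.05, s_y=0.3)
# Matches a float reference up to rounding of the dyadic multiplier
ref = np.clip(np.round(acc * (0.02 * 0.05 / 0.3)), -128, 127)
assert np.max(np.abs(y.astype(np.int32) - ref)) <= 1
```

The point of the dyadic form is that no floating-point unit is touched at inference time: the float scales are folded into `(m, shift_bits)` once, offline.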
Training
Sub-8-bit integer training leverages grouped scaling (ShiftQuant (Guo et al., 2024)) and integer L₁ normalization (L₁BNQ). Gradients and backprop flows are quantized in groups, enabling GEMM compatibility and unbiased estimation:
$$\tilde{g} = s_g \cdot \operatorname{SR}(g / s_g), \qquad \mathbb{E}[\tilde{g}] = g,$$
where $\operatorname{SR}(\cdot)$ denotes stochastic rounding and the $s_g = 2^{k_g}$ are power-of-two scales per group. Integer-only L₁ normalization,
$$\hat{x} = \frac{x - \mu}{\operatorname{Q}\!\left(\frac{1}{n}\lVert x - \mu \rVert_1\right)},$$
where $\operatorname{Q}(\cdot)$ denotes the quantized L₁ norm, yields improved quantization tolerance and a smoother loss landscape.
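The grouped power-of-two scaling with stochastic rounding can be sketched as below (a simplification of the ShiftQuant scheme: the grouping strategy here is trivial and all names are illustrative; the unbiasedness property is what the sketch demonstrates):

```python
import numpy as np

def stochastic_round(x, rng):
    """SR(x): round down with prob 1-frac, up with prob frac, so E[SR(x)] = x."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

def quantize_grad_group(g, bits=8, rng=None):
    """Quantize one gradient group with a power-of-two scale s = 2^k.
    Power-of-two scales make later rescaling a pure bit shift."""
    rng = rng or np.random.default_rng(0)
    qmax = 2 ** (bits - 1) - 1
    k = int(np.ceil(np.log2(np.max(np.abs(g)) / qmax)))  # smallest 2^k covering range
    s = 2.0 ** k
    q = np.clip(stochastic_round(g / s, rng), -qmax - 1, qmax)
    return q.astype(np.int8), s

g = np.random.default_rng(1).normal(size=4096) * 1e-3
rng = np.random.default_rng(2)
# Unbiasedness: averaging many stochastic quantizations recovers g
est = np.mean([quantize_grad_group(g, rng=rng)[0] for _ in range(200)], axis=0)
_, s = quantize_grad_group(g)
assert np.mean(np.abs(est * s - g)) < 0.1 * s
```

Deterministic round-to-nearest would bias small gradients toward zero; stochastic rounding keeps the expected update equal to the true gradient, which is why it appears throughout integer-training work.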
3. Calibration, Optimization, and Mixed-Precision Selection
Calibration
Accurate calibration of quantization parameters is critical. Methods include:
- Min-max, percentile, KL-divergence: Used for setting weight and activation ranges (Wu et al., 2020, Yao et al., 2020).
- Zero-shot calibration: Synthetic data generated to match BatchNorm running statistics, enabling PTQ without real data (Kim et al., 2021).
- Data-aware polynomial fitting: Activation and function approximations fit over actual activation distributions for improved fidelity in vision/data-specific ranges (Kim et al., 19 Nov 2025).
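The difference between min-max and percentile calibration is easy to demonstrate on a heavy-tailed activation distribution (synthetic data; function names are illustrative):

```python
import numpy as np

def minmax_range(act):
    """Min-max calibration: cover every observed value."""
    return float(act.min()), float(act.max())

def percentile_range(act, p=99.9):
    """Percentile calibration: clip rare outliers to shrink the step size."""
    return float(np.percentile(act, 100 - p)), float(np.percentile(act, p))

def quant_mse(x, lo, hi, bits=8):
    """MSE after uniform affine quantization to the range [lo, hi]."""
    s = (hi - lo) / (2 ** bits - 1)
    q = np.clip(np.round((x - lo) / s), 0, 2 ** bits - 1)
    return float(np.mean((q * s + lo - x) ** 2))

# Two outliers inflate the min-max range, and hence the step s,
# wasting resolution on the bulk of the distribution.
rng = np.random.default_rng(0)
act = np.concatenate([rng.normal(0.0, 1.0, 100_000), [40.0, -35.0]])
inliers = act[:-2]
mse_minmax = quant_mse(inliers, *minmax_range(act))
mse_pct = quant_mse(inliers, *percentile_range(act))
assert mse_pct < mse_minmax   # clipping outliers sharpens inlier resolution
```

KL-divergence calibration generalizes the same idea: it searches for the clipping threshold that best preserves the activation distribution rather than fixing a percentile.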
Optimization
- Alternating minimization (decoupleQ): Quantization parameters are obtained by alternating minimization over mixed-integer quadratic programs that minimize layer output error at very low bit-widths (2–4 bits) (Guo et al., 2024).
- AdaQuant: Progressive layer-wise calibration optimizing quantization step and bias to minimize output MSE over calibration samples (Hubara et al., 2020).
- Integer Linear Programming (HAWQ-V3, FLIQS): Mixed-precision bit-width allocation solved using sensitivity (Hessian), latency, and model-size constraints (Yao et al., 2020, Dotzel et al., 2023).
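A toy version of output-MSE-driven calibration in the spirit of AdaQuant can make the idea concrete (a grid search stands in for the paper's gradient-based optimization; all names are illustrative):

```python
import numpy as np

def quantize(x, s, bits=8):
    """Uniform symmetric quantize-dequantize with step s."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / s), -qmax - 1, qmax) * s

def calibrate_step(W, X, bits=4, n_grid=80):
    """Pick the weight step size s minimizing the layer's *output* MSE on
    calibration samples X, rather than the weight-space MSE."""
    ref = X @ W
    s0 = np.max(np.abs(W)) / (2 ** (bits - 1) - 1)   # min-max step as baseline
    best_s, best_err = s0, np.inf
    # Shrinking s below min-max trades clipping error for rounding error.
    for s in np.concatenate([np.linspace(0.3 * s0, 1.2 * s0, n_grid), [s0]]):
        err = np.mean((X @ quantize(W, s, bits) - ref) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
X = rng.normal(size=(256, 64))          # calibration activations
s_opt, err_opt = calibrate_step(W, X)
s_mm = np.max(np.abs(W)) / 7            # plain min-max step at 4 bits
err_mm = float(np.mean((X @ quantize(W, s_mm, 4) - X @ W) ** 2))
assert err_opt <= err_mm                # output-aware search never does worse
```

Optimizing against the layer output rather than the weights themselves is the common thread of AdaQuant and decoupleQ; the bias term, omitted here for brevity, is optimized the same way.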
Mixed-Precision and Format Selection
Recent work reveals that optimal quantization format (INT vs. FP, bit-width) varies per layer and per tensor (Zhang et al., 2023, Dotzel et al., 2023):
- Mixture of Formats Quantization (MoFQ): Layer-wise selection yields best accuracy and speed for LLMs at 4–8 bits (Zhang et al., 2023).
- Dual Grained Quantization (DGQ): Weights quantized group-wise at INT4, then composed at INT8 for GEMM compatibility (Zhang et al., 2023).
- NestQuant: Integer-nesting splits weights into high/low bits for on-device switching across resource budgets (Xie et al., 22 Jun 2025).
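The nesting idea can be illustrated by splitting INT8 codes into a 4-bit coarse part plus a 4-bit refinement (a deliberately simplified sketch of bit-splitting, not the NestQuant algorithm itself):

```python
import numpy as np

def split_nested(q8):
    """Split INT8 codes into a high-4-bit part and a low-4-bit remainder, so a
    device can load only the high part (coarse INT4 model) or both (full INT8)."""
    q = q8.astype(np.int16) + 128            # shift to unsigned [0, 255]
    high = (q >> 4).astype(np.uint8)         # coarse 4-bit code
    low = (q & 0xF).astype(np.uint8)         # 4-bit refinement
    return high, low

def reconstruct(high, low=None):
    if low is None:                          # low-budget mode: INT4 only
        q = (high.astype(np.int16) << 4) + 8  # mid-point reconstruction
    else:                                    # full-budget mode: exact INT8
        q = (high.astype(np.int16) << 4) + low
    return (q - 128).astype(np.int8)

q8 = np.arange(-128, 128, dtype=np.int8)
high, low = split_nested(q8)
assert np.array_equal(reconstruct(high, low), q8)   # lossless with both parts
assert np.max(np.abs(reconstruct(high).astype(np.int16)
                     - q8.astype(np.int16))) <= 8   # bounded error with high only
```

The appeal for on-device deployment is that switching precision only requires loading or dropping the refinement tensor, not storing two separate models.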
4. Integer Quantization of Nonlinear Operations
Full integer-only quantization requires all non-linearities to be accurately approximated:
- GELU/erf: Data-aware quartic polynomial approximation (optimized for observed activation ranges), achieving lower approximation error than prior quadratic forms (Kim et al., 19 Nov 2025, Kim et al., 2021).
- Softmax: Shift-based approximations (Bit-Softmax, Shiftmax) and first-order Taylor expansion of the exponential, enabling integer-only exponentiation and normalization (Kim et al., 19 Nov 2025, Chang et al., 2023, Kim et al., 2021).
- LayerNorm: Integer Newton or bit-shifted approximations of inverse square root (Chang et al., 2023).
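The integer Newton iteration behind LayerNorm's square-root term can be sketched as below (I-BERT-style; the exact initialization and stopping rule differ across implementations):

```python
def int_sqrt(n: int) -> int:
    """floor(sqrt(n)) via Newton's method using only integer add/shift/divide,
    as needed for the standard-deviation term in integer-only LayerNorm."""
    if n < 2:
        return n
    x = 1 << ((n.bit_length() + 1) // 2)    # initial guess >= sqrt(n)
    while True:
        y = (x + n // x) >> 1               # Newton step in integer arithmetic
        if y >= x:                          # iterates decrease monotonically
            return x
        x = y

# Exact floor-square-root with no floating-point operations
assert all(int_sqrt(n) == int(n ** 0.5) for n in range(10_000))
```

Because the iterates converge monotonically from above, the loop terminates in a handful of steps for 32-bit variances, which keeps the LayerNorm kernel fully in the integer pipeline.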
A unified metric integrating sensitivity, perturbation, and operation count drives per-layer selection of approximation for mixed PTQ (Kim et al., 19 Nov 2025).
5. Hardware Implementation and Efficiency
Integer quantization schemes directly impact memory footprint, throughput, and power:
- Integer arithmetic modules: All operations (multiplication, addition, shift) performed in INT4/8/16, accumulation in INT32 on CPUs, GPUs, DSPs, and FPGAs (Guo et al., 2021, Yin et al., 2023, Yao et al., 2020).
- Fine-grained quantization bottlenecks: Frequent float-to-int conversions within fine-grained quantization loops are eliminated by integer scale techniques, restoring full hardware efficiency (Li et al., 2024).
- Multiplier-less quantization: Sharing a single scale factor across SNN weights and membrane potentials eliminates floating-point multipliers entirely, reducing hardware area and energy by up to 90% (Yin et al., 2023).
- TVM integer-only library: Mixed-precision dyadic quantization integrated as first-class graph attributes, enabling deployment and additional speedups for INT4 over INT8 on ResNet-50 (Yao et al., 2020, Guo et al., 2021).
6. Empirical Results, Benchmarking, and Limitations
- Common tasks: Vision (ResNet, MobileNet, ViT, Swin), language (BERT, RoBERTa, LLaMA), speech (QuartzNet, Jasper, Conformer), SNNs (VGG, ResNet-19).
- Accuracy preservation: Most schemes see negligible accuracy drop at 8 bits and modest degradation at 4 bits: sub-1% loss on “easy” tasks (MNIST, CIFAR-10) and tolerable loss on “hard” tasks (ImageNet, BERT QA) (Wu et al., 2020, Guo et al., 2021, Kim et al., 19 Nov 2025).
- Memory, speed, footprint:
- 4× or greater memory and bandwidth savings (e.g., INT8 vs. FP32; larger at lower bit-widths) (Kim et al., 2023, Yin et al., 2023).
- Substantial end-to-end GPU speedups from integer-only logic (Jacob et al., 2017, Kim et al., 2021).
- On-device quantization with integer nesting reduces model-switching overhead (Xie et al., 22 Jun 2025).
- Failure modes: Sub-4-bit quantization induces accuracy collapse unless mitigated by grouped scaling, polynomial approximation, or alternating optimization (decoupleQ). Activations with high dynamic range remain challenging.
7. Current Trends and Future Directions
- Adaptive granularity: Per-layer mixed-precision, per-group scaling, and integer nesting are enabling dynamic resource tradeoffs (Xie et al., 22 Jun 2025, Dotzel et al., 2023, Zhang et al., 2023).
- Polynomial/data-aware function approximation: Vision and transformer networks benefit from activation-range-specific fitting for GELU and Softmax (Kim et al., 19 Nov 2025).
- Zero-shot and calibration-free PTQ: Surrogate-based calibration unlocks PTQ without data access in privacy-sensitive or federated settings (Kim et al., 2021).
- Sub-8-bit training: ShiftQuant and L₁ normalization show that pure-integer training with stability and accuracy close to FP32 is attainable for large networks (Guo et al., 2024).
- Plug-and-play integer scale: Integration of integer-only kernels into standard LLM quantization pipelines (GPTQ, AWQ, Omniquant) yields speed and accuracy improvements without changes to model code (Li et al., 2024).
Summary: Integer quantization techniques now support full integer-only inference and partial training across a wide array of architectures, leveraging uniform/dyadic/affine quantizers, calibration- and data-driven optimization, polynomial and bit-shifted nonlinear approximations, and mixed-precision allocation. These advances continue to redefine the Pareto frontier for memory, latency, and accuracy in deep learning model deployment (Wu et al., 2020, Kim et al., 19 Nov 2025, Yao et al., 2020, Li et al., 2024, Chang et al., 2023).