INT8 Scalar Quantization in Deep Learning

Updated 24 November 2025
  • INT8 scalar quantization is a method that maps high-precision floating-point values to 8-bit signed integers using a linear scaling factor, achieving a 4× reduction in memory size.
  • It employs per-tensor, per-channel, or per-block scaling calibrated via dynamic range or statistics to minimize quantization error, often resulting in less than 1% accuracy loss.
  • Empirical results show improvements in efficiency and energy savings, with up to 10× speedup on hardware accelerators and significant reductions in model size across diverse applications.

INT8 scalar quantization is a foundational technique for compressing and accelerating DNNs and LLMs by mapping high-precision floating-point tensors to 8-bit signed integers through a linear (scalar) mapping. The resulting 4× reduction in size (FP32→INT8) enables both memory savings and execution on fixed-point hardware with minimal performance loss when properly calibrated. Key applications include inference and training across vision, language, and retrieval models, with increasing deployment on mobile, edge, and data-center accelerators.

1. Mathematical Formulation and Core Algorithms

INT8 scalar quantization maps a floating-point value $w \in \mathbb{R}$ to an 8-bit integer $q \in \mathbb{Z}_8$ with a scale factor (often called $\gamma$ or $S$):

$$q = \mathrm{clip}\!\left(\mathrm{round}(w \cdot \gamma),\, q_{\min},\, q_{\max}\right), \qquad \hat{w} = \frac{q}{\gamma},$$

where $q_{\min} = -128$, $q_{\max} = 127$, and $b = 8$. The scale factor $\gamma$ (or $S$) can be derived from the data range or from second-moment statistics:

  • In post-training quantization (PTQ): $\gamma = \frac{2^b - 1}{w_{\max} - w_{\min}}$.
  • In quantization-aware training (QAT): $\gamma = \sqrt{\mathbb{E}[w^2] / \mathbb{E}[Q(w;\gamma)^2]}$, enforcing energy preservation between the quantized and original tensors (Hasan, 9 Nov 2024).

Zero-point is typically omitted (symmetric quantization), particularly in high-performance INT8 implementations (Wu et al., 2020, Chen et al., 25 Sep 2024).
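As a concrete illustration, the following NumPy sketch implements symmetric per-tensor quantization and dequantization as defined above; the function names and the max-absolute-value rule for choosing $\gamma$ are illustrative assumptions, not any particular paper's implementation.

```python
import numpy as np

def symmetric_int8_quantize(w: np.ndarray):
    """Quantize a float tensor to INT8 with a single symmetric scale (no zero-point)."""
    # Scale chosen so that the largest magnitude maps near the INT8 boundary (assumption).
    gamma = 127.0 / max(np.abs(w).max(), 1e-12)
    q = np.clip(np.round(w * gamma), -128, 127).astype(np.int8)
    return q, gamma

def symmetric_int8_dequantize(q: np.ndarray, gamma: float) -> np.ndarray:
    """Recover an approximation of the original tensor: w_hat = q / gamma."""
    return q.astype(np.float32) / gamma

# Example: the reconstruction error stays small relative to the tensor's dynamic range.
w = np.random.randn(4, 8).astype(np.float32)
q, gamma = symmetric_int8_quantize(w)
w_hat = symmetric_int8_dequantize(q, gamma)
print("max abs error:", np.abs(w - w_hat).max())
```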

Per-tensor, per-channel, or per-block scaling may be employed:

  • Per-tensor: One scale/zero-point per tensor (dense retrieval, basic LLMs) (Pati, 17 Nov 2025).
  • Per-channel: One scale per output channel (CNNs, increases accuracy for variable distributions) (Zhao et al., 2021).
  • Per-block: Partition the tensor into $B \times B$ blocks, each with its own scale (transformer/GPU-friendly, balances error and computation) (Xi et al., 19 Mar 2024).
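The sketch below contrasts per-tensor and per-channel scaling on a weight matrix with one outlier channel; the per-row channel convention and the outlier setup are illustrative assumptions.

```python
import numpy as np

def per_tensor_scale(w: np.ndarray) -> float:
    # One symmetric scale for the whole tensor.
    return 127.0 / max(np.abs(w).max(), 1e-12)

def per_channel_scales(w: np.ndarray) -> np.ndarray:
    # One symmetric scale per output channel (rows here, by assumption).
    return 127.0 / np.maximum(np.abs(w).max(axis=1), 1e-12)

w = np.random.randn(16, 64).astype(np.float32)
w[0] *= 10.0  # one channel with a much wider range than the rest

gamma_t = per_tensor_scale(w)
gamma_c = per_channel_scales(w)

q_t = np.clip(np.round(w * gamma_t), -128, 127)
q_c = np.clip(np.round(w * gamma_c[:, None]), -128, 127)

err_t = np.abs(w - q_t / gamma_t).mean()
err_c = np.abs(w - q_c / gamma_c[:, None]).mean()
print(f"mean abs error  per-tensor: {err_t:.5f}  per-channel: {err_c:.5f}")
```

With a single shared scale, the outlier channel inflates the quantization step for every other channel; the per-channel variant avoids exactly this source of error.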

2. Calibration, Scale Optimization, and Practical Implementation

The selection of the scale and zero-point critically impacts quantization error. Calibration is typically performed on a small calibration set:

  • For PTQ, initial scale is determined by dynamic range:

$$s = \frac{w_{\max} - w_{\min}}{2^{b} - 1}, \qquad \gamma = 1/s$$

  • In frameworks like EasyQuant, alternating greedy optimization maximizes cosine similarity between quantized output and true FP output per layer, often using 1D grid search over scale for weights and activations (Wu et al., 2020).
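The simplified sketch below illustrates this style of calibration: the scale is initialized from the dynamic range and then refined by a 1D grid search that maximizes cosine similarity between the FP32 layer output and the output computed with (fake-)quantized weights. The grid bounds, layer shapes, and calibration batch are illustrative assumptions in the spirit of the EasyQuant procedure, not its exact implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a.ravel(), b.ravel()) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fake_quantize(x, gamma):
    # Quantize to INT8 and immediately dequantize, so error can be measured in float.
    return np.clip(np.round(x * gamma), -128, 127) / gamma

def grid_search_scale(w, x, n_steps=100):
    """1D grid search over the weight scale, maximizing cosine similarity
    between the FP32 layer output and the output with quantized weights."""
    y_fp = x @ w.T
    base = 127.0 / np.abs(w).max()
    best_gamma, best_sim = base, -1.0
    for gamma in np.linspace(0.5 * base, 1.5 * base, n_steps):
        y_q = x @ fake_quantize(w, gamma).T
        sim = cosine_similarity(y_fp, y_q)
        if sim > best_sim:
            best_gamma, best_sim = gamma, sim
    return best_gamma, best_sim

# Small calibration batch and layer (illustrative shapes).
x = np.random.randn(32, 64).astype(np.float32)
w = np.random.randn(128, 64).astype(np.float32)
gamma, sim = grid_search_scale(w, x)
print(f"chosen scale {gamma:.3f}, cosine similarity {sim:.6f}")
```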

Table: Example INT8 Quantization Performance (EasyQuant, ImageNet-1k, Top-1)

Model        | FP32   | INT8 (TRT) | INT8 (EQ)
ResNet-50    | 75.20% | 75.04%     | 75.13%
MobileNet-V1 | 69.33% | 68.74%     | 68.84%

Advanced schemes deploy per-channel or per-block scale factors to further minimize error, especially where per-tensor statistics are too coarse (notably for gradient quantization in training pipelines (Zhao et al., 2021), block-level in transformers (Xi et al., 19 Mar 2024), or token-level in self-attention (Chen et al., 25 Sep 2024)).
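A minimal sketch of per-block scaling for a 2D tensor follows, partitioning it into $B \times B$ tiles with one symmetric scale each; the tile size and the assumption that the dimensions divide evenly by $B$ are illustrative simplifications.

```python
import numpy as np

def quantize_per_block(w: np.ndarray, B: int = 32):
    """Quantize a 2D tensor with one symmetric INT8 scale per B x B block.

    Assumes both dimensions are divisible by B (illustrative simplification).
    """
    R, C = w.shape
    blocks = w.reshape(R // B, B, C // B, B)                # (row-block, B, col-block, B)
    amax = np.abs(blocks).max(axis=(1, 3), keepdims=True)   # per-block max magnitude
    gamma = 127.0 / np.maximum(amax, 1e-12)                 # per-block scale
    q = np.clip(np.round(blocks * gamma), -128, 127).astype(np.int8)
    return q.reshape(R, C), gamma.squeeze()

w = np.random.randn(128, 128).astype(np.float32)
q, gamma = quantize_per_block(w, B=32)
print(q.shape, gamma.shape)   # (128, 128) (4, 4): one scale per 32x32 tile
```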

Affine quantization (with zero-point) is also used for embeddings:

  • $q_i = \mathrm{clamp}\!\left(\mathrm{round}(x_i / s + z),\, q_{\min},\, q_{\max}\right)$
  • $z = \mathrm{round}(-x_{\min}/s) + q_{\min}$

This enables mapping arbitrary float ranges to INT8 for dense retrieval, yielding near-lossless retrieval performance in practice at 4× compression (Pati, 17 Nov 2025).
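A sketch of this affine mapping applied to a single embedding vector, with reconstruction to check fidelity; the helper names and the 768-dimensional example are illustrative assumptions.

```python
import numpy as np

def affine_int8_quantize(x: np.ndarray):
    """Map an arbitrary float range [x_min, x_max] onto [-128, 127] with a zero-point."""
    x_min, x_max = float(x.min()), float(x.max())
    s = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    z = int(round(-x_min / s)) - 128               # zero-point: round(-x_min / s) + q_min
    q = np.clip(np.round(x / s) + z, -128, 127).astype(np.int8)
    return q, s, z

def affine_int8_dequantize(q, s, z):
    return (q.astype(np.float32) - z) * s

emb = np.random.randn(768).astype(np.float32)
q, s, z = affine_int8_quantize(emb)
rec = affine_int8_dequantize(q, s, z)
# Cosine similarity between original and reconstructed embedding stays close to 1.
cos = float(emb @ rec / (np.linalg.norm(emb) * np.linalg.norm(rec)))
print(f"cosine(original, reconstructed) = {cos:.5f}")
```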

3. Training and Backpropagation with INT8 Quantization

End-to-end INT8 training is realized by quantizing not just weights and activations, but also gradients, errors, batch-norm statistics, and updates (Yang et al., 2019):

  • Forward and backward passes operate in the INT8 domain using integer quantization, e.g., $Q_A(a) = Q(a, k_A)$, $Q_{E_1}(e) = SQ(e, k_{E_1})$.
  • Gradients are quantized either channelwise (vectorized quantization) or globally; channelwise approaches yield lower error, especially for non-Gaussian gradient distributions, when combined with magnitude-aware clipping and per-channel threshold detection (Zhao et al., 2021).
  • Quantized momentum and optimizer updates are implemented with bit-width constraints ensuring no "escape hatch" to FP in the training loop.

WAGEUBN implements direct, constant, and shift quantizers to address rounding artifacts and precision mismatches in different data paths, while keeping the full loop in INT8 (Yang et al., 2019).

Empirical results on ResNet/ImageNet show a 1–3% top-1 accuracy drop for pure INT8 end-to-end training, with 4× memory reduction and up to 10–30× energy-efficiency improvement on FPGAs (Yang et al., 2019).
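To make channelwise gradient quantization concrete, the sketch below assigns one symmetric INT8 scale per output channel and clips rare outliers with a high-percentile threshold; the percentile rule and the Laplace-distributed gradients are illustrative assumptions, not the exact scheme of the cited works.

```python
import numpy as np

def quantize_grad_channelwise(g: np.ndarray, clip_pct: float = 99.9):
    """INT8-quantize a gradient tensor with one symmetric scale per output channel.

    A per-channel clipping threshold (here a high percentile of the magnitudes,
    an illustrative choice) limits the influence of rare outliers on the scale.
    """
    thresh = np.percentile(np.abs(g), clip_pct, axis=1)      # per-channel clip value
    thresh = np.maximum(thresh, 1e-12)
    gamma = 127.0 / thresh                                   # per-channel scale
    q = np.clip(np.round(g * gamma[:, None]), -128, 127).astype(np.int8)
    return q, gamma

def dequantize_grad(q: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) / gamma[:, None]

g = np.random.laplace(scale=0.01, size=(256, 512)).astype(np.float32)  # heavy-tailed gradients
q, gamma = quantize_grad_channelwise(g)
g_hat = dequantize_grad(q, gamma)
print("relative error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))
```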

4. Hardware Realization and Execution Efficiency

Modern inference and training accelerators favor scalar INT8 quantization due to:

  • Fixed 8-bit integer arithmetic units present in all major hardware backends (ARM NEON, NVIDIA Tensor Cores, mobile DSPs, etc.).
  • Quantized representations being 4× smaller than FP32, increasing arithmetic intensity and memory throughput.

For CPUs lacking native INT8 SIMD, software techniques such as Scalar Arithmetic Multiple Data (SAMD) partition machine words into parallel INT8 lanes for efficient bit-precise arithmetic, achieving 4–10× speedups over native 8-bit or FP32 on both ARM and x86 (Anderson et al., 2018).
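A toy Python illustration of the underlying SWAR ("SIMD within a register") idea: four unsigned 8-bit lanes packed into one 32-bit word are added in a single pass, with masking so that carries never cross lane boundaries. Real SAMD implementations operate on machine registers with signed lanes and multiplications (Anderson et al., 2018), so this only sketches the principle.

```python
# Toy SWAR sketch: four unsigned 8-bit lanes packed into one 32-bit Python int,
# added lane-wise without carries leaking into the neighboring lane.

H = 0x80808080          # high bit of every 8-bit lane
L = 0x7F7F7F7F          # low 7 bits of every lane

def pack(lanes):
    """Pack four values in [0, 255] into one 32-bit word."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))

def unpack(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def swar_add(a, b):
    """Lane-wise addition modulo 256: carries never propagate into the next lane."""
    return ((a & L) + (b & L)) ^ ((a ^ b) & H)

x, y = pack([10, 200, 17, 255]), pack([5, 100, 240, 1])
print(unpack(swar_add(x, y)))   # [15, 44, 1, 0]  (each lane wraps mod 256)
```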

On GPUs, INT8 Tensor Cores accelerate matmuls, and quantized dataflow designs eliminate frequent quantize/dequantize overheads (e.g., Jetfire INT8 transformer pretraining is 1.4× faster and 1.5× leaner in memory than FP16 (Xi et al., 19 Mar 2024)). INT-FlashAttention implements per-token or per-block INT8 quantization to realize a >70% inference speedup versus FP16, with a >45% quantization error reduction compared to FP8 (Chen et al., 25 Sep 2024).

Empirical deployment on edge devices (e.g., RK3399, Qualcomm Hexagon) demonstrates 2.4× throughput and 40% power reduction at minimal (<1%) accuracy loss (Hasan, 9 Nov 2024, Wu et al., 2020, Uss et al., 22 May 2024).

5. Extensions, Innovations, and Advanced Error Mitigation

Recent work addresses the limits and extensions of scalar INT8 quantization:

  • Mixed-precision optimization with classical Lagrangian allocation across layers: $b^*_l = \frac{1}{2} \log_2\!\left(\frac{\alpha_l \sigma_l^2}{\lambda}\right)$; under uniform INT8 all $b_l = 8$ (Hasan, 9 Nov 2024). A small numerical sketch follows after this list.
  • Redundant output representations (2D Hilbert curve mapping) reduce quantization error by 5× for bounded regression outputs at <7% runtime cost (Uss et al., 22 May 2024).
  • Clipping and scale distribution strategies, including outlier-aware per-block partitioning, per-channel scaling, and magnitude-aware loss terms, mitigate harsh artifacts and performance drops in highly nonuniform tensor distributions (Zhao et al., 2021).
  • For dense retrieval, INT8 scalar quantization with global per-tensor scaling achieves a <0.2% nDCG@10 drop over float32, outperforming same-compression-ratio autoencoders by >6% (Pati, 17 Nov 2025).
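To make the Lagrangian allocation concrete, the sketch below evaluates the closed-form bit-width for a few layers; the sensitivity values $\alpha_l$, variances $\sigma_l^2$, and multiplier $\lambda$ are made-up numbers for illustration only.

```python
import numpy as np

# Illustrative layer statistics (hypothetical values, not from the cited work):
# alpha = layer sensitivity, sigma2 = weight variance, lam = Lagrange multiplier.
alpha = np.array([1.0, 4.0, 0.5])
sigma2 = np.array([0.02, 0.08, 0.01])
lam = 1e-6

# b*_l = 0.5 * log2(alpha_l * sigma2_l / lambda)
b_star = 0.5 * np.log2(alpha * sigma2 / lam)
print(np.round(b_star, 2))   # [7.14 9.14 6.14]; uniform INT8 simply fixes b_l = 8
```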

Table: Model Compression and Effect on Retrieval (BEIR SciFact)

Type   | Bytes/vec | Compression | Δ nDCG@10
FP32   | 1536      | 1×          | 0.0000
FP16   | 768       | 2×          | –0.00018
INT8   | 384       | 4×          | –0.00178
Binary | 48        | 32×         | –0.46621

6. Deployment, Toolchains, and Limitations

Deployment pipelines exploit vendor-supplied automation (TensorRT, ONNX Runtime, QAT toolchains) for quantization graph rewriting and calibration (Hasan, 9 Nov 2024, Wu et al., 2020). INT8 is universally supported, unlike INT4 or custom bit-widths.

Main practical limitations are:

  • Sensitivity to outliers: A single large element can distort global scale, motivating per-channel/block strategies.
  • Quantization-aware training (QAT) can further close accuracy gaps at additional training and memory cost.
  • Scaling parameters ($\gamma$, $S$) must be stored and supported in inference code but are negligible in overhead.
  • Extremely low-bit quantization (e.g., binary) causes catastrophic performance drops in high-dimensional retrieval and regression (Pati, 17 Nov 2025, Uss et al., 22 May 2024).
  • Pure post-training quantization may incur higher errors in highly sensitive layers or ill-conditioned models, for which training-aware schemes are preferable (Hasan, 9 Nov 2024).

7. Impact and Empirical Results

INT8 scalar quantization, when combined with robust scale calibration and, where suitable, quantization-aware training, yields:

  • 68% model size reduction and 40% compute/energy savings in LLMs of up to 1B parameters, with <6% performance drop (Hasan, 9 Nov 2024).
  • Speedups of 1.4–2.4× (and up to 10× in hand-optimized/packed software) for both training and inference on edge, CPU, and GPU, typically at <1% accuracy loss for vision and retrieval (Wu et al., 2020, Anderson et al., 2018).
  • End-to-end INT8 training matches or slightly lags FP16/FP32, with full-precision convergence achieved by tuning mid-layer errors, using mixed-precision for sensitive paths, or leveraging redundant output representations (Yang et al., 2019, Zhao et al., 2021, Uss et al., 22 May 2024).

INT8 scalar quantization thus represents a practical, rigorously validated, and broadly supported approach for efficient deployment of state-of-the-art neural networks on diverse hardware platforms (Hasan, 9 Nov 2024, Wu et al., 2020, Pati, 17 Nov 2025, Chen et al., 25 Sep 2024, Xi et al., 19 Mar 2024, Zhao et al., 2021, Yang et al., 2019, Anderson et al., 2018, Uss et al., 22 May 2024).
