Fully Quantized Training (FQT)
- Fully Quantized Training (FQT) is defined as end-to-end low-precision neural network optimization that quantizes weights, activations, and gradients to maximize memory, energy, and compute efficiency.
- It leverages unbiased stochastic rounding and innovative quantizer designs such as PSQ, BHQ, and ShiftQuant to control variance and ensure stable convergence in nonconvex settings.
- FQT implementations use integer-only and microformat pipelines across various architectures, achieving near-FP32 accuracy with significant resource savings in domains like LLMs and image classification.
Fully quantized training (FQT) refers to end-to-end neural network optimization in which all primary tensors—weights, activations, and gradients—are represented and operated on in low-precision (integer or floating-point microformat) domains throughout every stage of forward and backward propagation, including model updates. Unlike quantization-aware training (QAT), which typically restricts quantization to inference and forward activations, pure FQT eliminates floating-point shadow state and aims to maximize memory, energy, and compute efficiency by leveraging low-bitwidth arithmetic for every dataflow operation. This paradigm is increasingly critical for deployment on hardware accelerators, edge devices, and large-scale distributed training regimes, including those used for modern transformers and LLMs.
1. Statistical Foundations and Convergence Properties
The fundamental statistical principle underlying FQT is that quantized gradients—produced by symmetric, stochastic, low-bit quantizers—act as unbiased but higher-variance estimators of their full-precision counterparts. Given parameters $\theta$ and quantized gradients $\hat{g} = Q(\nabla f(\theta))$, unbiasedness $\mathbb{E}[\hat{g}] = \nabla f(\theta)$ holds when stochastic rounding is applied independently across elements. The variance, however, is strictly larger,

$$\mathrm{Var}[\hat{g}] \;=\; \mathrm{Var}[\nabla f(\theta)] \;+\; \mathbb{E}\big[\mathrm{Var}[Q(\nabla f(\theta)) \mid \nabla f(\theta)]\big],$$

where $Q$ is the layer-wise quantizer and the excess term grows with the quantization bin width and the Jacobian sensitivity of the layer. For per-tensor quantization, lowering the bitwidth by 1 halves the number of quantization levels, doubling the bin width and increasing the quantization variance fourfold, magnifying the efficiency/accuracy trade-off (Chen et al., 2020).
From a non-convex SGD convergence perspective, if the quantizer is unbiased and its variance is finite, standard results yield

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\big\|\nabla f(\theta_t)\big\|^2 \;\le\; O\!\left(\frac{f(\theta_1) - f^{*}}{\sqrt{T}}\right) + O\!\left(\frac{\sigma^2}{\sqrt{T}}\right),$$

with $\sigma^2$ bounding the increased gradient variance. Unbiasedness is critical: bias in quantized gradients induces systematic error and limits the attainable optima (Chen et al., 2020).
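To make the unbiasedness argument concrete, the following sketch (NumPy; function and variable names are illustrative, not from the cited work) implements elementwise stochastic rounding onto a symmetric per-tensor grid and numerically checks that averaging many independent quantizations recovers the original gradient.

```python
import numpy as np

def stochastic_round(x, num_bits=4, rng=None):
    """Unbiased stochastic rounding of x onto a symmetric num_bits integer grid.

    A single per-tensor scale maps x into grid units; each element is rounded
    up with probability equal to its fractional distance to the lower grid
    point, so E[Q(x)] = x holds elementwise.
    """
    rng = np.random.default_rng() if rng is None else rng
    qmax = 2 ** (num_bits - 1) - 1          # e.g. levels -7..7 for 4 bits
    scale = np.max(np.abs(x)) / qmax + 1e-12
    y = x / scale                           # map to grid units
    low = np.floor(y)
    q = low + (rng.random(y.shape) < (y - low))
    return q * scale                        # back to the original units

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.standard_normal(1024)
    # The mean of many independent quantizations approaches g (unbiasedness),
    # while each individual draw has strictly larger variance than g itself.
    est = np.mean([stochastic_round(g, rng=rng) for _ in range(2000)], axis=0)
    print("max |E[Q(g)] - g|:", np.abs(est - g).max())
```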
2. Quantizer Designs and Variance Reduction
Early FQT relied on per-tensor uniform (linear) quantization, which introduces excess variance when data are heavy-tailed or certain samples or channels dominate. Two principled designs address this:
- Per-Sample Quantizer (PSQ): Scales each sample (row) in the gradient matrix individually, significantly reducing variance when many rows are low-range. The variance is bounded by a sum of per-row terms proportional to $R_i^2$, where $R_i$ is the dynamic range of row $i$ (Chen et al., 2020); a minimal sketch follows this list.
- Block-Householder Quantizer (BHQ): Uses blockwise Householder reflections to spread outlier contributions and further reduce blockwise variance, yielding markedly better variance scaling in the pathological case where a single row dominates, while its computational overhead grows only modestly (Chen et al., 2020).
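A minimal per-sample quantizer (the sketch referenced in the PSQ item above) assigns each gradient row its own scale before stochastic rounding, so a few large-range rows no longer inflate the quantization bin for the rest. This is a simplified illustration, not the reference implementation of Chen et al. (2020).

```python
import numpy as np

def per_sample_quantize(G, num_bits=5, rng=None):
    """Per-sample (per-row) stochastic quantization of a gradient matrix G.

    Each row i gets its own scale derived from its own range, so rows with a
    small dynamic range see a proportionally small quantization bin and hence
    contribute little variance, regardless of how large other rows are.
    """
    rng = np.random.default_rng() if rng is None else rng
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(G), axis=1, keepdims=True) / qmax + 1e-12  # one scale per row
    Y = G / scale
    low = np.floor(Y)
    Q = low + (rng.random(Y.shape) < (Y - low))      # unbiased stochastic rounding
    return Q * scale

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    G = 0.01 * rng.standard_normal((256, 512))
    G[0] *= 100.0                                    # one dominant, high-range row
    err = per_sample_quantize(G, rng=rng) - G
    print("mean squared quantization error:", float(np.mean(err ** 2)))
```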
Recent frameworks generalize these approaches to group quantization with power-of-two shift-based scaling (ShiftQuant), enabling highly memory-local group-wise quantization and integer-only accumulation (ShiftMM), with variance arbitrarily close to the per-element lower bound for a moderate number of groups (Guo et al., 17 Nov 2024). Unbiased stochastic rounding remains an essential ingredient for achieving these variance guarantees and convergence stability.
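The shift-based grouping can be sketched by constraining each group's scale to a power of two, so rescaling inside an integer accumulator reduces to a bit shift; the group layout, rounding, and names below are simplifying assumptions rather than the published ShiftQuant/ShiftMM code.

```python
import numpy as np

def shift_group_quantize(x, num_bits=4, group_size=32, rng=None):
    """Group-wise quantization with power-of-two scales (shift-friendly).

    The flattened tensor is split into fixed-size groups; each group's scale
    is rounded up to the nearest power of two so that de-scaling inside an
    integer GEMM accumulator can be done with a bit shift instead of a
    multiply. Assumes x.size is a multiple of group_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    qmax = 2 ** (num_bits - 1) - 1
    flat = x.reshape(-1, group_size)
    raw_scale = np.max(np.abs(flat), axis=1, keepdims=True) / qmax + 1e-12
    exponent = np.ceil(np.log2(raw_scale))        # power-of-two exponent per group
    scale = 2.0 ** exponent
    y = flat / scale
    low = np.floor(y)
    q = low + (rng.random(y.shape) < (y - low))   # unbiased stochastic rounding
    # Return the dequantized tensor plus the integer codes and shift amounts.
    return (q * scale).reshape(x.shape), q.astype(np.int8), exponent.astype(np.int32)
```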
3. End-to-End Integer and Floating-Point Microformat Pipelines
FQT implementations fall into several system-level patterns:
- Fixed-point and Integer Quantization: All weights, activations, gradients, errors, and updates—plus batch normalization (“WAGEUBN” framework)—are stored and operated on in 8-bit integers (Yang et al., 2019). Constant quantizers (bespoke for gradients), shift quantizers (for errors), and layered scaling are used to retain critical signal variance. In hardware, this allows full pipeline acceleration with ≤0.5–1.0% loss for image classification (Yang et al., 2019).
- Blockwise Floating-Point Formats: Modern “microformats” such as NVFP4, which combines blocks of E2M1 FP4 values with an E4M3 per-block scale (block size 16), enable full FQT of LLMs and transformers, with all GEMMs and gradient updates performed in FP4 with scale factors (Chmiel et al., 25 May 2025, Chen et al., 31 Oct 2025). The forward pass uses round-to-nearest for stability; the backward and update passes use stochastic rounding to ensure unbiasedness. Empirical results show BF16-comparable accuracy for Llama2-7B on 200B tokens (Chmiel et al., 25 May 2025). A minimal sketch of this blockwise-scaling pattern follows this list.
- Group-Shared Exponent (GSE) Formats: For on-device LLM fine-tuning (GSQ-Tuning), one partitions tensors into non-overlapping groups, sharing a single exponent per group and representing mantissas as integers, allowing plug-in with LoRA-style adapters (Zhou et al., 18 Feb 2025). Integer-only training with shared exponent-mantissa layouts yields up to 11× area and 5× power reduction versus FP8, with sub-0.3% accuracy loss for language benchmarks.
- Pseudo-Quantization Training (PQT): Instead of hard rounding, PQT injects blockwise controlled stochastic noise (e.g., Gaussian rounded to small discrete values) into parameter updates and casts to low-precision FP. This eliminates optimizer inconsistency and allows dynamic adaptation of bitwidth, providing stability and efficiency at scale with ≈1.4% throughput penalty for LLMs (Ahn et al., 16 May 2025).
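The common pattern across these pipelines is low-precision values paired with a higher-precision scale per small block (the sketch referenced in the NVFP4 item above). The E2M1 magnitude grid used below is the standard FP4 value set, but the block handling and scale computation are simplifying assumptions, not the NVFP4 specification.

```python
import numpy as np

# Representable magnitudes of an E2M1 (FP4) value: 1 sign, 2 exponent, 1 mantissa bit.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(x, block_size=16):
    """Blockwise FP4-style quantization with one scale per block.

    Each block of `block_size` contiguous values is scaled so that its largest
    magnitude maps onto the top of the E2M1 grid; every element is then rounded
    to the nearest representable FP4 value (round-to-nearest, as on the forward
    pass). Assumes x.size is a multiple of block_size; a real microformat would
    additionally store the per-block scale in a narrow format such as E4M3.
    """
    blocks = x.reshape(-1, block_size)
    scale = np.max(np.abs(blocks), axis=1, keepdims=True) / E2M1_GRID[-1] + 1e-12
    y = blocks / scale
    # Nearest grid point in magnitude, sign restored afterwards.
    idx = np.abs(np.abs(y)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(y) * E2M1_GRID[idx]
    return (q * scale).reshape(x.shape)
```

On the backward pass, the nearest-point lookup would be replaced by stochastic selection between the two neighboring grid values, mirroring the rounding split described in the cited works.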
4. Implementation Practices and Theoretical Limits
In end-to-end FQT, key practices include:
- Quantizer placement: All heavy data-flow tensors (matrix multiply operands, normalization statistics, update accumulators) are quantized. First and last layers can, in principle, remain quantized, although for 1–2 bit settings some overhead is unavoidable (Chen et al., 2020, Yin et al., 2018).
- Dynamic or per-block scaling: Per-block or per-group scale parameters are necessary to contain quantization noise under heavy-tailed distributions, particularly in transformers and LLMs (Chmiel et al., 25 May 2025, Xi et al., 19 Mar 2024). Block sizes of 16–32 balance dynamic range with implementation overhead (Chmiel et al., 25 May 2025, Xi et al., 19 Mar 2024).
- Rounding strategies: Stochastic rounding is used for gradients in the backward and update passes to preserve unbiasedness; round-to-nearest is typically reserved for forward activations and weights, where bias is less detrimental (Chmiel et al., 25 May 2025). A schematic layer combining both rounding modes follows this list.
- Gradient quantization and variance control: To safely push bitwidth to the 1–4b regime, aggressive variance control is critical. Activation Gradient Pruning (AGP) selectively prunes low-range gradient groups and reallocates bitwidth, controlling total error. In practice, 1-bit FQT with SCQ (sample-channel joint quantization) and AGP can close the gap to direct PSQ quantization by 6 percentage points on typical fine-tuning tasks (Gao et al., 26 Aug 2024).
- Theoretical limits: A fundamental limit arises once the average per-coordinate gradient magnitude falls below the quantization noise level; below that threshold, quantized training no longer yields useful descent and training stalls (Chmiel et al., 25 May 2025).
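The rounding split in the practices above (round-to-nearest on the forward pass, stochastic rounding on gradients) can be illustrated with a small PyTorch autograd wrapper; this is a schematic simulation layer under an assumed per-tensor scaling scheme, not the implementation of any cited system.

```python
import torch

def _stochastic_round(x, scale):
    """Stochastic rounding of x/scale onto the integer grid (unbiased)."""
    y = x / scale
    low = torch.floor(y)
    return (low + (torch.rand_like(y) < (y - low)).float()) * scale

def _round_to_nearest(x, scale):
    """Round-to-nearest onto the same grid (used on the forward pass)."""
    return torch.round(x / scale) * scale

class QuantLinearFn(torch.autograd.Function):
    """Linear layer whose forward operands are quantized with round-to-nearest
    and whose backward gradients are quantized with stochastic rounding."""

    @staticmethod
    def forward(ctx, x, w, bits=8):
        qmax = 2 ** (bits - 1) - 1
        sx = x.abs().max() / qmax + 1e-12
        sw = w.abs().max() / qmax + 1e-12
        qx, qw = _round_to_nearest(x, sx), _round_to_nearest(w, sw)
        ctx.save_for_backward(qx, qw)
        ctx.bits = bits
        return qx @ qw.t()

    @staticmethod
    def backward(ctx, grad_out):
        qx, qw = ctx.saved_tensors
        qmax = 2 ** (ctx.bits - 1) - 1
        sg = grad_out.abs().max() / qmax + 1e-12
        qg = _stochastic_round(grad_out, sg)     # unbiased low-bit output gradient
        return qg @ qw, qg.t() @ qx, None        # grads w.r.t. x, w, and bits

# Usage: y = QuantLinearFn.apply(x, weight) for x of shape (N, in), weight (out, in).
```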
5. Empirical Results Across Architectures
FQT achieves highly competitive accuracy and dramatic resource savings in diverse settings:
- ImageNet ResNet-50: Per-tensor quantization (PTQ) of gradients at 8b suffers ~1% accuracy loss versus QAT; PSQ/BHQ recover QAT accuracy at 8b, and BHQ remains within 0.5% of QAT at 5b, while PTQ fails below 7b (Chen et al., 2020).
- Transformers for NMT and Language Modeling: FullyQT (8b) yields BLEU matching or exceeding FP32 on WMT tasks; memory and compute reduced by 4× versus float (Prato et al., 2019). Jetfire achieves 1.42× end-to-end transformer block speedup with 1.49× activation memory saving, matching FP16 accuracy with blockwise quantization (Xi et al., 19 Mar 2024).
- LLMs: NVFP4 (FP4, block size 16, E4M3 scale) enables full Llama2-7B pre-training in FP4 with BF16-matching performance, and QAF phase closes the last <1% gap (Chmiel et al., 25 May 2025). TetraJet-v2, via NVFP4 + OsciReset + OutControl, closes over half the gap to full-precision across 200B tokens for LLMs up to 370M (Chen et al., 31 Oct 2025).
- Tiny, Embedded, and On-Device Training: FQT on Cortex-M MCUs, with 8-bit quantization on all tensors, supports on-device transfer learning matching float32 within 1–2%; dynamic partial gradient updates yield up to 6.6× speedup (Deutel et al., 15 Jul 2024).
- Sub-8-Bit Integer Training: ShiftQuant (group-wise, GEMM-friendly integer quantization) with quantized L1 normalization incurs only small accuracy drops on 4-bit ResNets and about 0.6% on 6-bit Transformers, while improving throughput over FP16 on ARM CPUs (Guo et al., 17 Nov 2024).
6. Best Practices and Open Challenges
Key recommendations emerging from recent FQT literature include:
- Ensure quantized gradients are unbiased (use stochastic rounding wherever possible) (Chen et al., 2020, Chmiel et al., 25 May 2025, Gao et al., 26 Aug 2024).
- Monitor dynamic range and select the finest quantizer scale that maintains accuracy loss below 0.4–1%; for most vision/translation tasks, PSQ at 6–8b or BHQ at 5–6b is robust (Chen et al., 2020).
- Leverage per-block/group scaling and blockwise floating/microformats (e.g. FP4/E2M1 with E4M3 scale) in large transformer/LLM training (Chmiel et al., 25 May 2025, Chen et al., 31 Oct 2025, Guo et al., 17 Nov 2024).
- Deploy AGP or similar variance-reduction schemes in 1–4b settings, especially with adaptive (Adam-style) optimizers (Gao et al., 26 Aug 2024); a stripped-down sketch of the group-pruning step follows this list.
- Warm-start from a pre-trained full-precision teacher for extremely low bitwidths, and ensure that the initial parameters lie close, in an appropriate norm, to the quantized grid (Long et al., 2020, Yin et al., 2018).
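As a rough illustration of the group-pruning idea behind AGP (referenced in the list above), the sketch below zeroes out the gradient groups with the smallest dynamic range; the group size, quantile rule, and the omission of the bit-reallocation step are assumptions for illustration, not the algorithm of Gao et al. (2024).

```python
import numpy as np

def prune_low_range_groups(grad, group_size=64, prune_ratio=0.5):
    """Zero out the gradient groups with the smallest dynamic range.

    Groups whose range falls at or below the chosen quantile are pruned; the
    bit budget they would have consumed can then be reallocated to the
    surviving, higher-range groups before quantization (reallocation itself
    is omitted here). Assumes grad.size is a multiple of group_size.
    """
    groups = grad.reshape(-1, group_size)
    ranges = np.ptp(groups, axis=1)               # per-group dynamic range
    threshold = np.quantile(ranges, prune_ratio)
    keep = ranges > threshold
    pruned = np.where(keep[:, None], groups, 0.0)
    return pruned.reshape(grad.shape), keep
```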
Open areas include stable FQT-from-scratch for large LLMs at sub-4b, optimally choosing block/group partitions for microformat scaling, and efficient support for higher-order optimization (Adam, LAMB) in pure integer or microformat hardware. Better integration of dynamic quantization range tracking to minimize hardware and data-movement overhead (e.g., in-hindsight range estimation for fast static quantization (Fournarakis et al., 2021)) remains essential for highly resource-constrained settings.
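As an illustration of range tracking decoupled from the current step, the tracker below quantizes each tensor with a scale estimated from previous iterations and only then updates its running statistics; the exponential moving average and the interface are simplifying assumptions, not the estimator of Fournarakis et al. (2021).

```python
import numpy as np

class HindsightRangeTracker:
    """Quantize each step with a range estimated from previous iterations.

    Using a precomputed (static) scale avoids an extra pass over the tensor
    to find its min/max before quantizing; the running estimate is refreshed
    only after the current tensor has been quantized.
    """

    def __init__(self, num_bits=8, momentum=0.9):
        self.qmax = 2 ** (num_bits - 1) - 1
        self.momentum = momentum
        self.running_max = None

    def quantize(self, x):
        current_max = float(np.max(np.abs(x)))
        # Fall back to the current tensor's range on the very first call.
        est = current_max if self.running_max is None else self.running_max
        scale = est / self.qmax + 1e-12
        q = np.clip(np.round(x / scale), -self.qmax, self.qmax) * scale
        # Update the estimate for use on the next iteration.
        self.running_max = self.momentum * est + (1.0 - self.momentum) * current_max
        return q
```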
7. Limitations and Future Directions
While FQT now delivers near-FP32 accuracy for a wide range of architectures and tasks, certain accuracy limits exist for extremely low-bit (≤2b) training without variance-control and bias-mitigation. A continuing challenge is to develop universal, robust algorithms for full-from-scratch training at extreme quantization that guarantee stability in nonconvex regimes and across diverse data distributions. Further, adapting and customizing FQT for non-GEMM-dominated architectures, online continual learning, and privacy-preserving deployment on edge hardware remain compelling research frontiers (Schiemer et al., 2023, Deutel et al., 15 Jul 2024, Zhou et al., 18 Feb 2025).