Direct Quantized Training (DQT) Methods

Updated 10 February 2026
  • Direct Quantized Training (DQT) is a neural network training approach that constrains weights, activations, and optimizer states to low-bit representations throughout both forward and backward passes.
  • It employs methods such as stochastic rounding, discrete evolution strategies, and adaptive per-layer bit-width assignment to eliminate high-precision storage and optimize resource usage.
  • Empirical results across domains—from image classification to on-device microcontroller training—demonstrate that DQT achieves significant memory reduction and near-parity in accuracy compared to full-precision training.

Direct Quantized Training (DQT) refers to a family of neural network training methodologies in which model weights, activations, and optionally optimizer states are maintained and updated entirely in low-bit-width, quantized representations throughout all phases of optimization. In contrast to standard post-training quantization or quantization-aware training techniques, which rely on high-precision “shadow” copies and straight-through estimators (STE), DQT eliminates high-precision weight storage, enabling memory- and compute-efficient training on severely resource-constrained hardware and at scale. DQT methods span stochastic rounding, evolution-based zeroth-order optimization, theoretical analyses under convex and non-convex settings, adaptive per-layer bit-width assignment, and hardware-specific full quantization approaches.

1. Formal Definition and Core Principles

DQT constrains all neural network weight tensors (and, depending on the variant, activations and gradients) to a low-bit lattice—typically binary, ternary, 4-bit, or 8-bit—during both forward and backward passes. Quantization operators $Q(\cdot)$ map a real-valued parameter vector to the nearest or a stochastically sampled quantized grid point. The canonical deterministic quantizer is defined by

$$Q_d(w) = \operatorname{sign}(w)\,\Delta\,\Big\lfloor \frac{|w|}{\Delta} + \frac{1}{2} \Big\rfloor,$$

whereas the unbiased stochastic version is

$$Q_s(w) = \Delta \begin{cases} \lfloor z \rfloor, & \text{w.p. } 1-(z-\lfloor z \rfloor), \\ \lfloor z \rfloor + 1, & \text{w.p. } z-\lfloor z \rfloor, \end{cases} \quad \text{where } z = \frac{w}{\Delta}.$$

All parameter updates and inference calculations are performed using only the quantized versions (Li et al., 2017), and no high-precision master copy is maintained, except for minor buffers (momentum, variance accumulators) in some realizations.
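
For concreteness, the two quantizers can be sketched in a few lines of NumPy; the uniform grid with step `delta` follows the definitions above, while the function names are illustrative:

```python
import numpy as np

def quantize_deterministic(w, delta):
    """Q_d: snap each parameter to the nearest multiple of the step delta."""
    return np.sign(w) * delta * np.floor(np.abs(w) / delta + 0.5)

def quantize_stochastic(w, delta, rng=None):
    """Q_s: unbiased stochastic rounding to one of the two nearest grid points."""
    rng = np.random.default_rng() if rng is None else rng
    z = w / delta
    lower = np.floor(z)
    prob_up = z - lower                      # fractional part = P(round up)
    round_up = rng.random(w.shape) < prob_up
    return delta * (lower + round_up)        # E[Q_s(w)] = w, i.e., unbiased
```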

The architectural and algorithmic advantages include significant memory and bandwidth reduction (by factors proportional to $32/k$ for $k$-bit quantization), quantization of both forward and backward passes (enabling “fully quantized training”), and the removal of representational mismatch between training and deployment hardware (Zhao et al., 2024, Deutel et al., 2024).

2. Algorithmic Realizations and Variations

2.1 Stochastic-Rounding DQT

Stochastic rounding allows quantized weights to be updated directly while preserving convergence properties. The stochastic quantizer $\mathrm{SR}(x)$ assigns a real-valued parameter to one of its two nearest grid points, with probabilities given by the parameter's relative proximity to each point, which makes the rounding unbiased in expectation. Training proceeds as follows: the gradient is computed with respect to the quantized weights, a floating-point optimizer update (e.g., AdamW) is applied transiently, and the result is immediately stochastically quantized back to the low-bit grid. No high-precision “master copy” is stored at any point (Zhao et al., 2024).
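
A minimal sketch of one such training step follows; for brevity it uses plain SGD in place of the AdamW update described in the cited work, and the function name and scalar step size are illustrative:

```python
import numpy as np

def dqt_step(q_weight, grad, lr, delta, rng=None):
    """One Direct Quantized Training step with stochastic rounding.

    q_weight: tensor already lying on the k-bit grid (multiples of delta)
    grad:     gradient computed with respect to the quantized weights
    """
    rng = np.random.default_rng() if rng is None else rng
    # The floating-point update exists only transiently inside this function;
    # no high-precision master copy is retained between steps.
    updated = q_weight - lr * grad
    z = updated / delta
    lower = np.floor(z)
    round_up = rng.random(z.shape) < (z - lower)
    return delta * (lower + round_up)        # projected back onto the grid
```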

2.2 Direct Optimization in Discrete Space

DQT variants such as Quantized Evolution Strategies (QES) operate entirely in the discrete parameter space, eschewing backpropagated gradients in favor of unbiased zeroth-order estimates obtained from random perturbations, combined with accumulated error feedback. These methods rely on a memory-efficient buffer (e.g., “seed replay”) to reconstruct error-feedback terms without explicit high-precision storage (Xu et al., 3 Feb 2026). The weight update takes the form

$$W_{t+1} = W_t + \mathrm{Round}\left[\alpha \hat{g}_t + \gamma e_{t-1}\right],$$

where $e_t$ accumulates the unexpressed portion of previous updates; carrying this residual forward lets the discrete iterates approximate a high-precision optimization trajectory over time.
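
A sketch of this error-feedback update in NumPy is given below; the residual bookkeeping and variable names are assumptions made for illustration, not the exact QES implementation:

```python
import numpy as np

def discrete_update_with_error_feedback(W, g_hat, err, alpha, gamma):
    """Apply W_{t+1} = W_t + Round(alpha * g_hat + gamma * e_{t-1}) and
    carry the unexpressed remainder forward as the next error term."""
    proposal = alpha * g_hat + gamma * err   # real-valued step proposal
    step = np.round(proposal)                # integer step actually taken
    new_err = proposal - step                # portion of the update not expressed
    return (W + step).astype(W.dtype), new_err
```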

2.3 Per-Layer Adaptivity and Mixed-Precision

Adaptive Precision Training (AdaPT) assigns bit-widths per layer, either lowering or increasing precision based on information-theoretic KL-divergence tests and gradient diversity heuristics, to avoid information loss and vanishing gradients. Loss functions are augmented with regularizers penalizing wide word lengths and promoting sparsity (Kummer et al., 2021).
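
An illustrative version of such a per-layer precision test is sketched below; the histogram binning, KL threshold, and scale rule are assumptions for illustration rather than the exact AdaPT procedure:

```python
import numpy as np

def adjust_layer_bits(weights, bits, kl_threshold=0.05, min_bits=2, max_bits=8,
                      num_bins=64, eps=1e-12):
    """Compare a layer's weight distribution before and after b-bit uniform
    quantization; widen the layer if too much information is lost,
    otherwise try to reclaim a bit."""
    delta = (np.abs(weights).max() + eps) / (2 ** (bits - 1) - 1)   # assumed scale rule
    q_weights = np.sign(weights) * delta * np.floor(np.abs(weights) / delta + 0.5)
    lo, hi = weights.min(), weights.max()
    p, _ = np.histogram(weights, bins=num_bins, range=(lo, hi))
    q, _ = np.histogram(q_weights, bins=num_bins, range=(lo, hi))
    p = np.clip(p / p.sum(), eps, None)
    q = np.clip(q / q.sum(), eps, None)
    kl = float(np.sum(p * np.log(p / q)))    # KL(original || quantized)
    return min(bits + 1, max_bits) if kl > kl_threshold else max(bits - 1, min_bits)
```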

2.4 Channel-Wise and Sectional Quantization

Direct updates of quantized parameters can be extended to per-channel or per-section quantization levels, as in DCQ (Divide and Conquer Quantization), which independently distills and quantizes network sections using intermediate feature maps as regression targets, improving trainability and representational quality in the quantized domain (Elthakeb et al., 2019, Hoang et al., 2020).
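
A schematic of the divide-and-conquer fitting loop is sketched below; the section interfaces and the `fit_section` helper are hypothetical, and feeding each student section the teacher's features (so sections can be fit independently) is an assumption made for illustration:

```python
def dcq_distill(student_sections, teacher_sections, x_batch, fit_section):
    """Quantize and fit one network section at a time, regressing each
    quantized student section onto the full-precision teacher's
    intermediate feature maps."""
    h_in = x_batch
    for s_sec, t_sec in zip(student_sections, teacher_sections):
        h_out = t_sec(h_in)                  # teacher's output: regression target
        fit_section(s_sec, h_in, h_out)      # e.g., minimize ||s_sec(h_in) - h_out||^2
        h_in = h_out                         # sections decoupled via teacher features
    return student_sections
```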

2.5 Distributed and On-Device Fully Quantized Training

QSDP (Quantized Sharded Data-Parallel SGD) provides theoretical convergence guarantees for distributed DQT by applying unbiased “coin-flip” or lattice-based quantizers for both weights and gradients, operating directly on the quantized lattice during sharded multi-node training (Markov et al., 2023). For hardware-constrained devices, DQT is realized through fixed-point integer representation and quantized error propagation, often with dynamic partial gradient updates to further minimize computation and memory, as demonstrated on Cortex-M MCUs (Deutel et al., 2024).
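
A minimal NumPy sketch of the integer-only arithmetic pattern such devices rely on—int8 operands, int32 accumulation, then requantization—is shown below for a single linear layer; the floating-point rescale is a simplification, as MCU kernels typically implement it as a fixed-point multiply and shift:

```python
import numpy as np

def int8_linear(x_q, w_q, x_scale, w_scale, out_scale):
    """Integer-only linear layer: int8 inputs/weights, int32 accumulation,
    requantized back to int8 on the output grid."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)      # exact int32 accumulate
    rescaled = acc * (x_scale * w_scale / out_scale)        # map to output scale
    return np.clip(np.round(rescaled), -128, 127).astype(np.int8)
```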

3. Theoretical Guarantees and Convergence Behavior

Under classical convexity and smoothness assumptions, stochastic-rounding DQT with decaying stepsizes achieves error floors proportional to the quantization step,

$$\mathbb{E}[F(\bar w^T) - F(w^*)] \leq O\left(\frac{\log T}{T}\right) + O(\sqrt{d}\,\Delta),$$

where $\Delta$ is the quantization step and $d$ the parameter dimension (Li et al., 2017). In nonconvex settings, the lack of high-precision “greedy search” causes purely quantized methods to stagnate; theoretical analysis reveals that, as the learning rate decreases, the stationary distribution of parameters remains dispersed over multiple basins, without concentrating on a local minimum (Li et al., 2017). Remedies involve hybrid accumulators (e.g., BinaryConnect), large-batch updates to reduce gradient variance, or adaptive step-size schedules.

Distributed DQT achieves linear convergence (under a Polyak–Łojasiewicz condition) to the optimal lattice point, with error bounded in terms of the quantizer resolution and unbiasedness (Markov et al., 2023).

4. Quantization Operators, Bit-Width Assignment, and Range Estimation

A variety of quantization maps are used in DQT:

  • Uniform fixed-point quantization: Parameters are mapped to the closest of $2^b$ levels over a specified range, with optional two’s-complement representation and stochastic or deterministic rounding.
  • Learned quantization basis: Weights and activations can be quantized using learnable basis vectors and binary or one-hot encodings, facilitating gradient-based updates even in the quantized domain (Hoang et al., 2020).
  • In-hindsight range estimation: Instead of computing activation/gradient ranges dynamically at every step (which is memory-traffic-intensive), DQT can use exponential moving averages from the previous iteration, supporting efficient on-chip range updates for fully quantized inference and training (Fournarakis et al., 2021).

Adaptive precision can be assigned using information-theoretic tests (e.g., KL-divergence between quantized and original distribution) and gradient-diversity monitoring to automatically adjust layer bit-widths during training (Kummer et al., 2021).
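
As an illustration of the in-hindsight scheme from the list above, the sketch below quantizes the current tensor with a range frozen from earlier iterations and only then refreshes the running estimate; the momentum value and the absolute-max statistic are assumptions for illustration:

```python
import numpy as np

class HindsightRangeQuantizer:
    """Quantize with a range precomputed from previous steps, then update an
    exponential moving average of the observed range for the next step."""
    def __init__(self, num_bits=8, momentum=0.9):
        self.num_bits = num_bits
        self.momentum = momentum
        self.running_max = None

    def __call__(self, x):
        observed = np.abs(x).max()
        if self.running_max is None:              # bootstrap on the first call
            self.running_max = observed
        scale = self.running_max / (2 ** (self.num_bits - 1) - 1)
        q = np.clip(np.round(x / scale), -(2 ** (self.num_bits - 1)),
                    2 ** (self.num_bits - 1) - 1)
        # The range statistic is refreshed only after quantizing, so after the
        # first call each step's grid depends only on earlier iterations.
        self.running_max = (self.momentum * self.running_max
                            + (1 - self.momentum) * observed)
        return (q * scale).astype(x.dtype)        # simulated-quantized output
```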

5. Empirical Performance and Application Domains

DQT methods have been validated on a variety of architectures and tasks:

  • Image Classification: AlexNet, ResNet-18/34, MobileNetV2 trained with DQT at 2/2 and 3/3 weight/activation bits on CIFAR-100 and ImageNet match or surpass prior low-bit quantization schemes, reducing the accuracy gap to floating-point models to within 0.5–3% (Hoang et al., 2020).
  • LLMs: Direct quantized training (ternary or 8-bit) for LLaMA-style models achieves comparable loss and perplexity to STE-based QAT (BitNet) and only a modest degradation relative to FP32 baselines, with 30–40% memory savings (Zhao et al., 2024).
  • Quantized Evolution Strategies: On arithmetic reasoning and LLM fine-tuning, QES achieves 3–8x improvement in accuracy compared to previous quantized zeroth-order methods under INT4/INT8 constraints, without incurring the memory cost of high-precision residuals (Xu et al., 3 Feb 2026).
  • On-Device Microcontroller Training: Integer-only DQT on Cortex-M MCUs supports stable on-device learning within stringent RAM (256 KB) and energy budgets, with only 2–3% absolute accuracy loss compared to FP32 (Deutel et al., 2024).

Key empirical insights are summarized in the table below.

| Domain / Method | Bit-widths | Memory Reduction | Accuracy Degradation | Notable Features |
|---|---|---|---|---|
| LLaMA (DQT) (Zhao et al., 2024) | 1.58b, 8b | 30–40% | <5% (8b), ~15% (1.58b) | No STE; supports ternary inference |
| ImageNet: AlexNet/ResNet (Hoang et al., 2020) | 1–3 bits (W/A) | 85–95% | 0–3% | Learned quantization basis; per-channel fitting |
| AdaPT (Kummer et al., 2021) | Adaptive (2–8b) | 45% model size | Up to +1.4% delta | Layerwise dynamic adjustment; KL-based |
| QSDP (GPT, FSDP) (Markov et al., 2023) | 5–8b (W/G) | End-to-end communication compression | <0.2 ppl (perplexity) | Proof of linear convergence in quantized space |
| Cortex-M DNNs (Deutel et al., 2024) | 8b (uint/int) | 30–50% RAM | 2–3% (small), 5–8% (complex) | Full forward/backward quantization |

6. Practical Considerations, Hardware Mapping, and Limitations

DQT’s efficiency and feasibility are highly dependent on hardware support for fixed-point MAC primitives, as well as software infrastructure for quantization-aware pipelines. Empirical reports note substantial DRAM traffic and compute reduction with in-hindsight static quantization (Fournarakis et al., 2021) and sub-100 ms per-sample training on microcontrollers (Deutel et al., 2024). Challenges include tuning quantization ranges (momentum schedules, outlier adaptation), retaining optimizer states in floating point due to their dynamic range (Zhao et al., 2024), and increased convergence difficulty at extremely low bit-widths or without large-batch updates (Li et al., 2017).

Directly quantized distributed training enables scaling to billion-parameter models by removing bandwidth bottlenecks at no accuracy cost, contingent on unbiased quantizers and moderate bit-widths (Markov et al., 2023). On the other hand, in on-device learning, mixed-precision schemes (e.g., quantized convolutional layers with full-precision heads) help recover most lost accuracy and enable on-chip adaptation in ultra-constrained settings (Deutel et al., 2024).

7. Impact, Extensions, and Future Directions

DQT eliminates the need for full-precision refinement and reduces model storage and training cost, enabling powerful models on edge devices and practical billion-scale training on conventional hardware. Ongoing research explores non-uniform and binary quantization, dynamic bit-width schedules, further reduced-precision optimizers, more robust theoretical guarantees in the non-convex regime, and parallelized replay or error-feedback schemes for further acceleration (Xu et al., 3 Feb 2026, Zhao et al., 2024). Hardware co-design remains an active area to unlock the full benefits of DQT pipelines (Fournarakis et al., 2021, Deutel et al., 2024).

Direct Quantized Training has unified multiple research directions in compressed learning, from hardware-accelerated on-device adaptation to efficient distributed optimization of LLMs, with a growing body of theory and empirical evidence supporting its applicability and robustness across diverse regimes.
