
QDrop: Low-Bit Post-Training Quantization

Updated 17 March 2026
  • QDrop is a post-training quantization technique that improves the accuracy and robustness of neural networks quantized to 2 bits by mitigating activation and weight noise interactions.
  • It introduces random dropping of activation quantization during block-wise reconstruction, fostering flatter loss landscapes and enhanced generalization from calibration to test data.
  • QDrop achieves state-of-the-art results in image classification, object detection, and NLP without necessitating retraining or large calibration sets.

QDrop is a post-training quantization (PTQ) technique designed to improve the accuracy and robustness of neural networks quantized to extremely low precision, specifically targeting cases where both weights and activations are quantized to as low as 2 bits. QDrop addresses the key failure mode in conventional PTQ methods where the interaction between activation noise and weight noise is not properly taken into account, leading to substantial accuracy degradation under aggressive quantization. By introducing random dropping of activation quantization during block-wise PTQ reconstruction, QDrop enables optimization for a flatter and more robust quantized solution landscape, generalizing better from calibration to test data and achieving state-of-the-art results in image classification, object detection, and natural language processing without requiring retraining or large calibration sets (Wei et al., 2022).

1. Problem Setting and Limitations of Conventional PTQ

Post-training quantization seeks to convert a pre-trained full-precision (FP32) network $f(x;W)$ into a quantized low-bit model $(\hat W, \hat a)$, using a small calibration set $\mathcal D_c$, without the extensive cost of end-to-end retraining. Activations $a$ and weights $W$ are quantized using a uniform quantizer

$$\hat a = \left\lfloor \frac{a}{s} \right\rceil \cdot s,$$

where $s$ is a learnable step size. Standard PTQ approaches (e.g., AdaRound, BRECQ) model weight rounding as quantization noise and perform sequential block reconstructions, treating activations as fixed until final scale setting.
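The uniform quantizer can be sketched in a few lines of Python. The bit-width and step size below are illustrative; in real PTQ pipelines (AdaRound, BRECQ) the step size $s$ is learned on the calibration set.

```python
def uniform_quantize(x, s, n_bits=2, signed=False):
    """Fake-quantize x: round to the nearest step, clip to the n-bit
    integer grid, then rescale back (a_hat = clip(round(x / s)) * s)."""
    if signed:
        lo, hi = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    else:
        lo, hi = 0, 2 ** n_bits - 1
    q = round(x / s)           # round-to-nearest (banker's rounding in Python)
    q = max(lo, min(hi, q))    # clip to the representable integer range
    return q * s
```

With 2-bit unsigned levels $\{0,1,2,3\}$ and $s=0.5$, any input above 1.5 saturates at 1.5, which is why step-size selection matters so much at low bitwidths.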

At very low bitwidths (e.g., 2–3 bits for activations), the quantization noise from activations is no longer negligible and interacts non-trivially with weight quantization. Neglecting this interaction during calibration constrains all block reconstructions to solutions tailored to the same (uninformed) activation configuration, leading to sub-optimality and convergence failures—particularly acute in lightweight architectures and under aggressive quantization.

2. Theoretical Framework and Flatness Criterion

QDrop’s core theoretical advance is to quantify the interaction between weight and activation noise within the PTQ calibration objective, connecting the problem to the flatness of the loss landscape.

Let $L(f(x;W), y)$ denote the loss and $\hat W = W + \Delta$ the quantized weights. Activation quantization is modeled as multiplicative noise:

$$\hat a = a(1+u), \quad u \in \mathbb{R}.$$

The PTQ reconstruction minimizes:

$$\min_{\hat W} \; E_{(x,y)\sim\mathcal D_c} \Big[ L(f(x;W+\Delta), y; 1+u(x)) - L(f(x;W), y; 1) \Big].$$

A key result (Lemma 3.1) shows activation noise can be equivalently recast as a weight perturbation:

$$E[L(\hat W, 1+u) - L(W, 1)] \approx E[L(\hat W \odot (1+v), 1) - L(W, 1)]$$

for $v$ derived from $u$.
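For a single linear layer, the recasting in Lemma 3.1 is exact: scaling input element $a_j$ by $(1+u_j)$ produces the same output as scaling column $j$ of $W$ by the same factor. A toy numeric check (all values are illustrative, not from the paper):

```python
import math

# Toy 2x2 weight matrix, input vector, and multiplicative activation noise.
W = [[1.0, 2.0], [3.0, 4.0]]
a = [0.5, -1.0]
u = [0.1, -0.2]

# Output with noisy activations: W @ (a * (1 + u))
out_noisy_act = [sum(W[i][j] * (a[j] * (1 + u[j])) for j in range(2))
                 for i in range(2)]
# Output with column-perturbed weights: (W * (1 + v)) @ a, where v_j = u_j
out_noisy_w = [sum((W[i][j] * (1 + u[j])) * a[j] for j in range(2))
               for i in range(2)]

match = all(math.isclose(x, y) for x, y in zip(out_noisy_act, out_noisy_w))
```

For nonlinear networks the equivalence holds only approximately, which is why the lemma states it with $\approx$.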

The PTQ loss then decomposes (Theorem 3.1):

$$E[L(\hat W, 1+u) - L(W, 1)] \approx E[L(\hat W, 1) - L(W, 1)] + E[L(\hat W \odot (1+v), 1) - L(\hat W, 1)].$$

The first term is the classical weight quantization loss; the second term measures the "flatness" of the quantized solution with respect to weight perturbation, following the definition of flatness as

$$\text{Flatness}(\hat W) = E_{v \sim \mathcal D} \left[ L(f_{\hat W \odot (1+v)}) - L(f_{\hat W}) \right].$$
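This flatness term can be estimated by Monte-Carlo sampling of the multiplicative noise $v$. The Gaussian noise model, toy quadratic losses, and magnitudes below are assumptions for illustration only:

```python
import random

def flatness(loss_fn, w_hat, noise_std=0.05, n_samples=200, seed=0):
    """Estimate E_v[L(w * (1 + v)) - L(w)] with i.i.d. Gaussian v."""
    rng = random.Random(seed)
    base = loss_fn(w_hat)
    total = 0.0
    for _ in range(n_samples):
        perturbed = [wi * (1.0 + rng.gauss(0.0, noise_std)) for wi in w_hat]
        total += loss_fn(perturbed) - base
    return total / n_samples

# sharp_loss has 100x the curvature of flat_loss, so the same
# perturbations cost 100x more under this estimator.
sharp_loss = lambda w: sum(100.0 * wi * wi for wi in w)
flat_loss = lambda w: sum(wi * wi for wi in w)
```

With identical noise draws (same seed), the sharp loss's flatness estimate is exactly 100 times the flat one, illustrating why minimizing the second term of the decomposition pushes the solution toward low-curvature regions.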

Minimizing both terms simultaneously yields quantized models whose loss landscapes are flatter under the true noisier test-time conditions, enhancing generalization and accuracy at low bitwidth.

3. QDrop Algorithmic Design

QDrop operationalizes the above theory by randomly dropping activation quantization during each forward pass in block-wise PTQ reconstruction, thereby introducing stochastic diversity in the calibration process.

For a block $k$ and drop probability $p$:

  1. Forward pass:
    • For each element of the block input, with probability $p$ replace the quantized activation $\hat h^{i-1}$ with its full-precision counterpart $h^{i-1}$, yielding the mixed input $\tilde h^{i-1}$.
    • For each layer $\ell = i, \dots, j$:
      • Compute $h^\ell = f_\ell(\tilde h^{\ell-1})$ on the mixed input.
      • Quantize: $\hat h^\ell = \mathrm{Quantize}(h^\ell)$.
      • With probability $p$, keep $\tilde h^\ell = h^\ell$; otherwise set $\tilde h^\ell = \hat h^\ell$.
  2. Backward pass:
    • Compute the reconstruction error $\Delta^j = \tilde h^j - h^j$.
    • Update the weight rounding parameters by gradient descent on the mean-squared reconstruction error.

The stochastic mask is

$$u = \begin{cases} 0, & \text{with prob. } p \\ \hat a / a - 1, & \text{with prob. } 1 - p. \end{cases}$$

This mechanism cultivates a broader set of perturbation directions, enabling flatter and more generalizable minima across calibration and test data.
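The forward-pass steps above reduce to a Bernoulli mix of full-precision and quantized activations. A minimal sketch, where the rounding quantizer is a stand-in for the learned step-size quantizer:

```python
import random

def qdrop_mix(h, quantize, p=0.5, rng=None):
    """With probability p keep the full-precision activation (drop its
    quantization); otherwise substitute the quantized value."""
    rng = rng or random.Random(0)
    return [hi if rng.random() < p else quantize(hi) for hi in h]
```

Setting $p=1$ keeps every activation full precision during calibration, while $p=0$ always quantizes them; QDrop's intermediate setting is what generates the diverse perturbation directions.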

4. Hyperparameter Configuration and Calibration Protocol

Empirical studies sweep the drop probability $p$ over $\{0, 0.25, 0.5, 0.75, 1\}$, identifying $p=0.5$ as optimal across diverse settings. Recommended calibration sets are: ImageNet (1,024 images), COCO detection (256 images), and NLP (GLUE/SQuAD, 1,024 examples). Each block is reconstructed for 20,000 iterations using learning rates compatible with BRECQ (weight-tuning LR $1\times10^{-3}$, activation scale LR $4\times10^{-5}$, batch size 32).

Annealing or per-model tuning of $p$ is permissible, though $p=0.5$ is robust for both convolutional and Transformer models. First and last layers are typically retained at 8 bits to preserve representational capacity.
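The protocol can be collected into a single configuration; the dictionary layout and key names below are an illustrative assumption, while the values come from the text:

```python
# Hypothetical config layout; values follow the calibration protocol above.
QDROP_CONFIG = {
    "drop_prob": 0.5,            # p, swept over {0, 0.25, 0.5, 0.75, 1}
    "iters_per_block": 20_000,
    "weight_lr": 1e-3,           # weight-rounding tuning LR (BRECQ-compatible)
    "act_scale_lr": 4e-5,        # activation step-size LR
    "batch_size": 32,
    "calib_set_size": {"imagenet": 1024, "coco": 256, "glue_squad": 1024},
    "first_last_bits": 8,        # keep first/last layers at higher precision
}
```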

5. Experimental Results Across Domains

QDrop demonstrates substantial improvements in top-1 accuracy and mean average precision for extremely low-bit quantized models across computer vision and natural language tasks. Key outcomes include:

| Task/Model | Bitwidth | Baseline | QDrop | Improvement |
|---|---|---|---|---|
| ResNet-50 (ImageNet) | W2A2 | 29.01 % | 54.74 % | +25.73 % |
| RegNet-3.2G (ImageNet) | W2A2 | 3.62 % | 52.36 % | +48.74 % |
| MobileNetV2 (ImageNet) | W3A3 | 23.41 % | 57.98 % | +34.57 % |
| RetinaNet/MobileNetV2 (COCO) | W2A4 | 19.35 mAP | 25.04 mAP | +5.69 mAP |
| QNLI (GLUE) | E8W4A4 | n/a | n/a | +8.7 % (vs. no-drop) |
| Cross-domain: CIFAR→ImageNet | W4A4 | 46.83 % | 52.88 % | +6.05 % |

Across all cases, QDrop consistently outperforms both AdaRound and BRECQ, with particularly large gains at W2A2 and other low-bit settings. On lightweight models the relative gains are even larger, and similarity between calibration and test distributions becomes less critical.

6. Insights from Ablation Studies and Landscape Analysis

Ablation studies validate the importance of intermittent activation quantization:

  • Purely weight-focused calibration (Case 1) fails at low bits.
  • Always quantizing activations during calibration (Case 2) yields calibration convergence but poor generalization due to mismatch with actual test conditions.
  • QDrop (mixed, stochastic activation dropping) achieves optimal balance, minimizing overfitting and maximizing test accuracy.

Hessian spectrum analysis reveals that QDrop leads to quantized solutions with the smallest leading eigenvalues and trace (e.g., $\lambda_1$, $\lambda_5$, $\mathrm{Tr}(H)$), consistent with the theoretical "flatness" criterion and with improved generalization.

The optimal drop probability is found at $p=0.5$; both smaller and larger values under-explore the perturbation space, degrading calibration effectiveness.

7. Deployment and Integration Recommendations

QDrop functions as a plug-in wrapper compatible with any block-reconstruction PTQ pipeline (e.g., AdaRound or BRECQ) and has open-source implementations in MQBench (https://github.com/ModelTC/MQBench) and at https://github.com/wimh966/QDrop. The additional computational overhead is minimal, involving only Bernoulli-masked activations during forward passes.

Default recommendations:

  • Use drop probability $p=0.5$.
  • Retain per-channel weight quantization, LSQ for activation scale, and 20K iterations per block.
  • Keep first/last layers at higher bitwidth where needed.
  • Calibration can be performed even with cross-domain data, enhancing practical deployability.

Through theory and empirical investigation, QDrop's random dropping of activation quantization is shown to foster flatter, more robust quantized neural solutions, enabling efficient and accurate sub-4-bit PTQ for real-world deep learning tasks, achieving, for the first time, practical 2-bit activation PTQ across both computer vision and natural language processing (Wei et al., 2022).
