QDrop: Low-Bit Post-Training Quantization
- QDrop is a post-training quantization technique that improves the accuracy and robustness of neural networks quantized to 2 bits by mitigating activation and weight noise interactions.
- It introduces random dropping of activation quantization during block-wise reconstruction, fostering flatter loss landscapes and enhanced generalization from calibration to test data.
- QDrop achieves state-of-the-art results in image classification, object detection, and NLP without necessitating retraining or large calibration sets.
QDrop is a post-training quantization (PTQ) technique designed to improve the accuracy and robustness of neural networks quantized to extremely low precision, specifically targeting cases where both weights and activations are quantized to as low as 2 bits. QDrop addresses a key failure mode of conventional PTQ methods: the interaction between activation noise and weight noise is not properly taken into account, leading to substantial accuracy degradation under aggressive quantization. By introducing random dropping of activation quantization during block-wise PTQ reconstruction, QDrop enables optimization toward a flatter and more robust quantized solution landscape, generalizing better from calibration to test data and achieving state-of-the-art results in image classification, object detection, and natural language processing without requiring retraining or large calibration sets (Wei et al., 2022).
1. Problem Setting and Limitations of Conventional PTQ
Post-training quantization seeks to convert a pre-trained full-precision (FP32) network into a quantized low-bit model using only a small calibration set, without the extensive cost of end-to-end retraining. Activations and weights are quantized with a uniform quantizer, $\hat{x} = s \cdot \mathrm{clip}\big(\lfloor x / s \rceil,\, q_{\min},\, q_{\max}\big)$, where $s$ is a learnable step size and $[q_{\min}, q_{\max}]$ is the integer range determined by the bitwidth. Standard PTQ approaches (e.g., AdaRound, BRECQ) model weight rounding as quantization noise and perform sequential block reconstructions, treating activations as fixed until final scale setting.
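The uniform quantizer can be sketched in a few lines of plain Python. This is an illustrative stand-in: the function name and list-based interface are assumptions, and real implementations operate on tensors and learn the step size by gradient descent rather than fixing it.

```python
def uniform_quantize(x, step, n_bits=2, signed=False):
    """Uniform quantizer: x_hat = step * clip(round(x / step), qmin, qmax),
    where `step` plays the role of the learnable step size s (fixed here)."""
    if signed:
        qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** n_bits - 1
    out = []
    for v in x:
        q = round(v / step)          # snap to the nearest integer grid point
        q = max(qmin, min(qmax, q))  # clip to the n-bit integer range
        out.append(step * q)         # de-quantize back to the real axis
    return out
```

With 2 unsigned bits and step 0.5, the representable values are {0, 0.5, 1.0, 1.5}; anything larger saturates at 1.5, which is exactly the clipping error that step-size learning trades off against rounding error.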
At very low bitwidths (e.g., 2–3 bits for activations), the quantization noise from activations is no longer negligible and interacts non-trivially with weight quantization. Neglecting this interaction during calibration constrains all block reconstructions to solutions tailored to the same (uninformed) activation configuration, leading to sub-optimality and convergence failures—particularly acute in lightweight architectures and under aggressive quantization.
2. Theoretical Framework and Flatness Criterion
QDrop’s core theoretical advance is to quantify the interaction between weight and activation noise within the PTQ calibration objective, connecting the problem to the flatness of the loss landscape.
Let $\mathcal{L}$ denote the task loss and $\hat{w}$ the quantized weights. Activation quantization is modeled as multiplicative noise on the full-precision activation: $\hat{x} = x(1+u)$. The PTQ reconstruction minimizes the expected loss $\mathbb{E}_{u}\,[\mathcal{L}(\hat{w}, \hat{x})]$ over the calibration set. A key result (Lemma 3.1) shows activation noise can be equivalently recast as a weight perturbation: $\hat{w}\,\hat{x} = \big(\hat{w}(1+u)\big)\,x$, i.e., the quantized network with noisy activations behaves like a network with perturbed weights $\hat{w}(1+u)$ acting on full-precision activations, with the perturbation derived from $u$.
The PTQ loss then decomposes (Theorem 3.1) as $\mathbb{E}_{u}\,[\mathcal{L}(\hat{w}(1+u))] = \mathcal{L}(\hat{w}) + \mathbb{E}_{u}\,[\mathcal{L}(\hat{w}(1+u)) - \mathcal{L}(\hat{w})]$. The first term is the classical weight quantization loss; the second term measures the “flatness” of the quantized solution with respect to weight perturbation, following the definition of flatness as the expected loss increase $\mathcal{L}(\hat{w}+v) - \mathcal{L}(\hat{w})$ under a perturbation $v$.
Minimizing both terms simultaneously yields quantized models whose loss landscapes are flatter under the true noisier test-time conditions, enhancing generalization and accuracy at low bitwidth.
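The flatness term can be estimated numerically by Monte-Carlo sampling of the multiplicative noise. The sketch below uses toy quadratic losses as stand-ins for a network loss; all names and the noise scale are illustrative assumptions.

```python
import random

def flatness(loss, w, noise_scale=0.5, n_samples=1000, seed=0):
    """Monte-Carlo estimate of E_u[L(w*(1+u)) - L(w)] for elementwise
    multiplicative noise u ~ Uniform(-noise_scale, noise_scale), i.e. the
    second (flatness) term of the loss decomposition."""
    rng = random.Random(seed)
    base = loss(w)
    total = 0.0
    for _ in range(n_samples):
        perturbed = [wi * (1.0 + rng.uniform(-noise_scale, noise_scale)) for wi in w]
        total += loss(perturbed) - base
    return total / n_samples

# Two toy minima with the same optimum but different curvature:
sharp = lambda w: sum(10.0 * wi * wi for wi in w)  # sharp minimum
flat = lambda w: sum(0.1 * wi * wi for wi in w)    # flat minimum
```

The sharper loss incurs a far larger expected penalty under the same noise, which is why a flatter quantized solution survives activation noise better at test time.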
3. QDrop Algorithmic Design
QDrop operationalizes the above theory by randomly dropping activation quantization during each forward pass in block-wise PTQ reconstruction, thereby introducing stochastic diversity in the calibration process.
For a block $B$ with drop probability $p$:
- Forward pass:
- For each element of the block input, with probability $p$, replace the quantized activation $Q(x)$ with the full-precision $x$, yielding the mixed input $\tilde{x}$.
- For each layer $l$ in the block:
- Compute the layer output $y_l$ with the mixed input.
- Quantize: $\hat{y}_l = Q(y_l)$.
- With probability $p$, keep the full-precision $y_l$; else use $\hat{y}_l$.
- Backward pass:
- Compute the reconstruction error $\lVert \hat{z} - z \rVert^2$ between the block output under mixed quantization and the full-precision block output $z$.
- Update the rounding parameters of the weights via gradients of this mean-squared error.
The stochastic mask is $\tilde{x} = m \odot x + (1 - m) \odot Q(x)$, with $m_i \sim \mathrm{Bernoulli}(p)$ drawn independently for each element and each forward pass. This mechanism cultivates a broader set of perturbation directions, enabling flatter and more generalizable minima across calibration and test data.
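The per-element mixing above can be sketched as follows. `quantize` is any elementwise quantizer; the function name and float-list interface are illustrative assumptions (real implementations apply a Bernoulli mask to whole tensors).

```python
import random

def qdrop_mix(x, quantize, p=0.5, rng=None):
    """QDrop mixed input: for each element, keep full precision with
    probability p (drop quantization), otherwise use the quantized value."""
    rng = rng or random.Random(0)
    mixed = []
    for v in x:
        if rng.random() < p:       # drop quantization for this element
            mixed.append(v)
        else:                      # keep activation quantization
            mixed.append(quantize(v))
    return mixed
```

Note the two degenerate settings: $p = 1$ recovers weight-only calibration on full-precision activations, $p = 0$ recovers always-on activation quantization, and intermediate $p$ randomly interpolates between the two on every forward pass.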
4. Hyperparameter Configuration and Calibration Protocol
Empirical studies sweep the drop probability $p$ from 0 to 1, identifying $p = 0.5$ as optimal across diverse settings. Recommended calibration sets are: ImageNet (1,024 images), COCO detection (256 images), and NLP (GLUE/SQuAD, 1,024 examples). Each block is reconstructed for 20,000 iterations with learning rates and batch size (32) compatible with BRECQ.
Annealing or per-model tuning of $p$ is permissible, though $p = 0.5$ is robust for both convolutional and Transformer models. The first and last layers are typically kept at 8 bits to preserve representational capacity.
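The protocol above can be collected into a single configuration sketch; the key names are hypothetical and do not correspond to an actual QDrop or MQBench API.

```python
# Hypothetical configuration mirroring the calibration protocol; key names
# are illustrative, not a real library's schema.
QDROP_CONFIG = {
    "drop_prob": 0.5,           # p: keep full precision with this probability
    "iters_per_block": 20_000,  # block-wise reconstruction iterations
    "batch_size": 32,
    "calib_examples": {"imagenet": 1024, "coco": 256, "glue_squad": 1024},
    "first_last_bits": 8,       # keep first/last layers at 8 bits
}
```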
5. Experimental Results Across Domains
QDrop demonstrates substantial improvements in top-1 accuracy and mean average precision for extremely low-bit quantized models across computer vision and natural language tasks. Key outcomes include:
| Task/Model | Bitwidth | Baseline | QDrop | Improvement |
|---|---|---|---|---|
| ResNet-50 (ImageNet) | W2A2 | 29.01 % | 54.74 % | +25.73 % |
| RegNet-3.2G (ImageNet) | W2A2 | 3.62 % | 52.36 % | +48.74 % |
| MobileNetV2 (ImageNet) | W3A3 | 23.41 % | 57.98 % | +34.57 % |
| RetinaNet/MobileNetV2 (COCO) | W2A4 | 19.35 mAP | 25.04 mAP | +5.69 mAP |
| QNLI (GLUE) | E8W4A4 | – | – | +8.7 % (vs. no-drop) |
| Cross-domain: CIFAR→IN | W4A4 | 46.83 % | 52.88 % | +6.05 % |
Across all cases, QDrop consistently outperforms both AdaRound and BRECQ, with particularly large gains at W2A2 and other low-bit settings. Gains are most pronounced on lightweight models, and close similarity between calibration and test distributions becomes less critical.
6. Insights from Ablation Studies and Landscape Analysis
Ablation studies validate the importance of intermittent activation quantization:
- Purely weight-focused calibration (Case 1) fails at low bits.
- Always quantizing activations during calibration (Case 2) converges on the calibration set but generalizes poorly, as reconstruction overfits the single activation-noise configuration seen during calibration.
- QDrop (mixed, stochastic activation dropping) achieves optimal balance, minimizing overfitting and maximizing test accuracy.
Hessian spectrum analysis reveals that QDrop yields quantized solutions with the smallest principal Hessian eigenvalues among the compared calibration schemes, consistent with the theoretical “flatness” criterion and with improved generalization.
The optimal drop probability is found at $p = 0.5$; smaller values under-explore the perturbation space, while larger values calibrate on conditions too distant from the fully quantized network seen at test time, degrading calibration effectiveness.
7. Deployment and Integration Recommendations
QDrop functions as a plug-in wrapper compatible with any block-reconstruction PTQ pipeline (e.g., AdaRound or BRECQ) and has open-source implementations in MQBench (https://github.com/ModelTC/MQBench) and at https://github.com/wimh966/QDrop. The additional computational overhead is minimal, involving only Bernoulli-masked activations during forward passes.
Default recommendations:
- Use drop probability $p = 0.5$.
- Retain per-channel weight quantization, LSQ for activation scale, and 20K iterations per block.
- Keep first/last layers at higher bitwidth where needed.
- Calibration can be performed even with cross-domain data, enhancing practical deployability.
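As a plug-in, QDrop only has to interpose on the activation quantizer used during calibration. Below is a minimal wrapper sketch under that assumption; the function name and scalar interface are hypothetical and unrelated to the actual MQBench/QDrop code.

```python
import random

def with_qdrop(quantize, p=0.5, seed=0):
    """Wrap an elementwise activation quantizer so that, during calibration,
    each value keeps full precision with probability p."""
    rng = random.Random(seed)

    def mixed(v):
        # Drop quantization with probability p; otherwise quantize as usual.
        return v if rng.random() < p else quantize(v)

    return mixed
```

During calibration the wrapped quantizer is used in place of the original; at test time the unwrapped quantizer is restored, since dropping applies only while reconstructing blocks.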
Through both theory and empirical investigation, QDrop’s random dropping of activation quantization is shown to foster flatter, more robust quantized neural solutions, enabling efficient and accurate sub-4-bit PTQ for real-world deep learning tasks, achieving for the first time practical 2-bit activation PTQ across both computer vision and natural language processing (Wei et al., 2022).