
QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization (2203.05740v2)

Published 11 Mar 2022 in cs.CV and cs.AI

Abstract: Recently, post-training quantization (PTQ) has driven much attention to produce efficient neural networks without long-time retraining. Despite its low cost, current PTQ works tend to fail under the extremely low-bit setting. In this study, we pioneeringly confirm that properly incorporating activation quantization into the PTQ reconstruction benefits the final accuracy. To deeply understand the inherent reason, a theoretical framework is established, indicating that the flatness of the optimized low-bit model on calibration and test data is crucial. Based on the conclusion, a simple yet effective approach dubbed as QDROP is proposed, which randomly drops the quantization of activations during PTQ. Extensive experiments on various tasks including computer vision (image classification, object detection) and natural language processing (text classification and question answering) prove its superiority. With QDROP, the limit of PTQ is pushed to the 2-bit activation for the first time and the accuracy boost can be up to 51.49%. Without bells and whistles, QDROP establishes a new state of the art for PTQ. Our code is available at https://github.com/wimh966/QDrop and has been integrated into MQBench (https://github.com/ModelTC/MQBench)

Overview of "QDrop: Randomly Dropping Quantization for Extremely Low-Bit Post-Training Quantization"

The paper "QDrop: Randomly Dropping Quantization for Extremely Low-Bit Post-Training Quantization" introduces a novel approach aimed at enhancing the efficacy of post-training quantization (PTQ), particularly in low-bit scenarios. This work provides a rigorous theoretical and empirical investigation into the stability and performance of PTQ when activation quantization is integrated into the reconstruction process.

Introduction

Deep neural networks, while highly effective, often come with substantial computational and memory costs, which complicates deployment on edge devices. To address this, techniques like quantization are employed, representing model parameters in lower precision. PTQ, in contrast to Quantization-Aware Training (QAT), offers a less computationally intensive alternative because it forgoes retraining. However, PTQ tends to break down under extremely low-bit settings, especially when activation quantization is left out of the reconstruction process.

Key Contributions

  1. Activation Quantization in PTQ: The paper identifies benefits from incorporating activation quantization into the PTQ process. Empirical results demonstrate considerable accuracy improvements when partially quantizing activations—contradicting common practices that neglect this aspect.
  2. Theoretical Framework: The authors establish a theoretical foundation explaining why integrating activation quantization improves accuracy. Central to this is the flatness of the optimized low-bit model's loss landscape, which is shown to be crucial for generalization from the calibration data to the test data (an illustrative formalization is sketched after this list).
  3. QDrop Methodology: The paper proposes QDrop, a simple technique that randomly drops the quantization of activations during PTQ reconstruction. This randomness encourages flatness of the optimized model across diverse data distributions, enhancing robustness and accuracy (a minimal code sketch follows this list).
  4. Extensive Validation: The efficacy of QDrop is thoroughly validated across multiple tasks including image classification, object detection, and NLP tasks. Results indicate substantial performance gains, highlighting the practicality of QDrop in pushing PTQ limits to extremely low-bit settings.
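To make the flatness notion in contribution 2 more concrete, one common way to formalize it (illustrative notation, not the paper's exact derivation) is the worst-case increase of the calibration loss under a small weight perturbation:

```latex
% Illustrative notation (not taken from the paper):
% \hat{w}: weights of the optimized low-bit model
% \mathcal{L}_{D_c}: loss on the calibration set D_c
% u: a weight perturbation bounded in norm by \epsilon
\mathcal{F}(\hat{w}) \;=\; \max_{\|u\| \le \epsilon}
  \left[ \mathcal{L}_{D_c}(\hat{w} + u) - \mathcal{L}_{D_c}(\hat{w}) \right]
```

The smaller this gap, the flatter the optimized low-bit model and the better its calibration-set reconstruction is expected to transfer to test data; roughly speaking, the paper's argument treats the activation quantization noise encountered during reconstruction as such a perturbation, so tolerating it encourages flat solutions.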

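Below is a minimal PyTorch-style sketch of the random-dropping idea in contribution 3, assuming per-tensor uniform fake quantization and an element-wise Bernoulli mask with a 0.5 drop probability. The function names, defaults, and parameter handling here are illustrative choices, not the authors' implementation; see the linked repository for the official code.

```python
import torch

def fake_quantize(x, scale, zero_point, n_bits=2):
    """Uniform fake quantization: round onto the integer grid, then dequantize."""
    qmin, qmax = 0, 2 ** n_bits - 1
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def qdrop_activation(x, scale, zero_point, n_bits=2, drop_prob=0.5):
    """Randomly drop activation quantization during PTQ reconstruction.

    Each element keeps its full-precision value with probability `drop_prob`
    and uses its fake-quantized value otherwise, mixing quantized and
    full-precision activations within a single forward pass.
    """
    x_q = fake_quantize(x, scale, zero_point, n_bits)
    keep_fp = torch.rand_like(x) < drop_prob  # element-wise Bernoulli mask
    return torch.where(keep_fp, x, x_q)

# Example: 2-bit activations, each element staying full precision half the time.
x = torch.randn(4, 8)
out = qdrop_activation(x, scale=torch.tensor(0.1), zero_point=torch.tensor(2.0))
```

At inference time all activations are quantized as usual; the random dropping applies only while the PTQ reconstruction is being optimized.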
Results and Implications

QDrop establishes a new state of the art for PTQ, pushing the limit to 2-bit activations for the first time across diverse datasets. Accuracy improvements reach up to 51.49% in some scenarios, showcasing its potential for ultra-low-bit applications. This marks a notable step forward for PTQ, making it a viable option for deploying efficient deep learning models in resource-constrained environments.

Future Directions

The research opens avenues for further exploration of PTQ techniques that account for model robustness from a broader theoretical perspective. Future work could examine the interactions between network architecture and quantization dynamics within PTQ frameworks. Additionally, extensions to other network families or hybrid quantization schemes may further refine the methodology.

Conclusion

The "QDrop" proposal addresses a critical problem in PTQ, demonstrating that thoughtful incorporation of activation quantization can dramatically elevate performance. By introducing a novel mechanism to achieve greater flatness and robustness, this paper marks a significant advancement in the field of neural network quantization technology. The methodology and insights provided are likely to catalyze further research, advancing the deployment of efficient and effective neural networks in practical applications.

Authors (5)
  1. Xiuying Wei (10 papers)
  2. Ruihao Gong (40 papers)
  3. Yuhang Li (102 papers)
  4. Xianglong Liu (128 papers)
  5. Fengwei Yu (23 papers)
Citations (132)