Towards Accurate Post-training Quantization for Reparameterized Models (2402.16121v1)
Abstract: Model reparameterization is a widely accepted technique for improving inference speed without compromising performance. However, current Post-training Quantization (PTQ) methods often cause significant accuracy degradation when applied to reparameterized models. This is primarily caused by channel-specific and sample-specific outliers, which appear only in particular samples and channels and distort the selection of quantization parameters. To address this issue, we propose RepAPQ, a novel framework that preserves the accuracy of quantized reparameterized models. Unlike previous frameworks that use Mean Squared Error (MSE) as the calibration metric, we adopt Mean Absolute Error (MAE) to mitigate the influence of outliers on the quantization parameters. Our framework comprises two main components: Quantization Protecting Reparameterization and Across-block Calibration. For effective calibration, Quantization Protecting Reparameterization merges multiple branches into a single convolution followed by an affine layer. During training, the affine layer accelerates convergence and amplifies the convolution output to better accommodate samples containing outliers. Additionally, Across-block Calibration uses the stage output as supervision to address the gradient problem introduced by MAE and to strengthen the interlayer correlation of quantization parameters. Comprehensive experiments demonstrate the effectiveness of RepAPQ across various models and tasks: it outperforms previous methods by approximately 1% for 8-bit PTQ and 2% for 6-bit PTQ. The code is available at https://github.com/ilur98/DLMC-QUANT.
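For intuition, the following is a minimal sketch of the calibration idea described above: fitting a quantization scale by minimizing MAE instead of MSE so that rare outlier activations do not dominate the chosen parameters. The helper names (`fake_quantize`, `calibration_loss`) and the grid-search loop are illustrative assumptions, not the released implementation.

```python
import torch

def fake_quantize(x, scale, zero_point, num_bits=8):
    # Uniform affine fake-quantization: quantize, clamp, then dequantize.
    qmin, qmax = 0, 2 ** num_bits - 1
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

def calibration_loss(fp_out, quant_out, use_mae=True):
    # MAE (L1) is less sensitive to rare, large activation outliers than
    # MSE (L2), so the fitted quantization parameters are not dominated
    # by a handful of outlier samples or channels.
    if use_mae:
        return (fp_out - quant_out).abs().mean()
    return ((fp_out - quant_out) ** 2).mean()

# Toy usage: select a per-tensor scale that minimizes the chosen metric on a
# small calibration batch (a grid search stands in for gradient-based tuning).
torch.manual_seed(0)
acts = torch.randn(256, 64)
acts[0, 0] = 40.0          # a sample/channel-specific outlier
best_scale, best_err = None, float("inf")
for scale in torch.linspace(0.01, 0.5, 50):
    err = calibration_loss(acts, fake_quantize(acts, scale, 128), use_mae=True)
    if err < best_err:
        best_scale, best_err = scale.item(), err.item()
print(f"selected scale: {best_scale:.3f}")
```

In this toy setup, switching to `use_mae=False` tends to push the selected scale toward covering the single outlier, which illustrates why an MAE-based objective yields more robust quantization parameters for outlier-prone reparameterized models.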
Authors: Luoming Zhang, Yefei He, Wen Fei, Zhenyu Lou, Weijia Wu, YangWei Ying, Hong Zhou