Efficient and Robust Quantization-aware Training via Adaptive Coreset Selection (2306.07215v3)
Abstract: Quantization-aware training (QAT) is a representative model compression method that reduces redundancy in weights and activations. However, most existing QAT methods require end-to-end training on the entire dataset, which incurs long training times and high energy costs. Moreover, potential label noise in the training data undermines the robustness of QAT. We propose two metrics based on an analysis of the loss and the gradient of quantized weights, the error vector score and the disagreement score, to quantify the importance of each sample during training. Guided by these two metrics, we propose a quantization-aware Adaptive Coreset Selection (ACS) method that selects the data for the current training epoch. We evaluate our method on various networks (ResNet-18, MobileNetV2, RetinaNet), datasets (CIFAR-10, CIFAR-100, ImageNet-1K, COCO), and quantization settings. In particular, our method achieves 68.39\% accuracy with 4-bit quantized ResNet-18 on ImageNet-1K using only a 10\% subset, an absolute gain of 4.24\% over the baseline. Our method also improves the robustness of QAT by removing noisy samples from the training set.
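The abstract only names the two importance metrics; it does not specify how they are computed or combined. The sketch below is a minimal, hypothetical illustration of the per-epoch selection loop it describes, assuming the error vector score is the norm of the per-sample prediction error and the disagreement score is the KL divergence between the quantized and full-precision model outputs. The helper names (`error_vector_score`, `disagreement_score`, `select_coreset`), the equal-weight sum of the two scores, and the top-k selection are assumptions for illustration, not the paper's exact algorithm.

```python
# Hypothetical sketch of per-epoch coreset selection for QAT, driven by two
# per-sample importance scores (assumed forms; the paper's definitions may differ).
import torch
import torch.nn.functional as F


def error_vector_score(logits, labels):
    """Assumed form: L2 norm of (softmax prediction - one-hot label) per sample."""
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(labels, num_classes=logits.size(1)).float()
    return (probs - one_hot).norm(dim=1)


def disagreement_score(q_logits, fp_logits):
    """Assumed form: KL divergence between quantized and full-precision predictions."""
    return F.kl_div(
        F.log_softmax(q_logits, dim=1),
        F.softmax(fp_logits, dim=1),
        reduction="none",
    ).sum(dim=1)


@torch.no_grad()
def select_coreset(q_model, fp_model, loader, fraction=0.1, device="cuda"):
    """Score every sample once, then keep the top `fraction` for this epoch.

    Assumes `loader` yields (inputs, labels), is not shuffled, and exposes
    `batch_size`, so batch positions map back to dataset indices.
    """
    q_model.eval()
    fp_model.eval()
    scores, indices = [], []
    for batch_idx, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        q_out, fp_out = q_model(x), fp_model(x)
        # Equal-weight combination of the two scores (an illustrative choice).
        s = error_vector_score(q_out, y) + disagreement_score(q_out, fp_out)
        scores.append(s.cpu())
        start = batch_idx * loader.batch_size
        indices.append(torch.arange(start, start + x.size(0)))
    scores = torch.cat(scores)
    indices = torch.cat(indices)
    k = max(1, int(fraction * len(scores)))
    return indices[scores.topk(k).indices].tolist()
```

The returned indices could then be wrapped in `torch.utils.data.Subset` to build the training loader for the next QAT epoch, so only the selected fraction of the data is used.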
Authors: Xijie Huang, Zechun Liu, Kwang-Ting Cheng, Shih-yang Liu