EPTQ: Enhanced Post-Training Quantization via Hessian-guided Network-wise Optimization (2309.11531v2)
Abstract: Quantization is a key method for deploying deep neural networks on edge devices with limited memory and computation resources. Recent improvements in Post-Training Quantization (PTQ) methods were achieved by an additional local optimization process for learning the weight quantization rounding policy. However, a gap remains when employing network-wise optimization with small representative datasets. In this paper, we propose a new method for Enhanced PTQ (EPTQ) that employs a network-wise quantization optimization process, which benefits from considering cross-layer dependencies during optimization. EPTQ enables network-wise optimization with a small representative dataset using a novel sample-layer attention score based on a label-free upper bound on the Hessian matrix. The label-free approach makes our method suitable for the PTQ scheme. We provide a theoretical analysis of this bound and use it to construct a knowledge distillation loss that guides the optimization to focus on the more sensitive layers and samples. In addition, we leverage the Hessian upper bound to improve the selection of weight quantization parameters by focusing on the more sensitive elements in the weight tensors. Empirically, EPTQ achieves state-of-the-art results on a variety of models, tasks, and datasets, including ImageNet classification, COCO object detection, and Pascal-VOC semantic segmentation.
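To make the Hessian-guided weighting concrete, below is a minimal PyTorch sketch (not the authors' implementation) of one common way such a score can be realized: the label-free Hessian is approximated with the Gauss-Newton form H ≈ JᵀJ, where J is the Jacobian of the network output with respect to a layer's activation, its trace is estimated with a Hutchinson-style estimator, and the resulting per-layer scores weight an activation-distillation loss between the float and quantized models. The helper names `hutchinson_score` and `weighted_distill_loss` are hypothetical, and the exact attention score and bound in the paper may differ.

```python
# Minimal sketch (PyTorch), not the authors' implementation.
# Assumption: label-free Hessian ~= J^T J (Gauss-Newton), trace estimated
# with Hutchinson-style Rademacher probes; helper names are hypothetical.
import torch
import torch.nn as nn


def hutchinson_score(model: nn.Module, layer: nn.Module, x: torch.Tensor,
                     n_iters: int = 16) -> float:
    """Estimate trace(J^T J) for one layer on a batch x, without labels."""
    acts = []
    handle = layer.register_forward_hook(lambda m, inp, out: acts.append(out))
    out = model(x)          # forward pass records the layer's activation
    handle.remove()
    act = acts[0]
    est = 0.0
    for _ in range(n_iters):
        # Rademacher probe vector over the model output
        v = torch.randint_like(out, low=0, high=2) * 2.0 - 1.0
        # vector-Jacobian product: g = J^T v, so E[||g||^2] = trace(J^T J)
        (g,) = torch.autograd.grad((out * v).sum(), act, retain_graph=True)
        est += g.pow(2).sum().item()
    return est / n_iters


def weighted_distill_loss(float_acts, quant_acts, scores):
    """Activation-distillation loss weighted by normalized per-layer scores,
    so the optimization focuses on the more sensitive layers."""
    w = torch.tensor(scores, dtype=torch.float32)
    w = w / w.sum()
    per_layer = [torch.mean((f - q) ** 2) for f, q in zip(float_acts, quant_acts)]
    return sum(wi * li for wi, li in zip(w, per_layer))
```

In this form, the scores can be computed once on the float model from a small representative batch and reused while optimizing the quantized model. The paper's sample-layer attention additionally weights individual samples, which this per-layer sketch omits.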