InDistill: Information flow-preserving knowledge distillation for model compression (2205.10003v4)
Abstract: In this paper, we introduce InDistill, a method that serves as a warmup stage for enhancing Knowledge Distillation (KD) effectiveness. InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student. This is achieved via a curriculum-learning-based training scheme that considers the distillation difficulty of each layer and the critical learning periods during which the information flow paths are established. This procedure yields a student model that is better prepared to learn from the teacher. To ensure the applicability of InDistill across a wide range of teacher-student pairs, we also incorporate a pruning operation for cases where the teacher and student layers differ in width. This operation reduces the width of the teacher's intermediate layers to match the student's, allowing direct distillation without an encoding stage. The proposed method is extensively evaluated on various teacher-student pairs using the CIFAR-10, CIFAR-100, and ImageNet datasets, demonstrating that preserving the information flow paths consistently improves baseline KD approaches in both classification and retrieval settings. The code is available at https://github.com/gsarridis/InDistill.
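The two operations described in the abstract lend themselves to a short illustration. The sketch below is not the authors' released implementation; it only outlines, under stated assumptions, (i) pruning a teacher feature map to the student's channel width so the two can be matched directly, without an encoding stage, and (ii) a shallow-to-deep, layer-wise warmup loop that distills one intermediate layer at a time before standard logit-based KD takes over. The names `prune_teacher_channels` and `layerwise_warmup_step`, and the L1-norm channel-selection criterion, are illustrative assumptions rather than details taken from the paper or repository.

```python
# Minimal sketch of width pruning + layer-wise curriculum warmup (assumptions,
# not the authors' code): the teacher's intermediate feature maps are pruned to
# the student's channel width, then one intermediate layer is distilled at a
# time, shallowest first, as a warmup before ordinary logit-based KD.
import torch
import torch.nn.functional as F


def prune_teacher_channels(t_feat: torch.Tensor, s_channels: int) -> torch.Tensor:
    """Keep the `s_channels` teacher channels with the largest mean |activation|.

    t_feat: teacher feature map of shape (B, C_t, H, W) with C_t >= s_channels.
    Returns a (B, s_channels, H, W) tensor that can be compared to the student's
    feature map directly, with no extra encoding layer.
    """
    importance = t_feat.abs().mean(dim=(0, 2, 3))   # per-channel importance, shape (C_t,)
    keep = importance.topk(s_channels).indices      # indices of the retained channels
    return t_feat[:, keep]


def layerwise_warmup_step(student_feats, teacher_feats, stage: int) -> torch.Tensor:
    """One curriculum warmup step: distill only intermediate layer `stage`.

    Both lists hold feature maps ordered shallowest-first, with the teacher maps
    already pruned to the student's width and matching spatial sizes.
    """
    s, t = student_feats[stage], teacher_feats[stage]
    return F.mse_loss(s, t.detach())


# Toy usage: a three-stage shallow-to-deep warmup on random feature maps.
if __name__ == "__main__":
    B, H, W = 4, 8, 8
    s_channels = [16, 32, 64]
    t_channels = [32, 64, 128]
    student_feats = [torch.randn(B, c, H, W, requires_grad=True) for c in s_channels]
    teacher_feats = [
        prune_teacher_channels(torch.randn(B, ct, H, W), cs)
        for ct, cs in zip(t_channels, s_channels)
    ]
    for stage in range(3):
        loss = layerwise_warmup_step(student_feats, teacher_feats, stage)
        loss.backward()
        print(f"stage {stage}: warmup loss {loss.item():.4f}")
```

In the actual method, the curriculum additionally accounts for per-layer distillation difficulty and the critical learning periods; the fixed shallow-to-deep ordering above is only a placeholder for that schedule.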