Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures (2405.18524v1)
Abstract: Knowledge distillation is commonly employed to compress neural networks, reducing inference costs and memory footprint. When the teacher and student share a homogeneous architecture, feature-based methods have been widely validated as effective. However, when the teacher and student are of heterogeneous architectures, the inherent differences in feature representation significantly degrade the performance of these methods. Recent studies have highlighted that low-frequency components constitute the majority of image features. Motivated by this, we propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation between heterogeneous architectures. Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models, aligning them in a compact space to overcome architectural disparities. Moreover, leveraging the intrinsic pairing characteristic of the teacher-student framework, we design a sample-level contrastive learning framework that recasts the constraints of within-sample feature similarity and between-sample feature divergence as a contrastive learning task. This strategy enables the student model to exploit intra-sample feature congruence while sharpening the discrimination of features across disparate samples. Consequently, our LFCC framework accurately captures the commonalities in feature representation across heterogeneous architectures. Extensive evaluations and empirical analyses across three architecture families (CNNs, Transformers, and MLPs) demonstrate that LFCC achieves superior performance on the challenging ImageNet-1K and CIFAR-100 benchmarks. All code will be made publicly available.
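To make the pipeline described in the abstract concrete, here is a minimal, hedged sketch of the idea, not the authors' implementation. It assumes (hypothetically) that average pooling stands in for the multi-scale low-pass filters, that a linear projector maps filtered features into the shared compact space, and that the sample-level objective is an InfoNCE-style contrastive loss over paired teacher-student features; the function and class names (`low_frequency_components`, `CompactProjector`, `sample_contrastive_loss`) are illustrative only.

```python
# Hedged sketch of the LFCC idea from the abstract; assumptions noted in the lead-in.
import torch
import torch.nn as nn
import torch.nn.functional as F


def low_frequency_components(feat: torch.Tensor, scales=(2, 4)) -> torch.Tensor:
    """Approximate multi-scale low-pass filtering: pool at several scales
    (discarding high frequencies), resize back, and concatenate on channels."""
    outs = []
    for s in scales:
        pooled = F.avg_pool2d(feat, kernel_size=s)
        outs.append(F.interpolate(pooled, size=feat.shape[-2:],
                                  mode="bilinear", align_corners=False))
    return torch.cat(outs, dim=1)


class CompactProjector(nn.Module):
    """Map filtered features from either architecture into a shared compact space."""
    def __init__(self, in_ch: int, dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)


def sample_contrastive_loss(z_s: torch.Tensor, z_t: torch.Tensor, tau: float = 0.1):
    """InfoNCE-style loss: pull together student/teacher embeddings of the same
    sample, push apart embeddings of different samples in the batch."""
    logits = z_s @ z_t.t() / tau                             # (B, B) similarities
    targets = torch.arange(z_s.size(0), device=z_s.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy intermediate features from a CNN student and a Transformer teacher
    # (teacher tokens assumed to be reshaped to a spatial map beforehand).
    f_student = torch.randn(8, 256, 14, 14)
    f_teacher = torch.randn(8, 384, 14, 14)
    lf_s = low_frequency_components(f_student)
    lf_t = low_frequency_components(f_teacher)
    proj_s = CompactProjector(lf_s.size(1))
    proj_t = CompactProjector(lf_t.size(1))
    loss = sample_contrastive_loss(proj_s(lf_s), proj_t(lf_t))
    print(loss.item())
```

In this reading, the low-pass step removes architecture-specific high-frequency detail before alignment, and the batch-wise contrastive objective supplies both the within-sample similarity and between-sample divergence constraints described above.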
Authors: Hongjun Wu, Li Xiao, Xingkuo Zhang, Yining Miao