A Generalization Theory of Cross-Modality Distillation with Contrastive Learning (2405.03355v2)
Abstract: Cross-modality distillation is an important topic for data modalities that contain limited knowledge, such as depth maps and high-quality sketches. Such techniques are of great importance, especially in memory- and privacy-restricted scenarios where labeled training data is generally unavailable. To address this problem, existing label-free methods leverage a small amount of pairwise unlabeled data to distill knowledge by aligning features or statistics between the source and target modalities. For instance, one typically minimizes the L2 distance or a contrastive loss between the learned features of paired samples in the source (e.g., image) and target (e.g., sketch) modalities. However, most algorithms in this domain focus only on experimental results and lack theoretical insight. To bridge the gap between theory and practice in cross-modality distillation, we first formulate a general framework of cross-modality contrastive distillation (CMCD), built upon contrastive learning that leverages both positive and negative correspondences, towards better distillation of generalizable features. Furthermore, we establish a thorough convergence analysis which reveals that the distance between the source and target modalities significantly impacts the test error on downstream tasks within the target modality, a finding that is also validated by our empirical results. Extensive experiments show that our algorithm consistently outperforms existing methods by a margin of 2-3% across diverse modalities (image, sketch, depth map, and audio) and tasks (recognition and segmentation).
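
The abstract describes aligning features of paired source/target samples with a contrastive objective that uses both positive and negative correspondences. The sketch below illustrates the general idea with an InfoNCE-style loss over paired features; the function name `cmcd_loss`, the temperature default, and the frozen-source training setup are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of a cross-modality contrastive distillation objective
# (assumed setup; not the paper's reference implementation).
import torch
import torch.nn.functional as F


def cmcd_loss(source_feats: torch.Tensor,
              target_feats: torch.Tensor,
              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over paired source/target features.

    source_feats: (N, D) features from the source-modality encoder (e.g. images).
    target_feats: (N, D) features from the target-modality encoder (e.g. sketches).
    Row i of each tensor comes from the same paired sample; all other rows
    in the batch serve as negatives.
    """
    src = F.normalize(source_feats, dim=1)
    tgt = F.normalize(target_feats, dim=1)

    # Similarity between every target feature and every source feature.
    logits = tgt @ src.t() / temperature          # (N, N)
    labels = torch.arange(logits.size(0), device=logits.device)

    # Positive pairs sit on the diagonal; off-diagonal entries are negatives.
    return F.cross_entropy(logits, labels)


# Usage sketch: distill from a pretrained image encoder into a sketch encoder
# using unlabeled paired data (hypothetical encoders and data loader).
# images, sketches = next(paired_loader)
# with torch.no_grad():
#     s = image_encoder(images)     # source modality, kept frozen
# t = sketch_encoder(sketches)      # target modality, being trained
# loss = cmcd_loss(s, t)
```

Replacing the contrastive term with a plain L2 distance between `s` and `t` recovers the feature-matching baseline mentioned in the abstract; the contrastive form additionally pushes apart non-corresponding pairs.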