Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization (2401.08860v2)
Abstract: High-quality annotation of fine-grained visual categories demands substantial expert knowledge and is both taxing and time-consuming. Alternatively, learning fine-grained visual representations from enormous unlabeled images (e.g., of species or brands) by self-supervised learning becomes a feasible solution. However, recent studies find that existing self-supervised learning methods are ill-suited to representing fine-grained categories. The bottleneck is that the pretext representation is built from every patch-wise embedding, whereas fine-grained categories are determined by only a few key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle this challenge. Our key idea is to account for the importance of each image patch in determining the fine-grained pretext representation via multiple instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, multi-instance knowledge distillation is applied both to region/image crop pairs between the teacher and student networks and to region-image crop pairs within the teacher/student network, which we term intra-level multi-instance distillation and inter-level multi-instance distillation, respectively. Extensive experiments on CUB-200-2011, Stanford Cars, and FGVC Aircraft show that the proposed method outperforms the contemporary method by up to 10.14% and existing state-of-the-art self-supervised learning approaches by up to 19.78%, on both top-1 accuracy and Rank-1 retrieval metrics.
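To make the core mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the two ingredients the abstract describes: gated attention-based multiple instance pooling, which weights each patch embedding by its importance to the crop-level representation, and a teacher-student distillation loss applied across crop pairs (intra-level: same crop level across networks; inter-level: region vs. image crop within one network). All names (`MILAttentionPool`, `distill_loss`), the gated-attention form, the DINO-style temperatures, and the toy shapes are illustrative assumptions, not the authors' released CMD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILAttentionPool(nn.Module):
    """Gated attention pooling over patch embeddings (assumed, in the
    style of attention-based deep MIL). The N patches of a crop form a
    bag of instances; learned per-patch weights let informative patches
    dominate the bag-level (crop-level) embedding."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.V = nn.Linear(dim, hidden)
        self.U = nn.Linear(dim, hidden)
        self.w = nn.Linear(hidden, 1)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) -> attention logits: (B, N, 1)
        logits = self.w(torch.tanh(self.V(patches)) * torch.sigmoid(self.U(patches)))
        alpha = torch.softmax(logits, dim=1)   # per-patch importance over N
        return (alpha * patches).sum(dim=1)    # (B, D) bag embedding


def distill_loss(student: torch.Tensor, teacher: torch.Tensor,
                 tau_s: float = 0.1, tau_t: float = 0.04) -> torch.Tensor:
    """Cross-entropy between a sharpened teacher distribution and the
    student distribution (DINO-style); gradients do not flow to the teacher."""
    t = F.softmax(teacher.detach() / tau_t, dim=-1)
    log_s = F.log_softmax(student / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()


# Toy usage with random patch embeddings (shapes are assumptions).
B, N, D = 4, 49, 256
pool = MILAttentionPool(D)
s_region, s_image = torch.randn(B, N, D), torch.randn(B, N, D)  # student crops
t_region, t_image = torch.randn(B, N, D), torch.randn(B, N, D)  # teacher crops

# Intra-level: match the same crop level across teacher and student.
intra = distill_loss(pool(s_region), pool(t_region)) \
      + distill_loss(pool(s_image), pool(t_image))
# Inter-level: match region against image crops inside one network.
inter = distill_loss(pool(s_region), pool(s_image))
(intra + inter).backward()
```

In a real pipeline the pooled embeddings would pass through projection heads before the loss, and the teacher would be an EMA copy of the student; the sketch omits both for brevity.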