LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition (2403.04066v1)
Abstract: Self-supervised contrastive learning has attracted remarkable attention due to its exceptional ability in representation learning. However, current contrastive learning methods tend to learn global, coarse-grained image representations that benefit generic object recognition, and such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we propose incorporating subtle local fine-grained feature learning into global self-supervised contrastive learning through a purely self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called Local Discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model's focus towards pivotal local regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the Local Discrimination pretext task can effectively enhance fine-grained clues in important local regions, and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method yields improvements in different evaluation settings. The proposed method is also effective in general object recognition tasks.
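As a rough illustration of the idea in the abstract, the Python sketch below combines a global contrastive term with a local one computed on embeddings of mask-sampled views. It is not the authors' implementation. The function names, the top-k patch selection rule, the `keep_ratio`, `temperature`, and `local_weight` parameters, and the use of an InfoNCE-style loss for both branches are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def select_pivotal_patches(scores, keep_ratio):
    """Return sorted indices of the highest-scoring patches.

    ASSUMPTION: "pivotal" patches are chosen by a per-patch importance
    score (e.g. an attention weight); the abstract does not say how.
    """
    keep = max(1, int(round(len(scores) * keep_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE loss: queries[i] is the positive of keys[i]; other keys are negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def global_local_loss(global_q, global_k, local_q, local_k, local_weight=1.0):
    """Hypothetical combined objective: global term plus a weighted local term."""
    return info_nce(global_q, global_k) + local_weight * info_nce(local_q, local_k)


# Toy usage: four patches, keep the top half by score.
kept = select_pivotal_patches([0.1, 0.9, 0.3, 0.7], keep_ratio=0.5)

# Toy embeddings for a batch of two images (global views and masked local views).
g_q, g_k = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
l_q, l_k = [[1.0, 0.2], [0.1, 1.0]], [[0.9, 0.1], [0.2, 1.0]]
loss = global_local_loss(g_q, g_k, l_q, l_k, local_weight=0.5)
```

In this sketch the local embeddings would come from encoding only the patches returned by `select_pivotal_patches`, so the local term rewards features that stay discriminative when most of the image is masked out. How the two terms are weighted and how the masks are actually sampled are design choices the abstract leaves open.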
Authors: Jialu Shi, Zhiqiang Wei, Jie Nie, Lei Huang