LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition (2403.04066v1)

Published 6 Mar 2024 in cs.CV

Abstract: Self-supervised contrastive learning has attracted remarkable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we propose incorporating subtle local fine-grained feature learning into global self-supervised contrastive learning through a purely self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called Local Discrimination (LoDisc) is proposed to explicitly direct the self-supervised model's focus towards pivotal local regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the Local Discrimination pretext task can effectively enhance fine-grained clues in important local regions, and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method leads to a decent improvement in different evaluation settings. The proposed method is also effective in general object recognition tasks.
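
The abstract describes two ingredients: a contrastive objective applied to both a global and a local branch, and a sampling step that keeps only pivotal local regions. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The function names (`info_nce`, `keep_top_patches`, `global_local_loss`), the use of an InfoNCE-style loss, the top-k selection by a per-patch importance score, and the parameters `keep_ratio`, `local_weight`, and `temperature` are all assumptions introduced for illustration; the abstract does not specify how the paper defines its loss or its location-wise mask sampling.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE-style loss: row i of `queries` is the positive for row i of `keys`.

    All other rows in the batch act as negatives. This is a generic
    contrastive loss, assumed here for illustration.
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature
    # Subtract the row-wise max for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def keep_top_patches(patches, scores, keep_ratio=0.25):
    """Keep the highest-scoring patches of each image, in original order.

    patches: (batch, num_patches, dim) patch embeddings.
    scores:  (batch, num_patches) importance scores (hypothetical; e.g.
             attention weights). Lower-scoring patches are dropped.
    """
    n_keep = max(1, int(round(patches.shape[1] * keep_ratio)))
    idx = np.argsort(-scores, axis=1)[:, :n_keep]
    idx = np.sort(idx, axis=1)  # preserve spatial order of kept patches
    return np.take_along_axis(patches, idx[:, :, None], axis=1)


def global_local_loss(g1, g2, l1, l2, local_weight=1.0, temperature=0.2):
    """Sum of a global-branch and a weighted local-branch contrastive loss.

    g1, g2: (batch, dim) global embeddings of two augmented views.
    l1, l2: (batch, dim) local embeddings, e.g. pooled kept patches.
    """
    global_term = info_nce(g1, g2, temperature)
    local_term = info_nce(l1, l2, temperature)
    return global_term + local_weight * local_term
```

In this sketch, the local embeddings could be obtained by averaging the output of `keep_top_patches` over the patch axis for each view. Weighting the two terms with a single scalar is a design choice made here for simplicity; the paper may combine its branches differently.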

Authors (4)
  1. Jialu Shi
  2. Zhiqiang Wei
  3. Jie Nie
  4. Lei Huang
