PARTICLE: Part Discovery and Contrastive Learning for Fine-grained Recognition (2309.13822v1)
Abstract: We develop techniques for refining representations for fine-grained classification and segmentation tasks in a self-supervised manner. We find that fine-tuning methods based on instance-discriminative contrastive learning are less effective for these tasks, and posit that recognizing part-specific variations is crucial for fine-grained categorization. We present an iterative learning approach that incorporates part-centric equivariance and invariance objectives. First, pixel representations are clustered to discover parts. We analyze which representations from convolutional and vision transformer networks are best suited for this task. Then, a part-centric learning step aggregates and contrasts representations of parts within an image. We show that this improves performance on image classification and part segmentation tasks across datasets. For example, under a linear-evaluation scheme, the classification accuracy of a ResNet50 trained on ImageNet using DetCon, a self-supervised learning approach, improves from 35.4% to 42.0% on Caltech-UCSD Birds, from 35.5% to 44.1% on FGVC Aircraft, and from 29.7% to 37.4% on Stanford Cars. We also observe significant gains in few-shot part segmentation tasks using the proposed technique, whereas instance-discriminative learning was less effective. Smaller, yet consistent, improvements are also observed for stronger networks based on transformers.
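The two steps described in the abstract (clustering pixel representations to discover parts, then aggregating and contrasting part representations within an image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a plain k-means with farthest-point initialisation for the clustering step, masked average pooling for aggregation, and an InfoNCE-style loss for contrasting, and it assumes the two views are pixel-aligned so that a single part map applies to both. The function names `kmeans`, `discover_parts`, `pool_parts`, and `part_contrastive_loss` are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=10):
    """Plain k-means on the rows of x (N, C); returns one label per row."""
    # Farthest-point initialisation (an assumption of this sketch):
    # deterministic, and it spreads the initial centroids apart.
    centroids = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(1) for c in centroids], axis=0)
        centroids.append(x[d.argmax()])
    centroids = np.stack(centroids).astype(float)
    for _ in range(iters):
        # Squared distance from every row to every centroid: (N, k).
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(0)
    return labels

def discover_parts(features, k):
    """Cluster pixel features (H, W, C) into k parts; returns an (H, W) part map."""
    h, w, c = features.shape
    return kmeans(features.reshape(-1, c), k).reshape(h, w)

def pool_parts(features, part_map, k):
    """Average the features inside each part and L2-normalise: (k, C) plus a validity mask."""
    pooled = np.zeros((k, features.shape[-1]))
    valid = np.zeros(k, dtype=bool)
    for j in range(k):
        mask = part_map == j
        if mask.any():
            pooled[j] = features[mask].mean(0)
            valid[j] = True
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-8), valid

def part_contrastive_loss(feat_a, feat_b, part_map, k, temperature=0.1):
    """InfoNCE-style loss: the same part across two views is the positive pair,
    the other parts of the same image act as negatives."""
    za, valid = pool_parts(feat_a, part_map, k)
    zb, _ = pool_parts(feat_b, part_map, k)  # same part map, so same validity mask
    za, zb = za[valid], zb[valid]
    logits = za @ zb.T / temperature
    logits -= logits.max(1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In an iterative scheme like the one the abstract outlines, `discover_parts` would be re-run on the refined features after each round of training with the part-centric loss. The number of parts `k` and the `temperature` value are illustrative choices, not values taken from the paper.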