Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization (2403.14973v2)
Abstract: Learning visual features from unlabeled images has proven successful for semantic categorization, often by mapping different views of the same object to the same feature to achieve recognition invariance. However, visual recognition involves not only identifying what an object is but also understanding how it is presented. For example, seeing a car from the side versus head-on is crucial for deciding whether to stay put or jump out of the way. While unsupervised feature learning for downstream viewpoint reasoning is important, it remains under-explored, partly due to the lack of a standardized evaluation method and benchmarks. We introduce a new dataset of adjacent image triplets obtained from a viewpoint trajectory, without any semantic or pose labels. We benchmark both semantic classification and pose estimation accuracies on the same visual feature. Additionally, we propose a viewpoint trajectory regularization loss for learning features from unlabeled image triplets. Our experiments demonstrate that this approach helps develop a visual representation that encodes object identity and organizes objects by their poses, retaining semantic classification accuracy while achieving emergent global pose awareness and better generalization to novel objects. Our dataset and code are available at http://pwang.pw/trajSSL/.
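The abstract does not state the form of the trajectory regularization loss. As a rough illustration only, the sketch below shows one hypothetical way to regularize features of an adjacent image triplet: it penalizes the angle between the two consecutive feature-space steps, so that features of neighboring viewpoints tend to lie along a smooth path. The function name `trajectory_straightness_loss`, the cosine-based penalty, and the toy 2-D features are all assumptions made for this example and should not be read as the paper's definition.

```python
import numpy as np

def trajectory_straightness_loss(z_prev, z_mid, z_next, eps=1e-8):
    """Hypothetical triplet regularizer (an assumption, not the paper's loss).

    Each argument is an array of shape (batch, dim) holding the features of
    the first, middle, and last image of an adjacent-viewpoint triplet.
    Returns the batch mean of 1 - cos(angle between consecutive steps).
    """
    step_a = z_mid - z_prev   # feature displacement from view 1 to view 2
    step_b = z_next - z_mid   # feature displacement from view 2 to view 3
    dot = np.sum(step_a * step_b, axis=-1)
    norms = np.linalg.norm(step_a, axis=-1) * np.linalg.norm(step_b, axis=-1)
    cos = dot / (norms + eps)  # eps guards against zero-length steps
    return float(np.mean(1.0 - cos))

# Toy features: three collinear points give a loss near zero.
straight = trajectory_straightness_loss(
    np.array([[0.0, 0.0]]), np.array([[1.0, 0.0]]), np.array([[2.0, 0.0]])
)

# Toy features: a right-angle turn in feature space gives a loss near one.
bent = trajectory_straightness_loss(
    np.array([[0.0, 0.0]]), np.array([[1.0, 0.0]]), np.array([[1.0, 1.0]])
)
```

In a training setup, a term like this would typically be added with a small weight to a standard invariance-style self-supervised objective, so that semantic grouping is retained while pose varies smoothly within each object. That combination is likewise an assumption here.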