GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding (2403.09639v1)
Abstract: Self-supervised 3D representation learning aims to learn effective representations from large-scale unlabeled point clouds. Most existing approaches adopt point discrimination as the pretext task, which assigns matched points in two distinct views as positive pairs and unmatched points as negative pairs. However, this approach often results in semantically identical points having dissimilar representations, leading to a high number of false negatives and introducing a "semantic conflict" problem. To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning. Segment grouping partitions points into semantically meaningful regions, which enhances semantic coherence and provides semantic guidance for the subsequent contrastive representation learning. Semantic-aware contrastive learning augments the semantic information extracted from segment grouping and helps to alleviate the issue of "semantic conflict". We conducted extensive experiments on multiple 3D scene understanding tasks. The results demonstrate that GroupContrast learns semantically meaningful representations and achieves promising transfer learning performance.
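The "semantic conflict" described in the abstract can be illustrated with a small NumPy sketch. It is not the GroupContrast implementation: the function name `info_nce`, the `segment_ids` argument, and the choice to simply mask same-segment points out of the negatives are all assumptions made for illustration. With no segment ids, every unmatched point is a negative (plain point discrimination); with segment ids, points in the anchor's own segment are no longer pushed away.

```python
import numpy as np

def info_nce(feat_a, feat_b, segment_ids=None, temperature=0.1):
    """Point-level InfoNCE loss between two views of N matched points.

    feat_a, feat_b: (N, D) arrays; row i in both views is the same point.
    segment_ids:    optional (N,) integer array (hypothetical input). When
                    given, points sharing a segment with anchor i are removed
                    from its negatives, so they cannot act as false negatives.
    """
    # Cosine similarity between every point in view A and every point in view B.
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (N, N); diagonal = positive pairs

    if segment_ids is not None:
        same_segment = segment_ids[:, None] == segment_ids[None, :]
        np.fill_diagonal(same_segment, False)  # always keep the positive pair
        logits = np.where(same_segment, -np.inf, logits)

    # Numerically stable log-sum-exp over each row (the softmax denominator).
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(np.mean(log_denominator - np.diag(logits)))


# Toy scene: six points in two segments, with near-identical features per segment.
rng = np.random.default_rng(0)
segments = np.array([0, 0, 0, 1, 1, 1])
prototypes = rng.normal(size=(2, 8))
view_a = prototypes[segments] + 0.01 * rng.normal(size=(6, 8))
view_b = prototypes[segments] + 0.01 * rng.normal(size=(6, 8))

plain_loss = info_nce(view_a, view_b)                        # point discrimination
aware_loss = info_nce(view_a, view_b, segment_ids=segments)  # segment-aware negatives
```

In this toy setup the plain loss stays high even though the features are already semantically consistent, because each anchor is penalized for resembling the other points in its own segment. Masking those points lowers the loss, which is the intuition behind using segment grouping to guide the contrastive objective.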
- Chengyao Wang
- Li Jiang
- Xiaoyang Wu
- Zhuotao Tian
- Bohao Peng
- Hengshuang Zhao
- Jiaya Jia