Zero-shot point cloud segmentation by transferring geometric primitives (2210.09923v3)
Abstract: We investigate transductive zero-shot point cloud semantic segmentation, where the network is trained on seen objects and is able to segment unseen objects. 3D geometric elements are essential cues for inferring a novel 3D object type. However, previous methods neglect the fine-grained relationship between language and 3D geometric elements. To this end, we propose a novel framework that learns the geometric primitives shared by objects of seen and unseen categories and employs a fine-grained alignment between language and the learned geometric primitives. As a result, guided by language, the network recognizes novel objects represented with geometric primitives. Specifically, we formulate a novel visual representation for each point: the vector of similarities between the point's feature and a set of learnable prototypes, where the prototypes automatically encode geometric primitives via back-propagation. In addition, we propose a novel Unknown-aware InfoNCE Loss to align the visual representation with language at a fine-grained level. Extensive experiments show that our method significantly outperforms other state-of-the-art methods in harmonic mean intersection-over-union (hIoU), with improvements of 17.8%, 30.4%, 9.2%, and 7.9% on the S3DIS, ScanNet, SemanticKITTI, and nuScenes datasets, respectively. Code is available at https://github.com/runnanchen/Zero-Shot-Point-Cloud-Segmentation
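The two quantitative ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`cosine`, `similarity_vector`, `harmonic_miou`) are hypothetical, cosine similarity is an assumed choice of similarity measure, and the prototypes here are fixed lists rather than parameters learned by back-propagation. hIoU is computed as the harmonic mean of the mIoU on seen and unseen classes, which is the usual definition in generalized zero-shot segmentation and is assumed to match the paper's.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_vector(point_feature, prototypes):
    """Represent one point by its similarity to each prototype.

    The output has one entry per prototype. In the paper the prototypes are
    learnable; here they are plain lists for illustration only.
    """
    return [cosine(point_feature, p) for p in prototypes]


def harmonic_miou(miou_seen, miou_unseen):
    """Harmonic mean of seen-class and unseen-class mIoU (assumed hIoU definition)."""
    total = miou_seen + miou_unseen
    if total == 0:
        return 0.0
    return 2.0 * miou_seen * miou_unseen / total


if __name__ == "__main__":
    # Two toy prototypes in a 2-D feature space (hypothetical values).
    prototypes = [[1.0, 0.0], [0.0, 1.0]]
    print(similarity_vector([1.0, 0.0], prototypes))
    print(harmonic_miou(0.5, 0.5))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method cannot reach a high hIoU by performing well on seen classes alone, which is why it is a common headline metric for zero-shot segmentation.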