Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation (2312.07221v1)
Abstract: Traditional 3D segmentation methods can only recognize a fixed set of classes that appear in the training set, which limits their application in real-world scenarios due to their lack of generalization ability. Large-scale visual-language pre-trained models, such as CLIP, have shown strong generalization in zero-shot 2D vision tasks, but cannot be applied directly to 3D semantic segmentation. In this work, we focus on zero-shot point cloud semantic segmentation and propose a simple yet effective baseline that transfers the visual-linguistic knowledge implied in CLIP to a point cloud encoder at both the feature and output levels. Concretely, for feature-level alignment, a Multi-granularity Cross-modal Feature Alignment (MCFA) module is proposed to align 2D and 3D features from both global semantic and local position perspectives. For output-level alignment, per-pixel pseudo labels of unseen classes are extracted with the pre-trained CLIP model and used as supervision so that the 3D segmentation model mimics the behavior of the CLIP image encoder. Extensive experiments are conducted on two popular point cloud segmentation benchmarks. Our method significantly outperforms previous state-of-the-art methods under the zero-shot setting (+29.2% mIoU on SemanticKITTI and +31.8% mIoU on nuScenes), and further achieves promising results in the annotation-free point cloud semantic segmentation setting, showing its great potential for label-efficient learning.
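The two levels of knowledge transfer described above can be illustrated with a minimal NumPy sketch. This is not the authors' actual MCFA module or training code; it only shows the general shape of the two losses under assumed inputs: paired per-point 2D (CLIP image) and 3D features for the feature-level alignment, and CLIP-derived per-point pseudo labels for the output-level supervision. All names, shapes, and the use of a simple cosine-distance loss are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize feature vectors to unit length (as CLIP-style features usually are)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def feature_alignment_loss(feat_2d, feat_3d):
    """Feature-level transfer: mean cosine distance between paired
    2D (CLIP image encoder) and 3D (point encoder) features, shape (N, D)."""
    f2 = l2_normalize(feat_2d)
    f3 = l2_normalize(feat_3d)
    cos_sim = np.sum(f2 * f3, axis=-1)            # (N,) cosine similarity per point
    return float(np.mean(1.0 - cos_sim))

def pseudo_label_loss(logits_3d, pseudo_labels):
    """Output-level transfer: cross-entropy of 3D segmentation logits (N, C)
    against per-point pseudo labels (N,) extracted from CLIP."""
    z = logits_3d - logits_3d.max(axis=-1, keepdims=True)        # numerically stable
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    n = logits_3d.shape[0]
    return float(-np.mean(log_probs[np.arange(n), pseudo_labels]))

# Toy data: 128 points, 16-dim features, 5 classes.
rng = np.random.default_rng(0)
N, D, C = 128, 16, 5
feat_2d = rng.normal(size=(N, D))
feat_3d = feat_2d + 0.1 * rng.normal(size=(N, D))   # 3D features near their 2D targets
logits_3d = rng.normal(size=(N, C))
pseudo_labels = rng.integers(0, C, size=N)

align = feature_alignment_loss(feat_2d, feat_3d)
ce = pseudo_label_loss(logits_3d, pseudo_labels)
print(f"feature alignment loss: {align:.4f}, pseudo-label CE: {ce:.4f}")
```

In an actual training loop both terms would be combined into one objective and backpropagated through the 3D encoder only, with the CLIP branch frozen; the sketch above just evaluates the two losses once on random data.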
Authors: Yuanbin Wang, Shaofei Huang, Yulu Gao, Zhen Wang, Rui Wang, Kehua Sheng, Bo Zhang, Si Liu