3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation (2401.02402v3)
Abstract: 3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved strong performance on closed-set benchmarks, generalizing them to unseen thing and stuff categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation methods have achieved promising results by relying solely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not guarantee good performance because per-mask classification quality is poor, especially for novel stuff categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model fuses learnable LiDAR features with dense, frozen CLIP vision features and uses a single classification head to make predictions for both base and novel classes. To further improve classification performance on novel classes and better leverage the CLIP model, we propose two novel loss functions: an object-level distillation loss and a voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms a strong baseline by a large margin.
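The abstract names two ingredients without giving formulas: a single classification head that covers base and novel classes, and distillation losses that pull 3D features toward frozen CLIP features. The NumPy sketch below is one common way such pieces are built, not the paper's implementation. The cosine-similarity classifier, the `1 - cosine` loss form, the temperature value, and all function names are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_open_vocab(query_emb, text_emb, temperature=0.01):
    """Hypothetical open-vocabulary head.

    query_emb: (num_queries, dim) embeddings predicted for masks or voxels.
    text_emb:  (num_classes, dim) CLIP text embeddings of class names;
               base and novel classes share this one matrix.
    Returns (num_queries, num_classes) softmax probabilities over
    temperature-scaled cosine similarities.
    """
    q = l2_normalize(query_emb)
    t = l2_normalize(text_emb)
    logits = q @ t.T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def cosine_distillation_loss(pred_emb, clip_emb, valid=None):
    """Assumed distillation loss: mean (1 - cosine similarity).

    pred_emb: (n, dim) embeddings from the learnable 3D branch.
    clip_emb: (n, dim) frozen CLIP vision features for the same
              objects (object-level) or voxels (voxel-level).
    valid:    optional (n,) boolean mask of entries that have a CLIP target.
    """
    cos = np.sum(l2_normalize(pred_emb) * l2_normalize(clip_emb), axis=-1)
    loss = 1.0 - cos
    if valid is not None:
        valid = np.asarray(valid, dtype=bool)
        if not valid.any():
            return 0.0
        loss = loss[valid]
    return float(loss.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(5, 16))     # 5 class-name embeddings (toy data)
    queries = rng.normal(size=(3, 16))  # 3 predicted mask embeddings (toy data)
    clip_targets = rng.normal(size=(3, 16))
    print(classify_open_vocab(queries, text).shape)
    print(cosine_distillation_loss(queries, clip_targets))
```

Because the class set enters only through `text_emb`, adding a novel category at inference time amounts to appending one more text embedding row, which is the usual motivation for classifying in a shared vision-language embedding space.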
- Zihao Xiao
- Longlong Jing
- Shangxuan Wu
- Alex Zihao Zhu
- Jingwei Ji
- Chiyu Max Jiang
- Wei-Chih Hung
- Thomas Funkhouser
- Weicheng Kuo
- Anelia Angelova
- Yin Zhou
- Shiwei Sheng