SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Instance Segmentation (2311.17707v2)
Abstract: We introduce SAMPro3D for zero-shot instance segmentation of 3D scenes. Given the 3D point cloud and multiple posed RGB-D frames of a 3D scene, our approach segments 3D instances by applying the pretrained Segment Anything Model (SAM) to the 2D frames. Our key idea is to locate SAM prompts in 3D so that their projected pixel prompts are aligned across frames, ensuring the view consistency of SAM-predicted masks. Moreover, we select prompts from the initial set guided by the SAM-predicted masks across all views, which enhances the overall performance. We further consolidate different prompts if they segment different surface parts of the same 3D instance, yielding a more comprehensive segmentation. Notably, our method does not require any additional training. Extensive experiments on diverse benchmarks show that our method achieves performance comparable to or better than previous zero-shot or fully supervised approaches, and in many cases surpasses human annotations. Furthermore, since our fine-grained predictions often lack annotations in available datasets, we present the ScanNet200-Fine50 test data, which provides fine-grained annotations on 50 scenes from the ScanNet200 dataset. The project page can be accessed at https://mutianxu.github.io/sampro3d/.
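To make the key idea concrete, here is a minimal sketch of how a single 3D prompt could be projected into several posed frames so that the resulting pixel prompts all refer to the same surface point. It is an illustration under stated assumptions, not the authors' implementation: it assumes a standard pinhole camera with intrinsics fx, fy, cx, cy and a 4x4 world-to-camera pose per frame, and the function names project_prompt and project_across_frames are hypothetical. It also omits any occlusion test against the depth maps, which a real pipeline would presumably need.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat4 = Sequence[Sequence[float]]  # row-major 4x4 world-to-camera matrix (assumed convention)
Pixel = Tuple[int, int]


def project_prompt(
    point_world: Vec3,
    world_to_cam: Mat4,
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Optional[Pixel]:
    """Project one 3D prompt into one frame.

    Returns the pixel (u, v), or None if the point is behind the camera
    or falls outside the image. Occlusion is NOT checked in this sketch.
    """
    x, y, z = point_world
    # Rigid transform from world coordinates to camera coordinates.
    xc, yc, zc = (
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in world_to_cam[:3]
    )
    if zc <= 0:
        return None  # behind the camera plane
    # Pinhole projection onto the image plane.
    u = int(round(fx * xc / zc + cx))
    v = int(round(fy * yc / zc + cy))
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


def project_across_frames(
    point_world: Vec3,
    poses: Sequence[Mat4],
    intrinsics: Tuple[float, float, float, float],
    size: Tuple[int, int],
) -> List[Optional[Pixel]]:
    """Project the same 3D prompt into every frame, one entry per pose."""
    fx, fy, cx, cy = intrinsics
    width, height = size
    return [
        project_prompt(point_world, pose, fx, fy, cx, cy, width, height)
        for pose in poses
    ]


# Two hypothetical frames: an identity pose and a camera shifted 0.5 along x.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shifted = [[1, 0, 0, -0.5], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

pixels = project_across_frames(
    (0.5, 0.0, 2.0), [identity, shifted], (100.0, 100.0, 50.0, 40.0), (100, 80)
)
print(pixels)
```

Each non-None entry could then be passed to SAM as a point prompt for the corresponding frame. Because every pixel prompt comes from the same 3D location, the per-frame masks are tied to one 3D prompt instead of being matched after the fact.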