SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation (2311.17707v1)
Abstract: We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given a 3D point cloud and multiple posed 2D frames of a scene, our approach segments the scene by applying the pretrained Segment Anything Model (SAM) to the 2D frames. Our key idea is to locate 3D points in the scene as natural 3D prompts and to align their projected pixel prompts across frames, which ensures frame-consistency in both the pixel prompts and their SAM-predicted masks. Moreover, we propose filtering out low-quality 3D prompts based on feedback from all 2D frames to improve segmentation quality. We also propose consolidating different 3D prompts when they segment the same object, which yields a more comprehensive segmentation. Notably, our method requires no additional training on domain-specific data, which preserves the zero-shot power of SAM. Extensive qualitative and quantitative results show that our method consistently achieves higher-quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases even surpasses human-level annotations. The project page is available at https://mutianxu.github.io/sampro3d/.
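To make the idea of projecting one 3D prompt into several posed frames concrete, here is a minimal sketch rather than the authors' implementation. It assumes a pinhole camera with a 3x3 intrinsics matrix and a 4x4 world-to-camera pose per frame; the function name `project_prompts` and its parameters are hypothetical, and occlusion handling is omitted.

```python
import numpy as np

def project_prompts(points_world, intrinsics, world_to_cam, width, height):
    """Project N 3D points into one frame (hypothetical helper, pinhole model assumed).

    Returns (N, 2) pixel coordinates and an (N,) boolean mask marking points that
    lie in front of the camera and inside the image. Occlusion is NOT checked.
    """
    n = points_world.shape[0]
    homo = np.hstack([points_world, np.ones((n, 1))])   # (N, 4) homogeneous coords
    cam = (world_to_cam @ homo.T).T[:, :3]              # (N, 3) camera-space coords
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)                 # avoid dividing by z <= 0
    uvw = (intrinsics @ cam.T).T                        # (N, 3) before perspective divide
    pixels = uvw[:, :2] / safe_z[:, None]
    in_image = (
        (pixels[:, 0] >= 0) & (pixels[:, 0] < width)
        & (pixels[:, 1] >= 0) & (pixels[:, 1] < height)
    )
    return pixels, in_front & in_image

# Illustrative values only: one camera at the world origin looking down +z.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
prompts = np.array([[0.0, 0.0, 2.0],    # in front of the camera, on the optical axis
                    [0.0, 0.0, -1.0],   # behind the camera
                    [10.0, 0.0, 1.0]])  # in front, but far outside the image
pixels, visible = project_prompts(prompts, K, pose, width=640, height=480)
```

Calling the same function once per frame with that frame's pose gives each 3D prompt a pixel prompt in every frame where it is visible, so the prompts passed to SAM in different frames all refer to the same 3D location.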