
SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation (2311.17707v1)

Published 29 Nov 2023 in cs.CV

Abstract: We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the 3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments 3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames. Our key idea involves locating 3D points in scenes as natural 3D prompts to align their projected pixel prompts across frames, ensuring frame-consistency in both pixel prompts and their SAM-predicted masks. Moreover, we suggest filtering out low-quality 3D prompts based on feedback from all 2D frames, for enhancing segmentation quality. We also propose to consolidate different 3D prompts if they are segmenting the same object, bringing a more comprehensive segmentation. Notably, our method does not require any additional training on domain-specific data, enabling us to preserve the zero-shot power of SAM. Extensive qualitative and quantitative results show that our method consistently achieves higher quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases even surpasses human-level annotations. The project page can be accessed at https://mutianxu.github.io/sampro3d/.

References (78)
  1. 3d semantic parsing of large-scale indoor spaces. In CVPR, 2016.
  2. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  3. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  4. Segment anything in 3d with nerfs. In NeurIPS, 2023.
  5. Matterport3d: Learning from rgb-d data in indoor environments. In 3DV, 2017.
  6. Scanrefer: 3d object localization in rgb-d scans using natural language. In ECCV, 2020.
  7. Clip2scene: Towards label-efficient 3d scene understanding by clip. In CVPR, 2023a.
  8. 4dcontrast: Contrastive learning with dynamic correspondences for 3d scene understanding. In ECCV, 2022.
  9. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In CVPR, 2023b.
  10. Sspc-net: Semi-supervised semantic 3d point cloud segmentation network. In AAAI, 2021.
  11. Box2mask: Weakly supervised 3d semantic instance segmentation using bounding boxes. In ECCV, 2022.
  12. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In CVPR, 2019.
  13. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  14. PLA: Language-Driven Open-Vocabulary 3D Scene Understanding. In CVPR, 2023.
  15. Efficient graph-based image segmentation. IJCV, 2004.
  16. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
  17. Learning 3d semantic segmentation with only 2d image supervision. In 3DV, 2021.
  18. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
  19. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
  20. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In CVPR, 2021.
  21. Point-wise Convolutional Neural Network. In CVPR, 2018.
  22. Scenenn: A scene meshes dataset with annotations. In 3DV, 2016.
  23. Ponder: Point cloud pre-training via neural rendering. In ICCV, 2023.
  24. Spatio-temporal self-supervised representation learning for 3d point clouds. In ICCV, 2021.
  25. Hierarchical point-edge interaction network for point cloud semantic segmentation. In ICCV, 2019.
  26. Pointgroup: Dual-set point grouping for 3d instance segmentation. In CVPR, 2020.
  27. Segment anything in high quality. In NeurIPS, 2023.
  28. Segment anything. In ICCV, 2023.
  29. Virtual multi-view fusion for 3d semantic segmentation. In ECCV, 2020.
  30. MSeg: A composite dataset for multi-domain semantic segmentation. In CVPR, 2020.
  31. PointCNN: Convolution on X-transformed Points. In NeurIPS, 2018.
  32. Openrooms: An open framework for photorealistic indoor scene datasets. In CVPR, 2021.
  33. Fpconv: Learning local flattening for point convolution. In CVPR, 2020.
  34. Weakly supervised 3d scene segmentation with region-level boundary awareness and instance discrimination. In ECCV, 2022.
  35. Segment any point cloud sequences by distilling vision foundation models. In NeurIPS, 2023.
  36. A closer look at local aggregation operators in point cloud analysis. In ECCV, 2020.
  37. Group-free 3d object detection via transformers. In ICCV, 2021.
  38. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. In ICLR, 2022.
  39. Feature-realistic neural fusion for real-time, open set scene understanding. In ICRA, 2023.
  40. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In ICRA, 2017.
  41. Generative zero-shot learning for semantic segmentation of 3D point cloud. In 3DV, 2021.
  42. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  43. An End-to-End Transformer Model for 3D Object Detection. In ICCV, 2021.
  44. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  45. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
  46. OpenScene: 3D Scene Understanding with Open Vocabularies. In CVPR, 2023.
  47. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR, 2017a.
  48. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In NeurIPS, 2017b.
  49. Deep hough voting for 3d object detection in point clouds. In ICCV, 2019.
  50. Learning transferable visual models from natural language supervision. In ICML, 2021.
  51. Segment anything meets point tracking. arXiv preprint arXiv:2307.01197, 2023.
  52. Language-grounded indoor 3d semantic segmentation in the wild. In ECCV, 2022.
  53. Image-to-lidar self-supervised distillation for autonomous driving data. In CVPR, 2022.
  54. Mask3D: Mask Transformer for 3D Semantic Instance Segmentation. In ICRA, 2023.
  55. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, 2019.
  56. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR, 2015.
  57. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
  58. OpenMask3D: Open-Vocabulary 3D Instance Segmentation. In NeurIPS, 2023.
  59. Kpconv: Flexible and deformable convolution for point clouds. In ICCV, 2019.
  60. Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In ICRA, 2015.
  61. Seggpt: Segmenting everything in context. In ICCV, 2023.
  62. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph., 2019.
  63. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In CoRL, 2022.
  64. Pointconv: Deep convolutional networks on 3d point clouds. In CVPR, 2019.
  65. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In ECCV, 2020.
  66. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In CVPR, 2021.
  67. To-scene: A large-scale dataset for understanding 3d tabletop scenes. In ECCV, 2022.
  68. Mm-3dscene: 3d scene understanding by customizing masked modeling with informative-preserved reconstruction and self-distilled consistency. In CVPR, 2023.
  69. Weakly supervised semantic point cloud segmentation: Towards 10x fewer labels. In CVPR, 2020.
  70. SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters. In ECCV, 2018.
  71. Zero-shot point cloud segmentation by semantic-visual aware synthesis. In ICCV, 2023a.
  72. Sam3d: Segment anything in 3d scenes. arXiv preprint arXiv:2306.03908, 2023b.
  73. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289, 2023a.
  74. Sam3d: Zero-shot 3d object detection via segment anything model. arXiv preprint arXiv:2306.02245, 2023b.
  75. Point transformer. In ICCV, 2021.
  76. Sess: Self-ensembling semi-supervised 3d object detection. In CVPR, 2020.
  77. Generalized decoding for pixel, image, and language. In CVPR, 2023a.
  78. Segment everything everywhere all at once. In NeurIPS, 2023b.

Summary

  • The paper introduces a framework that uses 3D point prompts to leverage 2D SAM for efficient zero-shot scene segmentation.
  • The method projects 3D points onto 2D frames and filters low-quality prompts to ensure consistent segmentation across views.
  • Experimental results show that SAMPro3D achieves higher mIoU than existing methods, outperforming even some fully supervised approaches.

An Overview of SAMPro3D: Zero-Shot 3D Scene Segmentation

The paper "SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation" introduces SAMPro3D, a framework for directly applying the Segment Anything Model (SAM) to achieve zero-shot 3D indoor scene segmentation. The proposed method efficiently transfers the segmentation capacity of SAM from 2D images to 3D data by treating 3D points in a scene as natural prompts for SAM.

Framework and Methodology

SAMPro3D projects 3D points onto posed 2D frames, where they serve as pixel prompts for SAM; because the pipeline requires no further training on domain-specific datasets, SAM's zero-shot capability is fully preserved. Anchoring each prompt in 3D keeps its projected pixel prompts, and hence SAM's predicted masks, consistent across frames. Qualitative and quantitative evaluations show superior performance over existing zero-shot and fully supervised methods, occasionally exceeding human-level annotations.
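To make the core operation concrete, here is a minimal sketch of projecting one 3D prompt point into a posed frame under a standard pinhole camera model. The function name and variable layout are illustrative assumptions, not code from the SAMPro3D repository:

```python
import numpy as np

def project_point(p_world, world_to_cam, K):
    """Project a 3D point (world coordinates) to pixel coordinates.

    p_world:      (3,) point in world space
    world_to_cam: (4, 4) extrinsic matrix of the frame
    K:            (3, 3) camera intrinsic matrix
    Returns (u, v, depth), or None if the point lies behind the camera.
    """
    p_h = np.append(p_world, 1.0)       # homogeneous coordinates
    p_cam = world_to_cam @ p_h          # world -> camera space
    if p_cam[2] <= 0:                   # behind the image plane
        return None
    uv = K @ p_cam[:3]                  # pinhole projection
    return uv[0] / uv[2], uv[1] / uv[2], p_cam[2]
```

The same 3D point, projected this way into every frame that sees it, yields the aligned pixel prompts fed to SAM; in practice one would also clip (u, v) against the image bounds and compare the returned depth against the frame's depth map to handle occlusion.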

A key component of SAMPro3D is prompt filtering: 3D prompts whose SAM masks score poorly across the 2D frames are discarded, so that feedback from all frames retains only high-quality prompts. The method also consolidates prompts that segment the same object, which addresses incomplete object coverage and yields more comprehensive segmentation.
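The following sketch illustrates both ideas. It assumes SAM's predicted-IoU output serves as the per-frame quality score and that near-identical accumulated masks indicate prompts on the same object; the function names, thresholds, and greedy merging strategy are assumptions for clarity, not the paper's exact formulation:

```python
import numpy as np

def filter_prompts(prompt_scores, min_score=0.8):
    """prompt_scores: {prompt_id: list of per-frame SAM quality scores}.
    Keeps prompts whose mean score across visible frames passes the bar."""
    return [pid for pid, scores in prompt_scores.items()
            if scores and np.mean(scores) >= min_score]

def consolidate_prompts(masks, iou_thresh=0.9):
    """masks: {prompt_id: boolean mask over scene points, accumulated
    across frames}. Greedily groups prompts whose masks nearly coincide,
    so each resulting group segments a single object."""
    ids, used, groups = list(masks), set(), []
    for i, a in enumerate(ids):
        if a in used:
            continue
        group = [a]
        for b in ids[i + 1:]:
            if b in used:
                continue
            inter = np.logical_and(masks[a], masks[b]).sum()
            union = np.logical_or(masks[a], masks[b]).sum()
            if union and inter / union >= iou_thresh:
                group.append(b)
                used.add(b)
        used.add(a)
        groups.append(group)
    return groups
```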

The pipeline proceeds in stages: 3D prompts are first proposed, then filtered based on their mask quality across frames; the surviving prompts are consolidated; and finally, 3D masks are derived by accumulating the per-frame segmentations, for instance as sketched below.
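As one way to picture the final stage, the sketch below assigns each scene point to the prompt whose SAM masks cover it most often across frames. The majority-vote scheme and function signature are illustrative assumptions, not the paper's exact accumulation rule:

```python
import numpy as np

def accumulate_3d_masks(num_points, per_frame_hits):
    """per_frame_hits: list of (prompt_id, point_indices) pairs, one per
    (frame, prompt) SAM mask; point_indices are the scene points whose
    projections fall inside that mask.
    Returns (labels, prompt_ids): labels[i] indexes into prompt_ids,
    or is -1 if point i was never covered by any mask."""
    prompt_ids = sorted({pid for pid, _ in per_frame_hits})
    col = {pid: j for j, pid in enumerate(prompt_ids)}
    votes = np.zeros((num_points, len(prompt_ids)), dtype=np.int64)
    for pid, idx in per_frame_hits:
        votes[idx, col[pid]] += 1        # one vote per covering mask
    labels = votes.argmax(axis=1)        # best-supported prompt per point
    labels[votes.max(axis=1) == 0] = -1  # uncovered points stay unlabeled
    return labels, prompt_ids
```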

Numerical Results and Implications

A notable strength of SAMPro3D is that it delivers richer segmentation than prior models, achieving a higher mean Intersection over Union (mIoU) than zero-shot baselines such as SAM3D and, in some settings, fully supervised approaches such as Mask3D. These results underscore the method's robustness and scalability, particularly in environments where precise 3D understanding is crucial.
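For reference, mIoU here is the standard per-class intersection-over-union averaged over classes; a generic implementation (not code from the paper) looks like this:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label arrays of equal shape.
    Classes absent from both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```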

The framework is especially valuable in practical applications such as augmented reality and robotics, where accurate scene understanding without extensive labeled 3D data is essential. The paper also notes that improvements to 2D segmentation models, such as HQ-SAM and Mobile-SAM, can be plugged in directly to enhance 3D segmentation results, reinforcing the broader idea of leveraging advanced 2D segmentation techniques for 3D applications.

Theoretical and Practical Implications

Theoretically, SAMPro3D represents a shift in how 3D scene segmentation is approached, highlighting the potential of extending pre-trained 2D models into 3D scenes. It challenges the convention of relying heavily on domain-specific 3D pre-training and suggests an alternative pathway to scene understanding.

Practically, the technique can be deployed on novel 3D scenes without dataset-specific training, making it suitable for domains with limited compute and annotation budgets. This can improve the efficiency of systems operating in real-time settings and extend benefits to sectors beyond traditional computer vision tasks.

Future Directions

Future research could improve the framework's adaptability to varied 3D environments and refine its prompt generation. Integrating other pre-trained models or novel prompting strategies might further strengthen SAMPro3D's zero-shot segmentation capability.

In conclusion, SAMPro3D marks a substantial step forward in zero-shot 3D scene segmentation: it is model-agnostic and offers a robust solution that can approach human-level segmentation accuracy and diversity without retraining on bespoke datasets. The framework sets a precedent for subsequent advances in AI-driven 3D scene understanding and its interdisciplinary applications.
