Segment Anything in 3D with Radiance Fields (2304.12308v5)

Published 24 Apr 2023 in cs.CV

Abstract: The Segment Anything Model (SAM) emerges as a powerful vision foundation model to generate high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, short for Segment Anything in 3D. With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively refine the 3D mask of the target object. For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space with guidance of the density distribution learned by the radiance field for 3D mask refinement; then, cross-view self-prompting extracts reliable prompts automatically as the input to SAM from the rendered 2D mask of the inaccurate 3D mask for a new view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology to lift the ability of a 2D segmentation model to 3D. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.

Authors (8)
  1. Jiazhong Cen
  2. Zanwei Zhou
  3. Jiemin Fang
  4. Chen Yang
  5. Wei Shen
  6. Lingxi Xie
  7. Xiaopeng Zhang
  8. Qi Tian

Summary

  • The paper introduces SA3D, a framework that lifts SAM's 2D segmentation into 3D via NeRFs, producing accurate 3D masks through iterative refinement.
  • It combines a single manual prompt with mask inverse rendering and cross-view self-prompting to bridge 2D and 3D segmentation.
  • Experiments on Replica, NVOS, and SPIn-NeRF show strong results, including an mIoU improvement of over 6.5% on NVOS compared to prior state-of-the-art methods.

Essay: Segment Anything in 3D with NeRFs

The paper "Segment Anything in 3D with NeRFs" systematically explores the extension of the Segment Anything Model (SAM) into the three-dimensional (3D) domain using Neural Radiance Fields (NeRFs). This research targets a significant gap in the literature concerning the translation of 2D segmentation capabilities into 3D space, leveraging computational efficiencies and avoided expenses typically associated with direct 3D data annotations.

Methodological Overview

The core proposition of this work is SA3D, a framework that couples SAM's 2D segmentation ability with the 3D geometry captured by a NeRF, circumventing the need for ground-up 3D dataset creation. SA3D starts from a manual segmentation prompt on a single rendered view and generates an initial 2D mask with SAM. It then alternates between two phases across the remaining views: mask inverse rendering, which uses the radiance field's density distribution to project each 2D mask into a 3D voxel grid, and cross-view self-prompting, which renders the current (still incomplete) 3D mask into a new view and automatically extracts reliable point prompts from it for SAM, iteratively refining and extending the 3D segmentation mask.
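
To make this alternation concrete, the following is a minimal Python sketch of the loop described above, assuming a voxel-grid radiance field. The View container, the sam callable, and the per-ray coordinate and weight arrays are hypothetical stand-ins for whatever the NeRF backbone exposes; the authors' released implementation (linked in the abstract) differs in detail.

    # Illustrative sketch of the SA3D alternation: mask inverse rendering
    # followed by cross-view self-prompting. Data layout and helper names
    # are assumptions for exposition, not the authors' API.
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    import numpy as np

    @dataclass
    class View:
        image: np.ndarray    # (H, W, 3) RGB image of this view
        coords: np.ndarray   # (H, W, S, 3) integer voxel indices of S samples per ray
        weights: np.ndarray  # (H, W, S) volume-rendering weights from the NeRF

    def mask_inverse_render(vox, v, mask2d, neg=0.15):
        # Scatter per-ray rendering weights into the 3D mask grid (in place).
        # Pixels inside the SAM mask add their weights to the voxels their ray
        # traverses; pixels outside subtract a small penalty, so the density
        # distribution learned by the radiance field guides the 2D-to-3D lift.
        sign = np.where(mask2d, 1.0, -neg)                    # (H, W)
        contrib = v.weights * sign[..., None]                 # (H, W, S)
        idx = v.coords.reshape(-1, 3)
        np.add.at(vox, (idx[:, 0], idx[:, 1], idx[:, 2]), contrib.ravel())

    def render_soft_mask(vox, v):
        # Volume-render the current 3D mask into a 2D confidence map.
        vals = vox[v.coords[..., 0], v.coords[..., 1], v.coords[..., 2]]
        return (v.weights * vals).sum(axis=-1)                # (H, W)

    def self_prompt(soft, k=3, thresh=0.0):
        # Pick up to k high-confidence pixels as (x, y) point prompts for SAM.
        ys, xs = np.nonzero(soft > thresh)
        if ys.size == 0:
            return []
        top = np.argsort(soft[ys, xs])[::-1][:k]
        return [(int(x), int(y)) for x, y in zip(xs[top], ys[top])]

    def sa3d(views: List[View], sam: Callable, grid_shape: Tuple[int, int, int],
             init_prompt: Tuple[int, int]) -> np.ndarray:
        # One pass over the training views: prompt SAM, lift the 2D mask to 3D,
        # then self-prompt each later view from the rendered (partial) 3D mask.
        vox = np.zeros(grid_shape)
        for i, v in enumerate(views):
            prompts = [init_prompt] if i == 0 else self_prompt(render_soft_mask(vox, v))
            if not prompts:
                continue                                      # object not visible here
            mask2d = sam(v.image, prompts)                    # (H, W) boolean SAM mask
            mask_inverse_render(vox, v, mask2d)
        return vox > 0.0                                      # binary 3D mask

The actual method additionally guards against unreliable self-generated prompts (the "reliable prompts" of the abstract), a safeguard this sketch omits for brevity.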

Empirical Evaluation

The authors conducted experiments on well-established benchmarks: Replica, NVOS, and SPIn-NeRF. The results indicate a notable increase in segmentation accuracy and efficiency, with the method completing 3D object segmentation within seconds. On the NVOS dataset, SA3D improves mIoU by more than 6.5% over the contemporary state of the art. These results underscore the effectiveness of NeRFs as a bridge between 2D and 3D segmentation paradigms.
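
For reference, the reported mIoU is the standard mean intersection-over-union between predicted and ground-truth masks, averaged over the evaluated objects. A minimal sketch, using toy masks rather than the benchmark data:

    # Minimal mIoU computation: per-object IoU, then the mean across objects.
    import numpy as np

    def iou(pred: np.ndarray, gt: np.ndarray) -> float:
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        return float(inter / union) if union else 1.0  # empty-vs-empty: perfect

    def miou(pairs) -> float:
        return float(np.mean([iou(p, g) for p, g in pairs]))

    # Toy example with two 2x2 boolean masks: intersection 1, union 2.
    pred = np.array([[1, 1], [0, 0]], dtype=bool)
    gt = np.array([[1, 0], [0, 0]], dtype=bool)
    print(miou([(pred, gt)]))  # 0.5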

Implications and Future Directions

Practically, SA3D provides a streamlined path to 3D segmentation without the substantial overhead of direct 3D data annotation. This advancement could facilitate the development of more efficient 3D modeling applications and comprehensive virtual environments, potentially broadening accessibility and reducing costs.

Theoretically, this research introduces a promising methodology for extending 2D foundation models to 3D using structural priors, provided these models produce consistent segmentations across multiple views. This insight opens the potential for future innovations in which 2D models systematically gain 3D capabilities through similar integrative approaches, fostering a new class of versatile vision models.

Broader Impact

SA3D demonstrates tangible possibilities in diverse sectors, including video game development, augmented reality, and robotics, where swift and reliable 3D environmental interpretation is crucial. The framework's reliance on off-the-shelf components, SAM and an inexpensive radiance-field prior, further enhances its practical appeal.

Conclusion

The work compellingly illustrates the effective use of SAM in conjunction with NeRFs to resolve the challenges of 3D segmentation. By leveraging the strengths of modern neural architectures, the authors provide a crucial link between 2D image understanding and 3D spatial awareness. As research continues to build on these foundations, the implications for AI advancements in multidimensional vision models offer an expansive field of study with significant practical potential.
