Segment Anything in 3D with Radiance Fields (2304.12308v5)
Abstract: The Segment Anything Model (SAM) has emerged as a powerful vision foundation model that generates high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure, which is costly in 3D, we design an efficient solution that leverages the radiance field as a cheap, off-the-shelf prior connecting multi-view 2D images to 3D space. We refer to the proposed solution as SA3D, short for Segment Anything in 3D. With SA3D, the user only needs to provide a 2D segmentation prompt (e.g., rough points) for the target object in a single view, which SAM uses to generate the corresponding 2D mask. Next, SA3D alternates between mask inverse rendering and cross-view self-prompting across views to iteratively refine the 3D mask of the target object. For each view, mask inverse rendering projects the 2D mask obtained by SAM into 3D space, guided by the density distribution learned by the radiance field, to refine the 3D mask. Cross-view self-prompting then renders the still-inaccurate 3D mask into a new view and automatically extracts reliable prompts from that rendered 2D mask as input to SAM. Experiments show that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology for lifting the ability of a 2D segmentation model to 3D. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.
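The alternation described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the flat voxel array, the function names, the `neg_weight` penalty for rays outside the mask, and the single-argmax prompt selection are all assumptions made for illustration. The only idea taken from the abstract is that the 2D mask is pushed into 3D in proportion to where the radiance field's density places the surface, and that the resulting 3D mask is rendered back to 2D to find a prompt.

```python
import numpy as np

def render_weights(density, delta=1.0):
    """Standard volume-rendering weights for samples along each ray.
    density: (n_rays, n_samples) array of non-negative densities."""
    alpha = 1.0 - np.exp(-density * delta)
    # Transmittance: fraction of light reaching each sample unoccluded.
    ones = np.ones_like(alpha[:, :1])
    trans = np.cumprod(np.concatenate([ones, 1.0 - alpha[:, :-1]], axis=1), axis=1)
    return trans * alpha

def inverse_render_mask(mask_2d, weights, voxel_idx, n_voxels, neg_weight=0.15):
    """Project one view's 2D mask into per-voxel confidence scores.
    Rays inside the mask add their rendering weights to the voxels they hit;
    rays outside subtract a scaled amount (neg_weight is a hypothetical knob)."""
    sign = np.where(mask_2d[:, None] > 0, 1.0, -neg_weight)
    scores = np.zeros(n_voxels)
    np.add.at(scores, voxel_idx.ravel(), (sign * weights).ravel())
    return scores

def render_mask(scores, weights, voxel_idx):
    """Render the 3D confidence scores back to one value per ray (pixel)."""
    return (weights * scores[voxel_idx]).sum(axis=1)

def select_prompt(rendered):
    """Simplified self-prompting: pick the most confident pixel as a point prompt."""
    return int(np.argmax(rendered))

# Toy scene: 2 rays, 3 samples each, 6 voxels (one per sample).
density = np.array([[0.0, 5.0, 5.0],
                    [0.0, 0.0, 5.0]])
voxel_idx = np.array([[0, 1, 2],
                      [3, 4, 5]])
mask_2d = np.array([1, 0])  # stand-in for a SAM mask: ray 0 hits the object

weights = render_weights(density)
scores = inverse_render_mask(mask_2d, weights, voxel_idx, n_voxels=6)
rendered = render_mask(scores, weights, voxel_idx)
prompt_pixel = select_prompt(rendered)
```

In this toy scene the first dense sample on the masked ray receives almost all of the confidence, while the occluded sample behind it receives very little, which is the sense in which density guides the projection. In a full pipeline the rendered mask would be computed for a different camera, and the selected pixel would be passed to SAM as the point prompt for that view.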
Authors:
- Jiazhong Cen
- Zanwei Zhou
- Jiemin Fang
- Chen Yang
- Wei Shen
- Lingxi Xie
- Xiaopeng Zhang
- Qi Tian