OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views (2404.03650v1)
Abstract: Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of the visual-language models. Indeed, point clouds and 3D meshes typically have a lower resolution than images, and the reconstructed 3D scene geometry might not project well to the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF, which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF; however, our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. Our OpenNeRF further leverages NeRF's ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
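The open-set labeling and the mIoU metric mentioned in the abstract can be illustrated with a small sketch. It assumes a common open-vocabulary setup in which every 3D point carries a feature vector in the same embedding space as the class-name text embeddings, and each point takes the class with the highest cosine similarity. This is not necessarily the authors' exact pipeline: the function names (`cosine`, `segment_points`, `mean_iou`) and the toy 3-dimensional vectors are hypothetical stand-ins for real VLM features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def segment_points(point_feats, text_feats):
    """Label each point with the index of its most similar text embedding."""
    labels = []
    for feat in point_feats:
        sims = [cosine(feat, text) for text in text_feats]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes that occur in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy embeddings (assumption: illustrative only, not real CLIP features).
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # e.g. "chair", "table"
point_feats = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.6, 0.5, 0.0],
    [0.1, 0.9, 0.0],
]
gt = [0, 1, 1, 1]  # hypothetical ground-truth class per point

pred = segment_points(point_feats, text_feats)
score = mean_iou(pred, gt, num_classes=2)
print(pred, round(score, 4))
```

Because the class list is just a set of text embeddings, changing the vocabulary at query time only means swapping `text_feats`; no retraining is implied by this sketch.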
Authors:
- Francis Engelmann
- Fabian Manhardt
- Michael Niemeyer
- Keisuke Tateno
- Marc Pollefeys
- Federico Tombari