GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane (2405.17596v2)
Abstract: 3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods. Our project page is available at https://quyans.github.io/GOI-Hyperplane/ .
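The contrast the abstract draws between a fixed threshold and an optimizable hyperplane can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`select_by_hyperplane`, `finetune_hyperplane`), the logistic loss, the learning rate, the step count, and the synthetic features are all assumptions made for illustration. The sketch initializes the hyperplane normal from a stand-in query embedding, then adjusts the normal and bias by gradient descent against binary labels that play the role of a 2D RES mask.

```python
import numpy as np


def select_by_hyperplane(features, w, b):
    """Boolean mask of features on the positive side of w.x + b = 0."""
    return features @ w + b > 0.0


def finetune_hyperplane(features, labels, w_init, b_init=0.0, lr=0.5, steps=500):
    """Fit (w, b) by gradient descent on a logistic loss.

    `labels` are 0/1 targets, a hypothetical stand-in for per-feature
    supervision derived from a RES mask.
    """
    w = np.asarray(w_init, dtype=np.float64).copy()
    b = float(b_init)
    n = len(labels)
    for _ in range(steps):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels  # derivative of the logistic loss w.r.t. logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


# Synthetic low-dimensional "semantic features" (assumed, not real data).
rng = np.random.default_rng(0)
dim = 8
query = np.zeros(dim)
query[0] = 1.0  # stand-in for a compressed query text embedding

# Both groups align positively with the query, so a hyperplane through the
# origin along the query direction selects everything.
target = rng.normal(0.0, 0.1, size=(100, dim)) + 2.0 * query
others = rng.normal(0.0, 0.1, size=(100, dim)) + 1.0 * query
features = np.vstack([target, others])
labels = np.concatenate([np.ones(100), np.zeros(100)])

# Initial hyperplane: normal = query embedding, zero bias.
mask_init = select_by_hyperplane(features, query, 0.0)
acc_init = float(np.mean(mask_init == labels.astype(bool)))

# Fine-tuned hyperplane.
w_tuned, b_tuned = finetune_hyperplane(features, labels, query)
mask_tuned = select_by_hyperplane(features, w_tuned, b_tuned)
acc_tuned = float(np.mean(mask_tuned == labels.astype(bool)))

print(f"accuracy before fine-tuning: {acc_init:.2f}")
print(f"accuracy after fine-tuning:  {acc_tuned:.2f}")
```

In this toy setup the untuned hyperplane keeps every feature that points roughly toward the query, while the tuned one also learns an offset that separates the target group from merely related features. A fixed threshold on query similarity would amount to choosing that offset by hand.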
Authors:
- Yansong Qu
- Shaohui Dai
- Xinyang Li
- Jianghang Lin
- Liujuan Cao
- Rongrong Ji
- ShengChuan Zhang