
GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane (2405.17596v2)

Published 27 May 2024 in cs.CV

Abstract: 3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods. Our project page is available at https://quyans.github.io/GOI-Hyperplane/.
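The querying idea the abstract describes, replacing a fixed similarity threshold with an optimizable hyperplane in the semantic feature space, can be sketched compactly. The following Python snippet is illustrative only: the 16-dimensional compressed features, the per-Gaussian binary labels, and the training loop are assumptions standing in for the paper's rendered-mask RES supervision, not the authors' implementation.

# A minimal sketch of hyperplane-based selection vs. fixed thresholding.
# Shapes, names, and supervision here are illustrative assumptions.
import torch

def threshold_select(feats, text_emb, tau=0.5):
    """Baseline: keep Gaussians whose cosine similarity to the query
    embedding exceeds a fixed, manually chosen threshold tau."""
    sims = torch.nn.functional.cosine_similarity(feats, text_emb[None, :], dim=1)
    return sims > tau

class SemanticHyperplane(torch.nn.Module):
    """Selection as a learnable hyperplane w . f + b > 0 in the
    compressed semantic feature space, initialized from the query."""
    def __init__(self, text_emb):
        super().__init__()
        self.w = torch.nn.Parameter(text_emb.clone())
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, feats):
        # Signed distance of each per-Gaussian feature to the hyperplane.
        return feats @ self.w + self.b

def finetune_hyperplane(plane, feats, labels, steps=100, lr=1e-2):
    """Fine-tune (w, b) so that features labeled positive end up on the
    positive side of the hyperplane. `labels` is a per-Gaussian binary
    stand-in for the 2D RES-mask supervision described in the paper."""
    opt = torch.optim.Adam(plane.parameters(), lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            plane(feats), labels.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return plane

# Toy usage with random data standing in for compressed 3DGS features.
torch.manual_seed(0)
feats = torch.randn(1000, 16)        # low-dim features embedded in 3DGS
text_emb = torch.randn(16)           # compressed query text embedding
labels = feats @ text_emb > 1.0      # pretend RES-derived supervision
plane = finetune_hyperplane(SemanticHyperplane(text_emb), feats, labels)
selected = plane(feats) > 0          # Gaussians of interest

In the actual method the positive/negative supervision comes from comparing renderings against a 2D RES mask; the toy labels above merely let the sketch run end to end while preserving the key contrast: the decision boundary is optimized per query rather than fixed by a global threshold.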

Authors (7)
  1. Yansong Qu (15 papers)
  2. Shaohui Dai (3 papers)
  3. Xinyang Li (61 papers)
  4. Jianghang Lin (11 papers)
  5. Liujuan Cao (73 papers)
  6. Rongrong Ji (315 papers)
  7. Shengchuan Zhang (41 papers)
Citations (5)
