LangSplat: 3D Language Gaussian Splatting (2312.16084v2)
Abstract: Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat advances the field by utilizing a collection of 3D Gaussians, each encoding language features distilled from CLIP, to represent the language field. By employing a tile-based splatting technique for rendering language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, thereby eliminating the need to extensively query the language field across various scales and to regularize with DINO features. Extensive experimental results show that LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a 199 $\times$ speedup compared to LERF at a resolution of 1440 $\times$ 1080. We strongly recommend that readers check out our video results at https://langsplat.github.io/
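The key efficiency idea in the abstract is compressing high-dimensional CLIP embeddings into a compact, scene-specific latent space with an autoencoder before attaching language features to the 3D Gaussians. Below is a minimal sketch of such a scene-wise language autoencoder, assuming 512-d CLIP features, a 3-d latent, and an MSE reconstruction loss; the layer widths, latent size, names (`SceneLanguageAutoencoder`, `clip_feats`), and training loop are illustrative assumptions, not the paper's exact architecture or released code.

```python
# Sketch of a scene-wise language autoencoder: compress per-pixel CLIP
# features (512-d) into a low-dimensional scene-specific latent space.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class SceneLanguageAutoencoder(nn.Module):
    def __init__(self, clip_dim: int = 512, latent_dim: int = 3):
        super().__init__()
        # Encoder: CLIP space -> compact scene-specific latent space
        self.encoder = nn.Sequential(
            nn.Linear(clip_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Decoder: latent space -> back to CLIP space for querying
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, clip_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Toy training loop on placeholder features standing in for the scene's
# CLIP embeddings (in practice these would come from the training views).
model = SceneLanguageAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clip_feats = torch.randn(1024, 512)  # hypothetical per-pixel CLIP features
for _ in range(100):
    recon = model(clip_feats)
    loss = nn.functional.mse_loss(recon, clip_feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this kind of design, the Gaussians carry only the low-dimensional latents; rendered per-pixel latents can then be decoded back to CLIP space for open-vocabulary matching against text embeddings.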
- Scanqa: 3d question answering for spatial scene understanding. In CVPR, pages 19129–19139, 2022.
- Contrastive lift: 3d object instance segmentation by slow-fast contrastive fusion. arXiv preprint arXiv:2306.04633, 2023.
- Nancy Bonvillain. Language, culture, and communication: The meaning of messages. Rowman & Littlefield, 2019.
- Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021.
- Simvqa: Exploring simulated environments for visual question answering. In CVPR, pages 5056–5066, 2022.
- Segment anything in 3d with nerfs. 2023.
- pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR, pages 5799–5809, 2021.
- Tensorf: Tensorial radiance fields. In ECCV, pages 333–350. Springer, 2022.
- Open-vocabulary queryable scene representations for real world planning. In ICRA, pages 11509–11522. IEEE, 2023a.
- Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585, 2023b.
- Segment and track anything. arXiv preprint arXiv:2305.06558, 2023.
- Editanything: Empowering unparalleled flexibility in image editing and generation. In ACM MM, Demo track, 2023.
- Iqa: Visual question answering in interactive environments. In CVPR, pages 4089–4098, 2018.
- Visual language maps for robot navigation. In ICRA, pages 10608–10615. IEEE, 2023a.
- Instruct2act: Mapping multi-modality instructions to robotic actions with large language model. arXiv preprint arXiv:2305.11176, 2023b.
- Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023.
- 3d gaussian splatting for real-time radiance field rendering. TOG, 42(4):1–14, 2023.
- Lerf: Language embedded radiance fields. In ICCV, pages 19729–19739, 2023.
- Segment anything. In ICCV, 2023.
- Decomposing nerf for editing via feature field distillation. NeurIPS, 35:23311–23330, 2022.
- Language-driven semantic segmentation. In ICLR, 2022.
- Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, pages 7061–7070, 2023.
- Weakly supervised 3d open-vocabulary segmentation. In NeurIPS, 2023a.
- Segment any point cloud sequences by distilling vision foundation models. arXiv preprint arXiv:2306.09347, 2023b.
- Can sam boost video super-resolution? arXiv preprint arXiv:2305.06524, 2023.
- Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
- Segment anything in medical images. arXiv preprint arXiv:2304.12306, 2023.
- On the opportunities and challenges of foundation models for geospatial artificial intelligence. arXiv preprint arXiv:2304.06798, 2023.
- Segment anything model for medical image analysis: an experimental study. Medical Image Analysis, 89:102918, 2023.
- Instant neural graphics primitives with a multiresolution hash encoding. TOG, 41(4):1–15, 2022.
- Nerfies: Deformable neural radiance fields. In ICCV, pages 5865–5874, 2021.
- D-nerf: Neural radiance fields for dynamic scenes. In CVPR, pages 10318–10327, 2021.
- Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022.
- Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:2304.10261, 2023a.
- Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931, 2023b.
- Panoptic lifting for 3d scene understanding with neural fields. In CVPR, pages 9043–9052, 2023.
- Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, pages 5459–5469, 2022.
- Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
- Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In 3DV, pages 443–453. IEEE, 2022.
- 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
- Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, pages 2955–2966, 2023.
- Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023a.
- Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023b.
- Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023c.
- Matte anything: Interactive natural image matting with segment anything models. arXiv preprint arXiv:2306.04121, 2023.
- Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
- pixelnerf: Neural radiance fields from one or few images. In CVPR, pages 4578–4587, 2021.
- Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
- In-place scene labelling and understanding with implicit scene representation. In ICCV, pages 15838–15847, 2021.
- Ewa volume splatting. In VIS, pages 29–538. IEEE, 2001.