
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting (2403.15624v2)

Published 22 Mar 2024 in cs.CV

Abstract: Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Existing methods adopt neural rendering methods as 3D representations and jointly optimize color and semantic features to achieve rendering and scene understanding simultaneously. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is to distill knowledge from 2D pre-trained models to 3D Gaussians. Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, which is based on spatial relationships and needs no additional training. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. The quantitative results on ScanNet segmentation and LERF object localization demonstrate the superior performance of our method. Additionally, we explore several applications of Semantic Gaussians including object part segmentation, instance segmentation, scene editing, and spatiotemporal segmentation with better qualitative results over 2D and 3D baselines, highlighting its versatility and effectiveness in supporting diverse downstream tasks.

Authors (5)
  1. Jun Guo (130 papers)
  2. Xiaojian Ma (52 papers)
  3. Yue Fan (46 papers)
  4. Huaping Liu (97 papers)
  5. Qing Li (430 papers)
Citations (8)

Summary

  • The paper introduces a novel framework that integrates 2D semantic projections with a 3D semantic network for improved open-vocabulary scene understanding.
  • It demonstrates significant performance gains, including a 4.2% mIoU improvement on ScanNet-20 for semantic segmentation.
  • The approach enables interactive applications like scene editing via language-guided commands, showcasing its potential in AR and robotics.

Semantic Gaussians: Advancing Open-Vocabulary 3D Scene Understanding through 3D Gaussian Splatting

Introduction

Comprehending and interpreting 3D scenes through open-vocabulary descriptions is a challenging yet significant task in computer vision, pivotal for advances in augmented reality and robotics. Traditional methodologies built on Neural Radiance Fields (NeRFs) and other 3D representations have paved the way for analyzing 3D scenes. This paper introduces "Semantic Gaussians," an innovative approach employing 3D Gaussian Splatting for open-vocabulary scene understanding. By distilling pre-trained 2D semantics into 3D, this method demonstrates notable improvements over existing strategies without the additional training that NeRF-based techniques require.

Methodology

The core of Semantic Gaussians is the introduction of a semantic component to 3D Gaussian points, effectively enabling semantic understanding of scenes through a two-fold process: a projection framework and a 3D semantic network.

  • Versatile Projection Framework: At the heart of this approach is the mapping of 2D semantic features onto 3D Gaussian points. This is achieved by establishing correspondence between 2D pixels and 3D points through projection and thereafter assigning semantic features to each 3D Gaussian point. The framework is flexible and supports various pre-trained 2D models, such as CLIP or OpenSeg, facilitating the use of pixel-wise semantic features from 2D RGB images to enhance scene understanding.
  • 3D Semantic Network: To complement projection, a 3D semantic network directly predicts semantic components from raw 3D Gaussians. Utilizing a 3D sparse convolution network (e.g., MinkowskiNet), this model processes RGB Gaussians to predict semantic embeddings, allowing rapid inference of semantic components.

The integration of these two processes aids in realizing a detailed open-vocabulary understanding of 3D scenes from both 2D and 3D perspectives.
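As an illustration of the projection step, the pixel-to-point correspondence can be sketched as a pinhole projection of Gaussian centers into a per-pixel feature map. This is a minimal sketch under simplifying assumptions (nearest-pixel lookup, a single view); the function name and interface are our own, not the paper's implementation:

```python
import numpy as np

def project_features_to_gaussians(points_3d, feat_map, K, w2c):
    """Assign 2D per-pixel semantic features to 3D Gaussian centers.

    points_3d: (N, 3) Gaussian centers in world coordinates.
    feat_map:  (H, W, C) per-pixel features from a 2D encoder (e.g. OpenSeg).
    K:         (3, 3) camera intrinsics; w2c: (4, 4) world-to-camera extrinsics.
    Returns (N, C) features and an (N,) visibility mask.
    """
    n = points_3d.shape[0]
    homo = np.hstack([points_3d, np.ones((n, 1))])          # homogeneous coords
    cam = (w2c @ homo.T).T[:, :3]                            # camera-frame points
    in_front = cam[:, 2] > 1e-6                              # behind-camera cull
    pix = (K @ cam.T).T
    pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)      # perspective divide
    h, w, c = feat_map.shape
    u = np.round(pix[:, 0]).astype(int)
    v = np.round(pix[:, 1]).astype(int)
    visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    feats = np.zeros((n, c), dtype=feat_map.dtype)
    feats[visible] = feat_map[v[visible], u[visible]]        # nearest-pixel lookup
    return feats, visible
```

In the multi-view setting described by the paper, features gathered from several frames would be fused per Gaussian (e.g. averaged over the views in which it is visible).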

Experiments and Results

The effectiveness of Semantic Gaussians is evaluated through several applications, demonstrating significant improvements in the domains of semantic segmentation, object part segmentation, scene editing, and spatiotemporal segmentation.

Semantic Segmentation on ScanNet-20

In a comparative evaluation on the ScanNet-20 dataset for semantic segmentation, Semantic Gaussians outperformed existing methods, achieving a 4.2% mIoU and 4.0% mAcc improvement. Notably, the versatile projection framework and the 3D semantic network each contributed distinctly to this performance enhancement, illustrating the method's efficacy in integrating semantic knowledge from 2D models into a 3D context.
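Open-vocabulary segmentation with per-Gaussian features amounts to comparing each Gaussian's semantic feature against CLIP-style text embeddings of the class prompts and taking the best match. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def open_vocab_labels(gauss_feats, text_embeds):
    """Label each Gaussian with its closest class prompt.

    gauss_feats: (N, C) semantic features on the Gaussians.
    text_embeds: (K, C) text embeddings of the K class prompts.
    Returns (N,) class indices under cosine similarity.
    """
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = g @ t.T                      # (N, K) cosine similarities
    return sim.argmax(axis=1)
```

Because the class set enters only through the text embeddings, new categories can be queried at inference time without retraining, which is the essence of the open-vocabulary setting.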

Object Part Segmentation, Scene Editing, and Spatiotemporal Segmentation

Further qualitative evaluation on tasks such as object part segmentation and scene editing revealed Semantic Gaussians' versatility and superior performance over baseline 2D and 3D methods. Specifically, in scene editing, the method demonstrated its capacity to accurately interpret and modify scenes through language-guided commands, showcasing its potential in interactive applications.
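Language-guided editing of this kind can be approximated by thresholding the same feature-to-text similarity and modifying the attributes of the matched Gaussians, e.g. zeroing their opacity to delete an object. A hypothetical sketch (threshold, names, and the opacity-zeroing edit are our own simplifications):

```python
import numpy as np

def delete_by_query(opacities, gauss_feats, text_embed, threshold=0.8):
    """Zero the opacity of Gaussians whose semantic feature matches the query.

    opacities:   (N,) per-Gaussian opacities.
    gauss_feats: (N, C) semantic features on the Gaussians.
    text_embed:  (C,) text embedding of the edit command's target phrase.
    Returns the edited opacities and the boolean selection mask.
    """
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    mask = g @ t > threshold           # Gaussians belonging to the queried object
    out = opacities.copy()
    out[mask] = 0.0                    # deleted Gaussians render as transparent
    return out, mask
```

Other edits (recoloring, moving an object) would apply different transforms to the same selection mask.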

Discussion

Semantic Gaussians introduce a novel paradigm in 3D scene understanding by efficiently incorporating semantic information from pre-trained 2D sources into 3D environments. This method not only broadens the scope of scene analysis but also provides a platform for more intuitive human-computer interactions in immersive environments.

However, the performance of Semantic Gaussians is contingent upon the quality of input from pre-trained 2D models and the accurate representation of scenes through 3D Gaussians. Future developments in both pre-trained 2D models and 3D Gaussian Splatting techniques are expected to further enhance the capabilities of Semantic Gaussians.

Conclusion

Semantic Gaussians set a new standard in open-vocabulary 3D scene understanding by ingeniously leveraging 3D Gaussian Splatting. With its flexible framework and direct prediction capabilities, it holds promise for advancing applications in augmented reality, robotics, and beyond. This research paves the way for a deeper integration of linguistic and visual data, promising exciting developments in the field of computer vision and artificial intelligence.