Weakly Supervised 3D Open-Vocabulary Segmentation
The paper "Weakly Supervised 3D Open-vocabulary Segmentation" addresses a salient challenge in the computer vision community: achieving semantic segmentation of 3D scenes with an open vocabulary through weak supervision. This research is motivated by the need for open-vocabulary segmentation in diverse applications such as robot navigation, autonomous driving, and augmented reality. Traditional methods for 3D segmentation rely heavily on large, annotated datasets. However, such datasets are labor-intensive to create and restrict models to a fixed, closed vocabulary of classes.
The authors approach this challenge by utilizing pre-trained foundation models, namely CLIP and DINO, in a novel weakly supervised framework. CLIP, known for its robust association of text and image data, offers a wealth of multimodal knowledge captured from internet-scale datasets. DINO, trained with self-supervision on unlabelled images, contributes detailed information about scene layouts and object boundaries.
Methodology
The methodology presented is both innovative and practical. The authors propose a pipeline that distills open-vocabulary multimodal knowledge from CLIP and DINO into Neural Radiance Fields (NeRF), with the segmentation being guided solely by open-vocabulary text descriptions. A critical aspect of the process is that no manual segmentation annotations are required, differentiating this work from existing supervised learning approaches and maintaining the integrity of open-vocabulary potential.
To address the challenge of extracting pixel-level features from CLIP—which typically operates at the image level—the authors devised several mechanisms:
- Hierarchical Image Patches and 3D Selection Volume: These components work in tandem to align CLIP’s image-level features to the pixel level without requiring fine-tuning. This alignment enables the transformation of textual descriptions into usable guidance for 3D segmentation.
- Relevancy-Distribution Alignment (RDA) Loss: This loss function is introduced to resolve ambiguities in CLIP-derived features by aligning the segmentation probability distribution with class relevancies reflecting similarities between textual and visual features.
- Feature-Distribution Alignment (FDA) Loss: To distill boundary information effectively from DINO features, FDA loss ensures that segments with similar visual features share similar label distributions, while dissimilar segments have distinct label distributions.
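The RDA idea above can be illustrated with a small sketch. This is not the authors' exact formulation: the per-class min-max normalization, the temperature value, the cross-entropy form, and the array shapes are all illustrative assumptions; the intent is only to show how per-pixel CLIP relevancy scores can be turned into a soft target distribution for the segmentation head.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rda_loss(seg_logits, relevancy, temperature=0.1):
    """Relevancy-Distribution Alignment, as a sketch.

    seg_logits : (N, C) predicted per-pixel class logits
    relevancy  : (N, C) CLIP text-image relevancy scores

    Each class's relevancy is normalized across pixels so that classes
    with globally low CLIP scores are not suppressed, then the predicted
    distribution is aligned to the sharpened relevancy distribution via
    cross-entropy (equivalent to KL divergence up to a constant).
    """
    r = relevancy.astype(np.float64)
    r_min = r.min(axis=0, keepdims=True)
    r_max = r.max(axis=0, keepdims=True)
    r_norm = (r - r_min) / (r_max - r_min + 1e-8)   # per-class normalization
    target = softmax(r_norm / temperature, axis=1)  # sharpened soft target
    pred = softmax(seg_logits, axis=1)
    return float(-(target * np.log(pred + 1e-8)).sum(axis=1).mean())
```

With this shape of loss, predictions that agree with the relevancy pattern yield a lower value than predictions that contradict it, which is the alignment behavior the RDA loss is described as enforcing.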
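The FDA principle—similar DINO features imply similar label distributions, dissimilar features imply distinct ones—can likewise be sketched as a pairwise loss. The thresholds, the cosine-similarity choice, and the agreement measure are assumptions made for illustration, not the paper's exact objective.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fda_loss(dino_features, seg_logits, pos_thresh=0.9, neg_thresh=0.1):
    """Feature-Distribution Alignment, as a sketch.

    dino_features : (N, D) self-supervised DINO features per pixel
    seg_logits    : (N, C) predicted per-pixel class logits

    Pixel pairs with high DINO cosine similarity are pulled toward
    agreeing label distributions; clearly dissimilar pairs are pushed
    toward distinct ones.
    """
    f = dino_features / (np.linalg.norm(dino_features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                    # (N, N) cosine similarity of DINO features
    p = softmax(seg_logits, axis=1)
    agree = p @ p.T                  # prob. two pixels receive the same label
    off_diag = ~np.eye(len(p), dtype=bool)
    pos = (sim > pos_thresh) & off_diag
    neg = (sim < neg_thresh) & off_diag
    loss = 0.0
    if pos.any():
        loss += float(-np.log(agree[pos] + 1e-8).mean())        # pull together
    if neg.any():
        loss += float(-np.log(1.0 - agree[neg] + 1e-8).mean())  # push apart
    return loss
```

Because the supervision is purely pairwise, this kind of loss transfers DINO's boundary structure into the segmentation without ever naming a class, which is why it complements the text-driven RDA term.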
Results and Implications
Extensive experiments demonstrate the effectiveness of the proposed method. Remarkably, the approach exceeds the performance of some fully supervised models in specific scenarios, showcasing the viability of learning robust 3D segmentation from 2D images and text-image pairs. This result points toward more scalable and accessible computer vision systems that do not depend on exhaustive annotation.
The implications of these findings are multifaceted:
- Theoretical: This work paves the way for further research into distilling complex multimodal knowledge into unified frameworks for various computer vision tasks. It challenges the traditional reliance on large annotated datasets in favor of more flexible, annotation-free learning methods.
- Practical: The reduction in reliance on detailed manual annotations could greatly expedite the development of real-world applications where data diversity and segmentation detail are crucial.
In summary, this research presents a commendable stride towards practical and efficient 3D scene understanding, harnessing weak supervision to unlock the potential of pre-trained foundation models without sacrificing vocabulary breadth or requiring exhaustive annotations. Future work could explore enhancing this framework, discovering further synergies between foundation models, and developing even more robust, refined segmentation techniques.