CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
The paper "CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP" presents an innovative framework that leverages the Contrastive Language-Image Pre-training (CLIP) model to enhance 3D scene understanding tasks, particularly focusing on semantic segmentation. The proposed CLIP2Scene framework is tailored to address the challenges of label-efficient learning within 3D environments, utilizing the robust representational capabilities of CLIP which have been effectively demonstrated in 2D applications.
Overview of CLIP2Scene
At the heart of the work is the CLIP2Scene framework, which transfers knowledge from CLIP, a model trained on 2D image-text pairs, into the 3D domain. This is accomplished through a Semantic-driven Cross-modal Contrastive Learning framework for pre-training 3D point cloud networks. The approach combines semantic consistency regularization with spatial-temporal consistency regularization, so that knowledge flows from the image and text modalities into the point network while the learned point features stay semantically meaningful for downstream tasks such as semantic segmentation.
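To make the transfer concrete, the sketch below shows how class-name prompts can be encoded into CLIP text embeddings, which then serve as semantic anchors for both pre-training and annotation-free inference. It is a minimal sketch using the public openai/CLIP package and PyTorch; the class list and prompt template are illustrative, not the paper's exact configuration.

```python
import clip   # the openai/CLIP package
import torch

# Illustrative class names (a nuScenes-style subset); the paper's class list
# and prompt engineering may differ.
CLASS_NAMES = ["car", "truck", "pedestrian", "vegetation", "road", "building"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

prompts = [f"a photo of a {name}" for name in CLASS_NAMES]
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_embed = model.encode_text(tokens).float()                   # (C, D)
    text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)  # unit length
```

Because the embeddings are unit-normalized, dot products with any other unit-normalized feature vector directly give cosine similarities, which is what the regularization terms and the zero-shot classifier below rely on.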
Key Methodological Contributions
- Semantic-driven Consistency Regularization: The method uses CLIP's text semantics to select positive and negative point samples, avoiding the conflicts that arise in conventional contrastive learning when semantically similar points are pushed apart as negatives. This yields a more semantically coherent representation of the 3D scene (see the first sketch after this list).
- Spatial-Temporal Consistency Regularization: This component enforces consistency between point features gathered over a short temporal window and the CLIP image features at the pixels those points project onto. By tying the two modalities together while respecting the temporal dynamics of point cloud data, it improves the quality of the learned 3D representations (see the second sketch after this list).
- Zero-shot (Annotation-free) Capabilities: The pre-trained network performs 3D semantic segmentation without any labeled 3D data, reaching 20.8% mIoU on nuScenes and 25.08% mIoU on ScanNet, which the authors report as the first annotation-free results of this kind (see the third sketch after this list).
- Improved Fine-tuning Performance: When fine-tuned with either 1% or 100% of the labels, the approach outperforms existing self-supervised pre-training methods, improving mIoU by roughly 8% and 1%, respectively.
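The first sketch shows one way the semantic-driven term can be realized as a text-anchored contrastive loss in PyTorch. The function name, the per-point cross-entropy formulation, and the assumption that pseudo-labels arrive by projecting CLIP's dense pixel predictions onto the points are illustrative simplifications, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(point_feats, pseudo_labels, text_embed, temperature=0.07):
    """Text-anchored contrastive objective (sketch only).

    point_feats   : (N, D) per-point features from the 3D backbone
    pseudo_labels : (N,)   per-point class indices, e.g. obtained by projecting
                           CLIP's dense pixel predictions onto the point cloud
    text_embed    : (C, D) unit-normalized CLIP text embeddings of the classes
    """
    point_feats = F.normalize(point_feats, dim=-1)
    logits = point_feats @ text_embed.t() / temperature  # (N, C) point-text similarity
    # Each point is pulled toward the text embedding of its pseudo class and pushed
    # away from the other classes, so negatives never share the anchor's semantics.
    return F.cross_entropy(logits, pseudo_labels)
```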
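The second sketch reduces the spatial-temporal term to a point-to-pixel cosine consistency over a temporal window; the paper's full formulation is richer, and the pairing of points with pixels (via calibrated camera projection) is assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def spatio_temporal_consistency_loss(point_feats, pixel_feats):
    """Cosine consistency between point features and the CLIP pixel features
    they project onto within a short temporal window (sketch only).

    point_feats : (M, D) features of points visible from the calibrated cameras
                        across the temporal window
    pixel_feats : (M, D) CLIP image features at the corresponding pixels
    """
    point_feats = F.normalize(point_feats, dim=-1)
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    # Penalize the cosine distance between each paired cross-modal feature.
    return (1.0 - (point_feats * pixel_feats).sum(dim=-1)).mean()
```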
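The third sketch shows how annotation-free (zero-shot) predictions of the kind reported above can be produced at inference time: each point is assigned the class whose CLIP text embedding is most similar to its feature. Function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def annotation_free_segmentation(point_feats, text_embed):
    """Zero-shot 3D semantic segmentation (sketch only).

    point_feats : (N, D) features from the pre-trained 3D backbone
    text_embed  : (C, D) unit-normalized CLIP text embeddings of the class names
    """
    point_feats = F.normalize(point_feats, dim=-1)
    scores = point_feats @ text_embed.t()   # (N, C) cosine similarities
    return scores.argmax(dim=-1)            # per-point class index
```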
Implications and Future Directions
The implications of this work are practical and theoretical. Practically, it shows how language-image pre-trained models such as CLIP can be extended to 3D scene parsing, reducing the dependence on large amounts of labeled 3D data. Theoretically, it opens new avenues in cross-modal learning by showing that semantic-driven regularization, combined with spatial-temporal consistency, improves model generalization in 3D environments.
Future research could optimize how pre-trained 2D knowledge is transferred to 3D networks, study the calibration between modalities in more depth, and extend the approach to other types of 3D data. As models continue to evolve, integrating linguistic, visual, and spatial representations can substantially improve scene understanding across applications such as autonomous driving and robotic navigation.
Overall, CLIP2Scene marks a significant advancement in utilizing multimodal pre-trained models for 3D understanding tasks, offering insights and methodologies that could be pivotal for future developments in label-efficient learning paradigms in artificial intelligence.