Understanding 3D Scenes with Limited Labels
Background
The task of 3D scene parsing has become increasingly important with the proliferation of 3D sensors like LiDAR and RGB-D cameras. Understanding 3D scenes involves complex tasks such as point cloud semantic segmentation, instance segmentation, and object detection. While deep neural networks have shown promising results in these areas, they typically require extensive labeled datasets for training, which can be expensive and time-consuming to obtain.
Challenges in 3D Scene Parsing
3D recognition models face two major challenges:
- Closed-set Assumption: Most models can only recognize the categories they were trained on and struggle to generalize to novel classes absent from the training data.
- Reliance on Large-Scale Labeled Data: Good performance typically requires vast amounts of labeled data, which is not always feasible to collect.
A Novel Approach
A new framework, WS3D++, addresses both the closed-set assumption and the reliance on large-scale labeled data: it is designed to work effectively when only a limited number of labeled scenes are available for training.
Unsupervised Learning for 3D Data
To recognize novel categories and make efficient use of unlabeled data, two strategies are proposed:
- Hierarchical Feature Alignment: This novel pre-training method establishes meaningful associations between 3D point clouds and the visual and linguistic features of large-scale vision-language models. Rendering techniques construct 2D views from the 3D scenes, over which elaborate coarse-to-fine vision-language associations are established (see the first sketch after this list).
- Knowledge Distillation: An effective knowledge distillation strategy transfers the vision-language-aligned representations of pre-trained vision-language models into 3D neural networks (see the second sketch after this list).
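To make the alignment step concrete, the following Python sketch embeds rendered 2D views and class-name prompts with CLIP and compares them at two granularities: whole views (coarse) and region crops (fine). The rendering step, the function names, and the prompt template are illustrative assumptions, not the released WS3D++ code.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_views_and_text(view_images, class_names):
    """Embed rendered views (coarse, scene level) and class-name prompts.

    view_images: (V, 3, 224, 224) tensor, already CLIP-preprocessed, on `device`.
    """
    with torch.no_grad():
        image_feats = model.encode_image(view_images)          # (V, D)
        tokens = clip.tokenize(
            [f"a photo of a {c}" for c in class_names]).to(device)
        text_feats = model.encode_text(tokens)                 # (C, D)
    return (F.normalize(image_feats.float(), dim=-1),
            F.normalize(text_feats.float(), dim=-1))

def coarse_to_fine_alignment(view_images, region_crops, class_names):
    """Coarse scene-level similarities, then finer region-crop similarities."""
    view_feats, text_feats = encode_views_and_text(view_images, class_names)
    scene_sim = view_feats @ text_feats.T                      # (V, C)
    with torch.no_grad():
        region_feats = F.normalize(
            model.encode_image(region_crops).float(), dim=-1)  # (R, D)
    region_sim = region_feats @ text_feats.T                   # (R, C)
    return scene_sim, region_sim
```

Scene-level similarities give a coarse signal about which categories are present in a view, while region-level similarities localize them; this pairing mirrors the coarse-to-fine hierarchy described above.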
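The distillation step can likewise be sketched as follows. Assuming each 3D point has been matched to a pixel in a rendered view, so a frozen 2D vision-language feature is available per point, a 3D backbone is trained to reproduce those features. The backbone here is a stand-in MLP and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointBackbone(nn.Module):
    """Stand-in for a real 3D backbone (e.g. a sparse-conv U-Net): a small MLP."""
    def __init__(self, in_dim=3, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, points):             # points: (N, 3)
        return self.mlp(points)            # (N, feat_dim)

def distillation_loss(point_feats, target_2d_feats):
    """Cosine distillation: pull each 3D point feature toward the frozen
    vision-language feature of its matched pixel."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(target_2d_feats, dim=-1)
    return (1.0 - (p * t).sum(dim=-1)).mean()

# One optimization step on unlabeled points (dummy data for illustration).
backbone = PointBackbone()
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)
points = torch.randn(1024, 3)              # point coordinates
targets = torch.randn(1024, 512)           # precomputed 2D features (frozen)
loss = distillation_loss(backbone(points), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the 2D targets are already aligned with language, the 3D network inherits open-vocabulary behavior without needing 3D labels for that step.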
Enhanced Performance
Fine-tuning combines an energy-based optimization objective that incorporates boundary information with a new region-level contrastive learning strategy, improving the model's ability to segment and detect objects in 3D space. Together, the two components enable better discrimination of instances and regions within a 3D scene while also exploiting unlabeled data; sketches of both follow.
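A hedged sketch of the boundary-aware energy idea: a unary supervised term on the few labeled points is combined with a pairwise smoothness term over neighboring points that is switched off across detected boundaries. The exact objective in WS3D++ may differ; this only illustrates the general form, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def boundary_aware_energy(logits, labels, labeled_mask, edges, boundary_score,
                          smooth_weight=0.1):
    """
    logits:         (N, C) per-point class scores
    labels:         (N,) ground-truth ids, valid only where labeled_mask is True
    labeled_mask:   (N,) bool mask of the sparse labeled points
    edges:          (E, 2) indices of neighboring point pairs (e.g., a kNN graph)
    boundary_score: (E,) in [0, 1], close to 1 when the edge crosses a boundary
    """
    # Unary term: supervised loss on the handful of labeled points.
    unary = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    # Pairwise term: neighboring points should agree, unless a boundary
    # separates them (then the smoothness penalty is relaxed).
    probs = F.softmax(logits, dim=-1)
    p_i, p_j = probs[edges[:, 0]], probs[edges[:, 1]]
    disagreement = (p_i - p_j).abs().sum(dim=-1)       # L1 gap per edge
    pairwise = ((1.0 - boundary_score) * disagreement).mean()
    return unary + smooth_weight * pairwise
```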
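And a minimal sketch of region-level contrastive learning on unlabeled data: per-point features from two augmented views of the same scene are mean-pooled into region embeddings (regions could come from an over-segmentation, an assumption here), and an InfoNCE loss treats matching regions as positives and all other regions as negatives.

```python
import torch
import torch.nn.functional as F

def pool_regions(point_feats, region_ids, num_regions):
    """Mean-pool per-point features into one embedding per region."""
    feat_dim = point_feats.shape[1]
    sums = point_feats.new_zeros(num_regions, feat_dim)
    sums.index_add_(0, region_ids, point_feats)
    counts = point_feats.new_zeros(num_regions)
    counts.index_add_(0, region_ids,
                      torch.ones_like(region_ids, dtype=point_feats.dtype))
    return sums / counts.clamp(min=1).unsqueeze(1)

def region_contrastive_loss(regions_a, regions_b, temperature=0.07):
    """InfoNCE: region i in view A is positive with region i in view B and
    negative with every other region."""
    a = F.normalize(regions_a, dim=-1)
    b = F.normalize(regions_b, dim=-1)
    logits = a @ b.T / temperature                       # (R, R)
    targets = torch.arange(a.shape[0], device=a.device)  # matched region indices
    return F.cross_entropy(logits, targets)
```

Here `regions_a` and `regions_b` would be the pooled backbone outputs for two augmentations of the same scene, so the loss pushes each region's embedding to be stable under augmentation while staying distinct from other regions.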
Benchmarked Success
The framework has been rigorously evaluated on large-scale benchmarks including ScanNet, SemanticKITTI, and S3DIS. WS3D++ ranks first in both semantic and instance segmentation on the ScanNet benchmark, and it outperforms state-of-the-art methods under limited-label conditions across a range of indoor and outdoor datasets.
Extensive experiments on both indoor and outdoor scenes further demonstrate its effectiveness in open-world few-shot learning and data-efficient learning.
Accessibility
To foster further research and development in this field, all code, models, and data related to the framework will be made publicly available.
Key Takeaways
- The WS3D++ framework offers a practical solution to the problem of 3D scene understanding with a limited amount of labeled data.
- It utilizes a novel combination of feature-aligned pre-training, boundary-aware fine-tuning, and a multi-stage contrastive learning strategy.
- Extensive experimentation confirms its leading performance in various scenarios, promising substantial improvements over current methods in data-efficient learning and open-world recognition.