Insights from "Self-Supervised Pretraining of 3D Features on any Point-Cloud"
The paper "Self-Supervised Pretraining of 3D Features on any Point-Cloud" by Zaiwei Zhang et al. addresses a significant gap in the landscape of 3D computer vision: the limited use of pretraining for 3D recognition tasks due to the scarcity of large labeled datasets. The authors propose a self-supervised learning framework called DepthContrast that aims to overcome this limitation by leveraging unlabeled 3D data, thus extending successful paradigms from 2D computer vision and aligning with current trends favoring self-supervised approaches.
Methodology
DepthContrast builds on the instance discrimination framework, in which a model is trained to tell individual samples apart without explicit labels. The authors extend this methodology to 3D data by applying it directly to single-view depth maps, circumventing the need for multi-view constraints or 3D registration. This makes DepthContrast applicable across data types (single- or multi-view, indoor or outdoor scenes) and across 3D architectures operating on point-cloud or voxel representations.
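To make the instance-discrimination objective concrete, the sketch below shows a minimal InfoNCE-style loss in PyTorch: embeddings of two augmented views of the same single-view depth map should match each other and differ from embeddings of other scans. The function name, tensor shapes, and the explicit negative set are illustrative assumptions and simplify the paper's actual training setup.

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(q, k, negatives, temperature=0.07):
    """InfoNCE-style instance discrimination on depth-map embeddings.
    q, k: (N, D) embeddings of two augmented views of the same depth maps.
    negatives: (K, D) embeddings of other depth maps (illustrative stand-in
    for however negatives are maintained in practice)."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    negatives = F.normalize(negatives, dim=1)
    # one positive logit per sample: each query against its own key
    l_pos = (q * k).sum(dim=1, keepdim=True)        # (N, 1)
    # negative logits: each query against embeddings of other scans
    l_neg = q @ negatives.t()                        # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # the positive sits at index 0 for every sample
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```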
The paper also introduces a joint pretraining strategy across input formats, combining within-format and across-format contrastive losses. Crucially, this joint objective aligns the feature representations produced from the different input types, improving the robustness and generality of the learned embeddings: DepthContrast treats the alternative input representations of the same depth map as additional augmentations, which lets features be learned jointly across architectures.
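A hedged sketch of how such a joint objective could be composed, reusing instance_discrimination_loss from the sketch above: the encoder interfaces, the detached keys, and the equal weighting of the four terms are assumptions for illustration rather than the paper's exact formulation.

```python
def joint_format_loss(point_encoder, voxel_encoder, view1, view2,
                      point_negatives, voxel_negatives, temperature=0.07):
    """Joint contrastive objective over two input formats. Both encoders are
    assumed to consume the same augmented depth-map point clouds (the voxel
    encoder voxelizing internally), so the format change acts as one more
    augmentation. Within-format and across-format terms are summed with
    equal weight, an illustrative simplification."""
    p1, p2 = point_encoder(view1), point_encoder(view2)
    v1, v2 = voxel_encoder(view1), voxel_encoder(view2)

    loss = (
        # within-format: point vs. point, voxel vs. voxel
        instance_discrimination_loss(p1, p2.detach(), point_negatives, temperature)
        + instance_discrimination_loss(v1, v2.detach(), voxel_negatives, temperature)
        # across-format: point features must match voxel features of the same scene
        + instance_discrimination_loss(p1, v2.detach(), voxel_negatives, temperature)
        + instance_discrimination_loss(v1, p2.detach(), point_negatives, temperature)
    )
    return loss
```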
Results and Evaluation
Experimental evaluations demonstrate the efficacy of DepthContrast across a broad range of benchmarks, including object detection on SUN RGB-D and ScanNet, scene segmentation on S3DIS, and object classification on ModelNet. Particularly noteworthy is the state-of-the-art performance on the object detection benchmarks, where the pretrained models not only outperform models trained from scratch but also surpass models pretrained with supervision.
Quantitative results show significant gains in detection mean Average Precision (mAP), with the largest improvements on tasks with few annotated samples. DepthContrast is notably label-efficient, reaching strong results with substantially less annotated data, which matters in real-world settings where labeled 3D data is the bottleneck.
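The label-efficiency protocol amounts to fine-tuning the pretrained encoder on a small fraction of the labeled downstream data. The sketch below illustrates this for a classification task; the encoder interface, feature dimension, and training hyperparameters are assumptions chosen for clarity, not the paper's evaluation code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

def finetune_with_fraction(encoder, num_classes, dataset, fraction=0.1,
                           feat_dim=1024, epochs=20, lr=1e-3, device="cpu"):
    """Fine-tune a pretrained point-cloud encoder on a random labeled subset,
    mimicking a label-efficiency study. `encoder` is assumed to map point
    clouds to (B, feat_dim) features; `dataset` yields (points, label) pairs."""
    # keep only a fraction of the annotations to simulate the low-label regime
    n = max(1, int(len(dataset) * fraction))
    idx = torch.randperm(len(dataset))[:n].tolist()
    loader = DataLoader(Subset(dataset, idx), batch_size=32, shuffle=True)

    classifier = nn.Linear(feat_dim, num_classes)
    model = nn.Sequential(encoder, classifier).to(device)
    optim = torch.optim.Adam(model.parameters(), lr=lr)

    for _ in range(epochs):
        for points, labels in loader:
            points, labels = points.to(device), labels.to(device)
            loss = nn.functional.cross_entropy(model(points), labels)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model
```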
Implications and Speculation for the Future
The practical implications are considerable. By enabling effective pretraining in a label-scarce 3D setting, DepthContrast holds promise for accelerating progress in autonomous systems, robotics, and augmented reality, where 3D perception is paramount. The ability to exploit abundant unannotated 3D data for feature learning may shift more 3D tasks toward self-supervised paradigms, reducing dependence on costly and time-intensive labeling.
Theoretically, DepthContrast pushes the boundary of 3D representation learning by showing that a straightforward adaptation of 2D contrastive techniques can yield strong results in 3D. Its success paves the way for multi-modal self-supervised learning strategies that incorporate auxiliary signals such as text or sound to further enhance scene understanding.
In conclusion, this work represents a significant stride in 3D computer vision, offering a compelling case for the broader application of self-supervised techniques beyond traditional 2D domains. As models continue to scale both in data and architecture, methodologies like DepthContrast will be pivotal in ensuring these advances translate into practical, efficient learning frameworks that can drive innovation across various AI-driven fields.