Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders (2206.09900v7)

Published 20 Jun 2022 in cs.CV

Abstract: Current perception models in autonomous driving heavily rely on large-scale labelled 3D data, which is both costly and time-consuming to annotate. This work proposes a solution to reduce the dependence on labelled 3D training data by leveraging pre-training on large-scale unlabeled outdoor LiDAR point clouds using masked autoencoders (MAE). While existing masked point autoencoding methods mainly focus on small-scale indoor point clouds or pillar-based large-scale outdoor LiDAR data, our approach introduces a new self-supervised masked occupancy pre-training method called Occupancy-MAE, specifically designed for voxel-based large-scale outdoor LiDAR point clouds. Occupancy-MAE takes advantage of the gradually sparse voxel occupancy structure of outdoor LiDAR point clouds and incorporates a range-aware random masking strategy and a pretext task of occupancy prediction. By randomly masking voxels based on their distance to the LiDAR and predicting the masked occupancy structure of the entire 3D surrounding scene, Occupancy-MAE encourages the extraction of high-level semantic information to reconstruct the masked voxel using only a small number of visible voxels. Extensive experiments demonstrate the effectiveness of Occupancy-MAE across several downstream tasks. For 3D object detection, Occupancy-MAE reduces the labelled data required for car detection on the KITTI dataset by half and improves small object detection by approximately 2% in AP on the Waymo dataset. For 3D semantic segmentation, Occupancy-MAE outperforms training from scratch by around 2% in mIoU. For multi-object tracking, Occupancy-MAE enhances training from scratch by approximately 1% in terms of AMOTA and AMOTP. Codes are publicly available at https://github.com/chaytonmin/Occupancy-MAE.

References (74)

Authors (6)

Chen Min (17 papers)
Xinli Xu (17 papers)
Dawei Zhao (22 papers)
Liang Xiao (80 papers)
Yiming Nie (9 papers)
Bin Dai (60 papers)

Citations (39)

View on Semantic Scholar

Summary

An Expert Overview of "Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders"

The paper "Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders" introduces a significant advancement in self-supervised learning for LiDAR-based 3D perception, specifically designed for autonomous driving applications. The central proposition is the Occupancy-MAE, a novel self-supervised framework leveraging masked autoencoders to pre-train models on large-scale unlabeled outdoor LiDAR point clouds. This approach addresses the challenge of dependence on extensive labeled 3D datasets, which are costly and time-consuming to annotate.

In essence, the methodology introduced pivots around a masked autoencoding strategy, targeted particularly at the voxel-based representation of LiDAR data. The paper acknowledges the limitations of existing masked point autoencoding methodologies, which have predominantly catered to small-scale indoor point clouds or pillar-based representations. Occupancy-MAE differentiates itself by adopting a voxel-based approach, which more accurately reflects the sparse structural characteristics encountered in real-world LiDAR data used for autonomous vehicles.

Methodological Innovations

Occupancy-MAE introduces a self-supervised masked occupancy pre-training method that hinges on three key components:

Masked Autoencoders (MAE): The framework employs masked autoencoders as a pre-training strategy to extract high-level semantic information by reconstructing the masked occupancy structure of the LiDAR point clouds. This marks a departure from earlier methods that focused merely on reconstructing individual points.
Range-aware Random Masking Strategy: Traditional random masking approaches are transcended by a novel method that considers the varying density of LiDAR data points based on their distance from the sensor. This enhances training efficacy by optimizing the masking strategy based on spatial voxel sparsity.
3D Occupancy Prediction: Unlike existing methods that emphasize point regression, Occupancy-MAE introduces occupancy prediction as the pretext task. This task involves predicting the occupancy status of voxels, encouraging the learning of robust and representative features for 3D perception tasks.

Experimental Validation

The paper presents extensive experimental results across various datasets such as ONCE, KITTI, Waymo, and nuScenes. The results substantiate that Occupancy-MAE offers substantial performance improvements across several downstream tasks. Specifically, for 3D object detection, it significantly reduces the labeled data requirement for car detection in the KITTI dataset and improves small object detection accuracy in the Waymo dataset. It also consistently outperforms models trained from scratch in 3D semantic segmentation and multi-object tracking tasks, with evident improvements in metrics such as mIoU, AMOTA, and AMOTP.

Implications and Future Directions

The implications of Occupancy-MAE extend both practically and theoretically. Practically, the framework enhances the data efficiency of 3D perception models, a critical advancement considering the high cost associated with annotating large-scale 3D datasets. This capability is particularly significant for autonomous driving systems, where data scarcity and high annotation costs pose substantial barriers.

Theoretically, the introduction of occupancy prediction as a pretext task invites further exploration into how 3D structures are semantically understood by neural networks. The method’s ability to generalize across different downstream tasks and datasets suggests a robustness that could be foundational for future research in self-supervised learning paradigms, not merely limited to autonomous driving.

Moving forward, research could branch into pre-training with high-resolution data and understanding temporal sequences in LiDAR point clouds. Furthermore, adapting this methodology to incorporate dynamic scene understanding through multi-frame fusion and extending it across diversified large-scale datasets might unlock newer dimensions in autonomous perception.

The open availability of the Occupancy-MAE code promises to catalyze further research and application of this framework, potentially setting a new benchmark in the self-supervised learning methodology for LiDAR-based 3D perception.

PDF Markdown

GitHub

GitHub - chaytonmin/Occupancy-MAE: Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders (249 stars)