- The paper introduces a generative decoder that hierarchically reconstructs masked geometric details, replacing the complex, heuristic decoder designs of earlier 3D MAE frameworks.
- The study uses a Sparse Pyramid Transformer encoder to effectively capture multi-scale features from irregular, sparse LiDAR point clouds.
- Empirical results demonstrate state-of-the-art performance on the Waymo, KITTI, and ONCE benchmarks, with high detection accuracy retained even when only a fraction of the labels is used.
Overview of GD-MAE Paper
The research paper "GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds" addresses the challenges of applying Masked Autoencoders (MAE) to large-scale 3D point clouds captured by LiDAR sensors. The irregularity and sparsity of such point clouds pose difficulties that do not arise in the 2D image domain, where MAEs have already shown considerable promise. Unlike previous 3D MAE frameworks, which either rely on complex decoders to infer masked information from the preserved regions or on sophisticated masking strategies, the authors propose a simple yet effective paradigm built around a Generative Decoder for MAE (GD-MAE). The method improves performance across multiple benchmarks while remaining computationally efficient.
Core Contributions
- Generative Decoder Design: The novel aspect of GD-MAE lies in its generative decoder, which automatically merges surrounding context to reconstruct masked geometric information in a hierarchical manner. This removes the need for heuristic decoder designs and gives broad flexibility to experiment with different masking strategies. The decoder is also lightweight, costing less than 12% of the latency of the decoders used in conventional methods (a simplified, single-scale decoder sketch appears after this list).
- Sparse Pyramid Transformer (SPT) Encoder: To encode the hierarchical structure of 3D point clouds effectively, the authors introduce the Sparse Pyramid Transformer (SPT). This multi-scale encoder combines sparse convolution with transformer attention to aggregate context over large spatial areas, producing the latent representation used by downstream tasks such as 3D object detection (a simplified encoder sketch follows this list).
- Masked Autoencoder Strategy: GD-MAE masks a high ratio of the point-cloud tokens and trains the model to reconstruct the missing geometry. Because the decoder is generative, the approach adapts readily to different mask granularities, whether block-wise, patch-wise, or point-wise, which correspond to different levels of reconstruction difficulty (a minimal masking sketch is shown directly below).
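To make the masking step concrete, here is a minimal PyTorch sketch of patch-wise random masking over the occupied pillar tokens of one scene. The function name, tensor shapes, and the 0.75 default ratio are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def random_patch_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly hide a high ratio of occupied pillar/patch tokens.

    tokens: (N, C) features of the N occupied pillars in one scene.
    Returns the visible tokens together with the kept/masked indices so the
    decoder can later place its predictions back at the masked locations.
    """
    n = tokens.shape[0]
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    perm = torch.randperm(n, device=tokens.device)
    keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]
    return tokens[keep_idx], keep_idx, mask_idx
```

Block-wise or point-wise masking would replace the uniform `torch.randperm` selection with grouping by spatial blocks or by raw points, respectively.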
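Next, a rough sketch of the multi-scale encoding idea, with dense BEV tensors standing in for the sparse representation used in the paper: each stage downsamples the map and applies self-attention so that features aggregate context over a progressively larger area. The class names, channel widths, and use of dense convolutions with global attention are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SPTLiteStage(nn.Module):
    """One stage of a simplified pyramid encoder: a stride-2 convolution builds
    the next (coarser) scale, and self-attention over that scale's cells mixes
    context across the whole map. Dense ops stand in for sparse convolution."""

    def __init__(self, in_dim: int, out_dim: int, num_heads: int = 4):
        super().__init__()
        self.down = nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1)
        self.attn = nn.MultiheadAttention(out_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        x = self.down(x)                                   # coarser BEV map
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)              # residual + norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class SPTLiteEncoder(nn.Module):
    """Stack of stages returning a feature pyramid (one BEV map per scale)."""

    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList(
            [SPTLiteStage(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )

    def forward(self, x: torch.Tensor):
        pyramid = []
        for stage in self.stages:
            x = stage(x)
            pyramid.append(x)
        return pyramid
```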
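Finally, a hedged sketch of the generative-decoder idea: encoded features of the visible tokens are scattered onto a dense BEV grid, masked cells stay zero, a small convolutional stack propagates surrounding context into those cells, and a head regresses the masked geometry. The grid size, channel width, number of predicted points, and class name are assumptions; the actual decoder fuses context hierarchically across the encoder's scales, which this single-scale sketch omits.

```python
import torch
import torch.nn as nn

class GenerativeDecoderSketch(nn.Module):
    """Single-scale stand-in for a generative decoder: context from visible
    cells flows into masked cells through convolutions, and a linear head
    predicts the (x, y, z) coordinates of a few points per masked token."""

    def __init__(self, dim: int = 128, points_per_token: int = 3):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(dim, points_per_token * 3)

    def forward(self, visible_feats, visible_xy, masked_xy, grid_hw=(188, 188)):
        # visible_feats: (Nv, dim); visible_xy / masked_xy: (N, 2) integer grid coords
        h, w = grid_hw
        dim = visible_feats.shape[1]
        grid = visible_feats.new_zeros(1, dim, h, w)
        grid[0, :, visible_xy[:, 1], visible_xy[:, 0]] = visible_feats.t()
        grid = self.convs(grid)                                    # spread context
        masked_feats = grid[0, :, masked_xy[:, 1], masked_xy[:, 0]].t()
        return self.head(masked_feats)                             # (Nm, points*3)
```

A Chamfer-style distance between the predicted and the actual points inside each masked region would then serve as the reconstruction loss.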
Empirical Validation
The effectiveness of GD-MAE is demonstrated on several large-scale datasets, including Waymo, KITTI, and ONCE. Notable results include:
- Waymo Dataset: The approach reaches state-of-the-art detection accuracy, and the pre-trained detector achieves performance comparable to full-label training even when fine-tuned on only 20% of the labeled data, highlighting its label efficiency and ability to exploit unlabeled scans.
- ONCE and KITTI Datasets: Consistent improvements are observed across evaluation metrics, illustrating the generalization capacity and versatility of GD-MAE across different datasets and sensing environments.
Implications and Future Work
GD-MAE represents a meaningful contribution to the pre-training of 3D vision models, particularly in robotics and autonomous vehicle domains. By simplifying the decoder design and enhancing label efficiency, GD-MAE supports more scalable and adaptable self-supervised learning in 3D point cloud contexts.
Future research directions could explore the applicability of GD-MAE to other forms of sparse data and its potential integration with multi-modal systems combining camera and LiDAR inputs. Continued refinement of encoder-decoder architectures could also yield further computational benefits and accuracy gains across an even broader set of 3D detection and understanding tasks. The release of the GD-MAE codebase should facilitate these explorations, promoting further advances in the field.