- The paper introduces PillarNeSt, which adapts large-scale pre-trained ConvNets as 2D backbones to enhance pillar-based 3D object detection.
- The methodology fine-tunes dense ConvNets with large kernels and extra early-stage blocks to capture detailed features in sparse point clouds.
- Experimental results show PillarNeSt-Large achieves 66.9% mAP and 71.6% NDS on the nuScenes test set, outperforming existing 3D detectors.
Introduction
3D object detection is critical for autonomous vehicles and robotics, as it enables machines to understand their surroundings. Pillar-based 3D detectors are attractive in practice because of their efficiency and fast inference. However, existing methods often overlook the benefits of model scaling and pretraining established in the image domain, limiting their ability to detect objects in 3D point clouds.
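To make the pillar-based setting concrete, the sketch below shows a minimal, hypothetical pillarization step (not the paper's exact pipeline): raw points are scattered into a birds-eye-view grid of "pillars", yielding a dense 2D pseudo-image that a ConvNet backbone can consume. The grid ranges, pillar size, and mean-pooling of point features are illustrative assumptions.

```python
import numpy as np

def pillarize(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
              pillar_size=0.5, num_features=4):
    """points: (N, num_features) array with columns [x, y, z, intensity].

    Returns a (num_features, H, W) pseudo-image of mean-pooled point
    features per pillar, plus a per-pillar point count.
    """
    nx = int((x_range[1] - x_range[0]) / pillar_size)
    ny = int((y_range[1] - y_range[0]) / pillar_size)
    pseudo_image = np.zeros((num_features, ny, nx), dtype=np.float32)
    counts = np.zeros((ny, nx), dtype=np.int32)

    # Map each point to its pillar (grid cell) index; drop out-of-range points.
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    for p, cx, cy in zip(points[valid], ix[valid], iy[valid]):
        pseudo_image[:, cy, cx] += p      # accumulate point features
        counts[cy, cx] += 1

    nonzero = counts > 0
    pseudo_image[:, nonzero] /= counts[nonzero]  # mean-pool per pillar
    return pseudo_image, counts
```

The resulting pseudo-image is sparse (most pillars are empty), which is exactly why the backbone design choices discussed below matter.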
Scaling and Pretraining Adaptation
The research introduces dense ConvNets pre-trained on large-scale image datasets, such as ImageNet, as 2D backbones for pillar-based detectors. These ConvNets are then fine-tuned to handle the sparsity and irregularity characteristic of point clouds. The adapted 2D backbones are used within a new framework named PillarNeSt, which significantly outperforms existing 3D detectors on standard benchmarks. Notably, the paper demonstrates that 3D detection performance improves as the ConvNet backbone scales up.
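One practical wrinkle when reusing an ImageNet-pretrained backbone is that the first convolution expects 3 RGB channels, while a pseudo-image has C pillar-feature channels. The sketch below shows one common adaptation trick (an assumption for illustration, not necessarily the paper's exact scheme): average the pretrained kernel over its RGB channels, replicate it to C channels, and rescale to keep activation magnitudes comparable.

```python
import numpy as np

def adapt_first_conv(weight_rgb, new_in_channels):
    """Adapt a pretrained (out_ch, 3, k, k) RGB kernel to C input channels.

    Returns an (out_ch, new_in_channels, k, k) kernel whose response to a
    constant input matches the original 3-channel kernel's response.
    """
    mean_kernel = weight_rgb.mean(axis=1, keepdims=True)   # (out, 1, k, k)
    adapted = np.repeat(mean_kernel, new_in_channels, axis=1)
    # Rescale so the sum over input channels is preserved.
    adapted *= 3.0 / new_in_channels
    return adapted
```

The remaining layers of the pretrained ConvNet can typically be loaded unchanged, since only the input stem sees the modality switch.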
Model Architecture and Design Rules
The PillarNeSt architecture comes in several model sizes, from Tiny to Large, trading off performance against inference speed. Key design elements include large kernels for an expanded receptive field and more blocks in the earlier stages for detailed feature extraction, with no downsampling in the first stage so that fine-grained point cloud information is preserved. The paper also details a weight-initialization strategy for transferring ConvNets pre-trained on images to point cloud object detection.
Experimental Results and Effectiveness
PillarNeSt significantly surpasses existing methods on the nuScenes and Argoverse 2 datasets, with the larger model variants achieving the best results. For instance, without any additional test-time improvements, PillarNeSt-Large reaches 66.9% mAP and 71.6% NDS on the nuScenes test set, confirming the system's robustness. Extensive ablation studies confirm the benefit of the proposed backbone design rules, notably adding more blocks to the early stages and removing downsampling from the first stage, both of which help capture more detailed information.
Conclusions and Future Directions
The paper offers a novel and insightful perspective on applying backbone scaling and pretraining from the 2D image domain to boost 3D object detection in point clouds. PillarNeSt's success charts a promising path for future work, including more efficient handling of high-resolution pseudo-images and generative pretraining strategies for further fine-tuning point cloud backbones.