- The paper introduces PillarNeSt, which adapts large-scale pre-trained ConvNets as 2D backbones to enhance pillar-based 3D object detection.
- The methodology fine-tunes dense ConvNets with large kernels and extra early-stage blocks to capture detailed features in sparse point clouds.
- Experimental results show PillarNeSt-Large achieves 66.9% mAP and 71.6% NDS on the nuScenes test set, outperforming existing 3D detectors.
Introduction
3D object detection is critical for autonomous vehicles and robotics, as it enables machines to understand their surroundings. Pillar-based 3D detectors are attractive in practice because of their efficiency and fast inference. However, existing methods often overlook the benefits of model scaling and pretraining established in the image domain, limiting their ability to detect objects in 3D point clouds.
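To make the pillar-based setting concrete, the sketch below shows a minimal, hypothetical pillarization step (not the paper's exact pipeline): raw points are scattered into a birds-eye-view grid of "pillars", yielding a dense 2D pseudo-image that a ConvNet backbone can consume. The grid ranges, pillar size, and mean-pooling of point features are illustrative assumptions.

```python
import numpy as np

def pillarize(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
              pillar_size=0.5, num_features=4):
    """points: (N, num_features) array with columns [x, y, z, intensity].

    Returns a (num_features, H, W) pseudo-image of mean-pooled point
    features per pillar, plus a per-pillar point count.
    """
    nx = int((x_range[1] - x_range[0]) / pillar_size)
    ny = int((y_range[1] - y_range[0]) / pillar_size)
    pseudo_image = np.zeros((num_features, ny, nx), dtype=np.float32)
    counts = np.zeros((ny, nx), dtype=np.int32)

    # Map each point to its pillar (grid cell) index; drop out-of-range points.
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    for p, cx, cy in zip(points[valid], ix[valid], iy[valid]):
        pseudo_image[:, cy, cx] += p      # accumulate point features
        counts[cy, cx] += 1

    nonzero = counts > 0
    pseudo_image[:, nonzero] /= counts[nonzero]  # mean-pool per pillar
    return pseudo_image, counts
```

The resulting pseudo-image is sparse (most pillars are empty), which is exactly why the backbone design choices discussed below matter.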
Scaling and Pretraining Adaptation
The research introduces dense ConvNets pre-trained on large-scale image datasets, such as ImageNet, as 2D backbones for pillar-based detectors. These ConvNets are then fine-tuned to handle the sparsity and irregularity characteristic of point clouds. The adapted 2D backbones are used within a new framework named PillarNeSt, which significantly outperforms existing 3D detectors on standard benchmarks. Notably, the paper demonstrates that 3D detection performance improves as the ConvNet backbone scales up.
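One practical wrinkle when reusing an ImageNet-pretrained backbone is that the first convolution expects 3 RGB channels, while a pseudo-image has C pillar-feature channels. The sketch below shows one common adaptation trick (an assumption for illustration, not necessarily the paper's exact scheme): average the pretrained kernel over its RGB channels, replicate it to C channels, and rescale to keep activation magnitudes comparable.

```python
import numpy as np

def adapt_first_conv(weight_rgb, new_in_channels):
    """Adapt a pretrained (out_ch, 3, k, k) RGB kernel to C input channels.

    Returns an (out_ch, new_in_channels, k, k) kernel whose response to a
    constant input matches the original 3-channel kernel's response.
    """
    mean_kernel = weight_rgb.mean(axis=1, keepdims=True)   # (out, 1, k, k)
    adapted = np.repeat(mean_kernel, new_in_channels, axis=1)
    # Rescale so the sum over input channels is preserved.
    adapted *= 3.0 / new_in_channels
    return adapted
```

The remaining layers of the pretrained ConvNet can typically be loaded unchanged, since only the input stem sees the modality switch.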
Model Architecture and Design Rules
The PillarNeSt architecture comes in several model sizes, from Tiny to Large, trading off performance against inference speed. Key design elements include large kernels for an expanded receptive field and more blocks in the earlier stages for detailed feature extraction, with no downsampling in the first stage so that fine-grained point cloud information is preserved. The paper also details a weight-initialization strategy for transferring ConvNets pre-trained on images to point cloud object detection.
Experimental Results and Effectiveness
PillarNeSt significantly surpasses existing methods on the nuScenes and Argoverse 2 datasets, with the larger model variants achieving the best results. For instance, without any additional test-time improvements, PillarNeSt-Large reaches 66.9% mAP and 71.6% NDS on the nuScenes test set, confirming the system's robustness. Extensive ablation studies confirm the benefit of the proposed backbone design rules, notably adding more blocks to the early stages and removing downsampling from the first stage, both of which help capture more detailed information.
Conclusions and Future Directions
The paper offers a novel and insightful perspective on applying backbone scaling and pretraining from the 2D image domain to boost 3D object detection in point clouds. PillarNeSt's success charts a promising path for future work, including more efficient handling of high-resolution pseudo-images and generative pretraining strategies for further fine-tuning point cloud backbones.