- The paper introduces APRO3D-Net, a refinement framework that employs Vector Attention to enhance ROI feature extraction for 3D object detection.
- The approach replaces hand-crafted grid sampling with a data-driven method, achieving 84.85 AP (Car, moderate difficulty) on KITTI and 47.03 mAP on NuScenes.
- Experiments demonstrate significant improvements in detection accuracy and efficiency, offering practical benefits for autonomous vehicle perception.
An Overview of Attention-based Proposals Refinement for 3D Object Detection
The paper "Attention-based Proposals Refinement for 3D Object Detection" presents a refinement framework for 3D object detection built on attention mechanisms. Its core contribution is APRO3D-Net, a novel refinement stage that enhances voxel-based Region Proposal Networks (RPNs) by employing Vector Attention to compute Region of Interest (ROI) features.
To improve the trade-off between accuracy and efficiency, recent methods refine ROI feature extraction in state-of-the-art 3D object detection frameworks. The conventional approach divides each proposal into a grid, extracts features at each grid point, and aggregates them into the ROI features. Despite yielding high performance, this method relies on cumbersome hand-crafted components such as grid sampling and set abstraction, which require expert tuning.
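The grid sampling step that this conventional approach relies on can be sketched as follows; the function name and the 6×6×6 grid size are illustrative, not taken from the paper:

```python
import numpy as np

def grid_points(roi_center, roi_size, grid=6):
    # Hypothetical sketch of conventional grid sampling: divide a 3D
    # proposal box into a regular grid and return the grid-cell centers
    # at which features are later aggregated (e.g. via set abstraction).
    cx, cy, cz = roi_center
    dx, dy, dz = roi_size
    # Offsets of grid-cell centers, normalized to (-0.5, 0.5) per axis
    steps = (np.arange(grid) + 0.5) / grid - 0.5
    xs, ys, zs = np.meshgrid(steps * dx, steps * dy, steps * dz,
                             indexing="ij")
    return np.stack([cx + xs, cy + ys, cz + zs], axis=-1).reshape(-1, 3)

pts = grid_points((10.0, 2.0, -1.0), (4.0, 1.8, 1.6))
print(pts.shape)  # (216, 3) for a 6x6x6 grid
```

Every grid point is a hand-picked hyperparameter of this scheme, which is precisely what the paper's data-driven alternative sets out to remove.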
APRO3D-Net Architecture
APRO3D-Net takes a data-driven approach that reduces the dependency on hand-crafted elements by introducing Vector Attention, a variant of multi-head attention that assigns a distinct weight to each feature channel. This captures richer relationships between pooled points and their respective ROIs. The model achieves 84.85 AP on the class 'Car' at moderate difficulty on the KITTI dataset and 47.03 mAP averaged over ten classes on the NuScenes dataset, while keeping the parameter count modest and running at 15 FPS on an NVIDIA V100 GPU.
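A minimal NumPy sketch of the idea behind vector attention (not the authors' implementation; a fixed `np.tanh` stands in for the learned relation MLP):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vector_attention(q, k, v, relation):
    # q: (C,) query for one ROI; k, v: (N, C) pooled point features.
    # Scalar attention would give each point a single weight (q . k_i);
    # vector attention derives one weight per point AND per channel.
    w = relation(q[None, :] - k)   # (N, C) per-channel relation
    w = softmax(w, axis=0)         # normalize over the N points
    return (w * v).sum(axis=0)     # (C,) refined ROI feature

rng = np.random.default_rng(0)
out = vector_attention(rng.random(8), rng.random((5, 8)),
                       rng.random((5, 8)), np.tanh)
print(out.shape)  # (8,)
```

Because the weights vary per channel, each feature dimension can attend to a different subset of pooled points, which is the extra expressiveness the paper attributes to Vector Attention.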
The model architecture consists of a voxel-based RPN coupled with a refinement stage built from ROI Feature Encoder (RFE) modules. Its components are:
- 3D Backbone and RPN: Utilizing SECOND's architecture for feature extraction from voxelized LiDAR point clouds, which feeds into a 2D convolution-based RPN for the initial ROI proposal classification and regression tasks.
- ROI Feature Encoder: The core of the refinement stage comprising:
  - Feature Map Pooling: Converts backbone-produced feature maps into point-wise features, pooling them based on their locations relative to enlarged ROIs.
  - Position Encoding: Encodes position-related information, incorporating the geometric context of each ROI into the point features.
  - Attention Module: Uses Vector Attention to compute per-channel attention weights, dynamically refining the ROI features.
- Detection Heads: Map refined ROI features to output confidence scores and bounding box regression vectors.
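The data flow through one RFE step might be sketched as follows, assuming pooled point coordinates and features are already available; `rfe_forward` and its `(cx, cy, cz, dx, dy, dz)` box layout are hypothetical:

```python
import numpy as np

def rfe_forward(roi_box, point_xyz, point_feats, attention_fn):
    # roi_box: hypothetical (cx, cy, cz, dx, dy, dz) layout.
    center, size = roi_box[:3], roi_box[3:6]
    # Position encoding: express each pooled point relative to the ROI,
    # normalized by box size so the encoding reflects ROI geometry.
    rel = (point_xyz - center) / size                      # (N, 3)
    enriched = np.concatenate([point_feats, rel], axis=1)  # (N, C + 3)
    # The attention module aggregates the points into one ROI feature;
    # a plain mean is used below purely as a stand-in.
    return attention_fn(enriched)

feat = rfe_forward(np.array([10.0, 2.0, -1.0, 4.0, 1.8, 1.6]),
                   np.random.rand(32, 3), np.random.rand(32, 16),
                   lambda f: f.mean(axis=0))
print(feat.shape)  # (19,)
```

The refined ROI feature is then what the detection heads consume for confidence scoring and box regression.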
Experimental Evaluation and Performance
The proposed model is evaluated on two challenging benchmark datasets, KITTI and NuScenes. On KITTI, it achieves strong AP compared with contemporary methods while carrying a smaller parameter overhead. On NuScenes, it outperforms existing methods across multiple object classes, demonstrating its ability to handle varied object scales.
A series of ablation studies further elucidates how each design choice impacts overall performance. The use of Vector Attention notably enhances results over traditional scalar-based multi-head attention due to its ability to weigh the importance of each channel independently.
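The shape difference the ablation points to can be shown in a few lines; the plain element-wise difference here stands in for the paper's learned relation function:

```python
import numpy as np

N, C = 4, 8                  # pooled points, feature channels
rng = np.random.default_rng(0)
q = rng.random(C)            # ROI query
k = rng.random((N, C))       # pooled point keys

# Scalar (dot-product) attention: one weight per pooled point,
# shared across all C channels.
scalar_w = k @ q             # shape (N,)

# Vector attention: an independent weight per point and per channel.
vector_w = q[None, :] - k    # shape (N, C)

print(scalar_w.shape, vector_w.shape)  # (4,) (4, 8)
```

The extra C-fold granularity is what lets vector attention weigh each channel's importance independently, matching the ablation's finding.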
Implications and Future Directions
Introducing Vector Attention into the 3D object detection pipeline, together with a pooling strategy for multi-scale feature fusion, improves detection across varied classes and environments. Beyond its theoretical interest, the approach suggests practical gains for autonomous vehicle perception systems in both the speed and accuracy of localization and object recognition.
Future work could explore integrating additional sensor modalities such as cameras or radar with the APRO3D-Net framework to develop more robust 3D object detection capabilities. Furthermore, the adaptation of ROI features for tracking functionalities presents another promising trajectory for advancing autonomous perception systems.