- The paper introduces a one-stage keypoint detection network that predicts the nine projected keypoints of the 3D bounding box (eight vertices plus the center) for efficient monocular detection.
- The paper leverages geometric constraints with a refined energy function to accurately infer object dimensions, location, and orientation.
- The paper achieves real-time performance on the KITTI benchmark with minimal latency, presenting a viable alternative to LiDAR-based systems.
Systematic Evaluation of RTM3D: A Monocular 3D Object Detection Framework
This paper presents RTM3D, a framework for real-time monocular 3D object detection via keypoints, targeting autonomous-driving applications. The authors address the field's reliance on expensive LiDAR sensors by proposing an efficient, image-only solution. Unlike many prior monocular methods, the approach does not depend on a separate 2D bounding box detector, and it targets state-of-the-art performance on the KITTI benchmark.
Key Contributions
The main contributions of the paper include:
- One-Stage Keypoint Detection Network: The authors develop a one-stage fully convolutional network that predicts nine keypoints per object: the projected eight vertices and the center of the 3D bounding box. These nine 2D points supply 18 geometric constraints that help infer the object's 3D dimensions, location, and orientation (a minimal keypoint-head sketch follows this list).
- Geometric Constraint Utilization: By exploiting the relationship between the detected keypoints and the projection of the 3D box, RTM3D minimizes reprojection error through a refined energy function, a nonlinear least-squares problem solvable with Gauss-Newton or Levenberg-Marquardt methods (see the reprojection solver sketch after this list).
- Multi-Scale Keypoint Feature Pyramid: The proposed Keypoint Feature Pyramid Network (KFPN) addresses scale variation across objects and improves detection accuracy by incorporating multi-scale feature information into keypoint detection (a fusion sketch also follows this list).
- Real-Time Performance on KITTI: The framework is validated on the KITTI benchmark suite, showing not only competitive accuracy but also significant runtime efficiency compared to current image-based methods.
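To make the one-stage keypoint formulation concrete, below is a minimal PyTorch sketch of a heatmap-style keypoint head and a simple peak decoder. It is an illustrative reconstruction under assumptions, not the authors' implementation: the layer widths, the `KeypointHead` and `decode_peaks` names, and the omission of the offset, dimension, and orientation branches are all ours.

```python
# Minimal sketch (not the authors' code) of a one-stage keypoint head in the
# spirit of RTM3D: a shared backbone feature map is mapped to nine keypoint
# heatmaps (eight projected 3D box vertices + the projected 3D center).
# Layer widths and names are illustrative assumptions.
import torch
import torch.nn as nn


class KeypointHead(nn.Module):
    def __init__(self, in_channels: int = 64, num_keypoints: int = 9):
        super().__init__()
        # A light conv head per output map, as is common in one-stage keypoint
        # detectors; the real network also regresses offsets, dimensions, and
        # orientation, which are omitted here for brevity.
        self.heatmap = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_keypoints, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Sigmoid turns logits into per-pixel keypoint confidence maps.
        return torch.sigmoid(self.heatmap(features))


def decode_peaks(heatmaps: torch.Tensor, k: int = 1):
    """Pick the top-k peak per keypoint channel: (B, K, H, W) -> pixel coords."""
    b, kpts, h, w = heatmaps.shape
    flat = heatmaps.view(b, kpts, -1)
    scores, idx = flat.topk(k, dim=-1)
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    return torch.stack([xs, ys], dim=-1).float(), scores
```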
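The geometric step in the second contribution can be sketched as a small nonlinear least-squares problem: project a parameterized 3D box through the camera intrinsics and minimize its distance to the nine detected keypoints with Levenberg-Marquardt. The sketch below is a simplification under stated assumptions (the object dimensions are held fixed, the prior terms of the paper's energy function are dropped, and the corner ordering and helper names are hypothetical), not the paper's exact formulation.

```python
# Illustrative reprojection solver: recover 3D box location and yaw from the
# nine detected 2D keypoints by minimizing reprojection error (Levenberg-
# Marquardt). The keypoint-to-corner ordering is an assumption.
import numpy as np
from scipy.optimize import least_squares


def box_corners_and_center(location, dims, yaw):
    """Eight corners + center of a 3D box in camera coordinates (KITTI-style)."""
    h, w, l = dims
    x = np.array([ l,  l,  l,  l, -l, -l, -l, -l]) / 2.0
    y = np.array([ 0,  0, -h, -h,  0,  0, -h, -h])   # bottom face at y = 0
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    rot = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                    [ 0,           1, 0          ],
                    [-np.sin(yaw), 0, np.cos(yaw)]])
    loc = np.asarray(location).reshape(3, 1)
    corners = rot @ np.vstack([x, y, z]) + loc
    center = loc + np.array([[0.0], [-h / 2.0], [0.0]])
    return np.hstack([corners, center])               # shape (3, 9)


def project(points_3d, K):
    """Pinhole projection of (3, N) camera-frame points with intrinsics K."""
    uvw = K @ points_3d
    return uvw[:2] / uvw[2:3]                          # shape (2, N)


def reprojection_residual(params, keypoints_2d, dims, K):
    """18 residuals (nine 2D points) for 4 unknowns: x, y, z, yaw."""
    location, yaw = params[:3], params[3]
    pred = project(box_corners_and_center(location, dims, yaw), K)
    return (pred - keypoints_2d).ravel()


def solve_box(keypoints_2d, dims, K, init=(0.0, 1.5, 20.0, 0.0)):
    # Levenberg-Marquardt, as mentioned in the paper; dimensions come from the
    # network's regression branch and are held fixed to keep the sketch short.
    result = least_squares(reprojection_residual, np.array(init), method="lm",
                           args=(keypoints_2d, dims, K))
    return result.x
```

Because each 2D keypoint contributes two equations, the nine points give 18 constraints, which over-determines the four free parameters solved here and makes the fit robust to noisy detections.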
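For the keypoint feature pyramid, the core idea is to combine feature maps from several backbone scales with a per-location soft weighting rather than a hard choice of scale. The sketch below illustrates that idea; the channel counts, the resizing mode, and the `keypoint_feature_pyramid` name are illustrative assumptions rather than the authors' exact KFPN configuration.

```python
# Soft-weighted multi-scale fusion: resize feature maps from several scales to
# a common resolution, then weight them per location with a softmax over scales.
import torch
import torch.nn.functional as F


def keypoint_feature_pyramid(features):
    """features: list of (B, C, Hi, Wi) maps from different scales (same C).

    Returns a single (B, C, H0, W0) map at the finest resolution, where each
    location's contribution from each scale is weighted by a softmax over scales.
    """
    target_size = features[0].shape[-2:]
    resized = [F.interpolate(f, size=target_size, mode="bilinear",
                             align_corners=False) for f in features]
    stacked = torch.stack(resized, dim=0)          # (S, B, C, H, W)
    weights = torch.softmax(stacked, dim=0)        # soft scale-selection weights
    return (weights * stacked).sum(dim=0)


# Usage with dummy multi-scale features:
if __name__ == "__main__":
    feats = [torch.randn(1, 64, 96, 320),
             torch.randn(1, 64, 48, 160),
             torch.randn(1, 64, 24, 80)]
    fused = keypoint_feature_pyramid(feats)
    print(fused.shape)  # torch.Size([1, 64, 96, 320])
```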
Evaluation and Results
RTM3D strikes a strong balance between speed and accuracy. Using two backbone architectures, ResNet-18 and DLA-34, the method reports competitive 3D object detection results at low latency (as little as 0.035 seconds per image with the ResNet-18 backbone). Compared with prior image-based approaches:
- RTM3D's results on the 3D detection and Bird's Eye View benchmarks exceed those of competitors that rely on slower, multi-stage pipelines.
- Evaluation on the KITTI splits shows consistent performance across all difficulty settings (easy, moderate, hard).
Practical and Theoretical Implications
Practically, RTM3D's fast inference and modest hardware requirements make it a viable alternative to LiDAR-dependent systems, particularly for real-time use in self-driving vehicles. Theoretically, the approach offers insight into keypoint detection and the modeling of 3D geometric constraints without auxiliary data inputs such as depth maps or LiDAR.
Future Developments in AI
Moving forward, this research sets a promising precedent for autonomous navigation and real-time 3D perception from image input alone. Exploring adaptive learning techniques may further improve the robustness of monocular 3D detection across diverse environments and conditions.
In conclusion, RTM3D marks a significant stride in monocular 3D object detection, balancing computational efficiency with high accuracy and opening avenues for integration into real-world autonomous driving systems. The paper's methodical design and evaluation provide a strong baseline for subsequent research in cost-effective 3D perception.