- The paper introduces an early fusion framework that aligns camera semantics with radar data using a Soft Polar Association strategy.
- It employs a Spatio-Contextual Fusion Transformer with cross-attention layers to merge spatial and contextual features, significantly enhancing detection accuracy.
- The method achieves 41.1% mAP and 52.3% NDS on nuScenes, outperforming camera-only systems and rivaling LiDAR performance.
An Overview of CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer
The paper "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer" addresses the challenges and opportunities presented by the fusion of camera and radar data in the task of 3D object detection, a critical component for autonomous driving systems. This work introduces a novel proposal-level early fusion framework that effectively combines spatial and contextual information from camera and radar sensors to improve 3D object detection accuracy.
Motivation and Contributions
Cameras and radars offer distinct advantages and limitations for object detection. Cameras provide rich semantic information but struggle with accurate depth estimation, while radars measure range reliably and are resilient to adverse weather, yet suffer from low resolution and measurement ambiguities. Traditional fusion strategies, predominantly late-fusion approaches, have not fully leveraged these complementary strengths.
The paper proposes CRAFT, an early fusion strategy designed to exploit these complementary properties more effectively. By associating image proposals with radar points in the polar coordinate system, the framework mitigates the spatial discrepancies between camera and radar measurements, enabling fusion at the feature level rather than at the output level.
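To make the polar association concrete, the sketch below gates radar points around each image proposal using range and azimuth thresholds, with the range gate scaled by the camera's per-proposal depth uncertainty. It is a minimal illustration of the idea behind Soft Polar Association under stated assumptions, not the paper's implementation; the function and parameter names (`soft_polar_associate`, `depth_sigma`, `k_range`, `azimuth_margin`) and the specific gating rule are hypothetical.

```python
# Minimal sketch of polar association between camera proposals and radar
# points, in the spirit of CRAFT's Soft Polar Association. Names and the
# gating rule are illustrative assumptions, not the paper's code.
import numpy as np

def to_polar(xy):
    """Convert Cartesian (x, y) BEV coordinates to (range, azimuth)."""
    r = np.linalg.norm(xy, axis=-1)
    theta = np.arctan2(xy[..., 1], xy[..., 0])
    return r, theta

def soft_polar_associate(proposal_xy, depth_sigma, radar_xy,
                         k_range=2.0, azimuth_margin=np.deg2rad(5.0)):
    """Return, for each camera proposal, indices of radar points inside an
    adaptive polar gate around the proposal center.

    proposal_xy : (N, 2) estimated BEV centers of image proposals
    depth_sigma : (N,)   per-proposal depth uncertainty from the camera head
    radar_xy    : (M, 2) radar point positions in the same BEV frame
    """
    prop_r, prop_theta = to_polar(proposal_xy)
    rad_r, rad_theta = to_polar(radar_xy)

    associations = []
    for i in range(len(proposal_xy)):
        # Range gate widens with the camera's depth uncertainty (adaptive).
        range_ok = np.abs(rad_r - prop_r[i]) < k_range * depth_sigma[i]
        # Azimuth gate: wrap the angular difference to (-pi, pi].
        dtheta = np.angle(np.exp(1j * (rad_theta - prop_theta[i])))
        azimuth_ok = np.abs(dtheta) < azimuth_margin
        associations.append(np.where(range_ok & azimuth_ok)[0])
    return associations

# Toy usage: 2 proposals, 4 radar points.
proposals = np.array([[10.0, 0.0], [5.0, 5.0]])
sigmas = np.array([1.5, 0.8])
radar = np.array([[11.0, 0.2], [30.0, 1.0], [5.2, 5.1], [4.0, -4.0]])
print(soft_polar_associate(proposals, sigmas, radar))
```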
Notable contributions of the paper include:
- The Soft Polar Association (SPA) strategy, which queries radar points around each image proposal in the polar coordinate system using adaptive thresholds derived from the variance of the camera estimates (sketched above).
- The Spatio-Contextual Fusion Transformer (SCFT), whose cross-attention layers adaptively exchange spatial and contextual information between the two modalities (see the sketch after this list).
- Experimental results demonstrating significant improvements in mean average precision (mAP) and nuScenes detection score (NDS) over existing camera-only and competing camera-radar fusion approaches.
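The sketch below illustrates the kind of bidirectional cross-attention exchange the SCFT bullet describes: an image proposal feature attends to its associated radar point features to pick up spatial cues, and the radar features attend back to the image feature for semantic context. The module name, single-layer structure, and dimensions are assumptions made for illustration; CRAFT's actual transformer is more elaborate.

```python
# Minimal sketch of cross-modal feature exchange with cross-attention,
# in the spirit of CRAFT's Spatio-Contextual Fusion Transformer. Module
# name and layer layout are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """One round of bidirectional cross-attention between an image
    proposal feature and its associated radar point features."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Image query attends to radar points (spatial cues -> image feature).
        self.img_from_radar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Radar queries attend to the image feature (contextual cues -> radar points).
        self.radar_from_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_radar = nn.LayerNorm(dim)

    def forward(self, img_feat, radar_feats):
        """
        img_feat    : (B, 1, C)  feature of one image proposal per batch item
        radar_feats : (B, P, C)  features of the radar points associated to it
        """
        # Enrich the image proposal with spatial information from radar.
        img_upd, _ = self.img_from_radar(img_feat, radar_feats, radar_feats)
        img_feat = self.norm_img(img_feat + img_upd)
        # Enrich each radar point with semantic context from the image.
        radar_upd, _ = self.radar_from_img(radar_feats, img_feat, img_feat)
        radar_feats = self.norm_radar(radar_feats + radar_upd)
        return img_feat, radar_feats

# Toy usage: batch of 2 proposals, each with 6 associated radar points.
layer = CrossModalFusionLayer(dim=256)
img = torch.randn(2, 1, 256)
radar = torch.randn(2, 6, 256)
fused_img, fused_radar = layer(img, radar)
print(fused_img.shape, fused_radar.shape)  # (2, 1, 256) and (2, 6, 256)
```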
Key Results and Comparisons
CRAFT sets a new benchmark with 41.1% mAP and 52.3% NDS on the nuScenes dataset, an improvement of 8.7 mAP points and 10.8 NDS points over its camera-only baseline. This performance is comparable to that of LiDAR-based methods, suggesting a compelling case for camera-radar fusion as a cost-effective alternative to LiDAR in autonomous vehicles.
Implications and Future Directions
The implications of this research are twofold. Practically, the improved detection performance of CRAFT, achieved with minimal computation overhead, makes it suitable for real-world autonomous driving applications where cost and reliability are critical considerations. Theoretically, the work extends the application of attention-based architectures to multi-modal sensor fusion, revealing new avenues for future research in combining diverse sensor data.
As a future development, there is potential to further refine spatio-contextual fusion methods and to extend the approach to integrate additional sensors, such as LiDAR. Additionally, exploring dynamic scenes and incorporating temporal information could provide further gains in object detection robustness and accuracy.
Overall, this research contributes to the field of autonomous driving by advancing the state of the art in multi-sensor fusion, offering both a robust methodological framework and practical implementation for enhanced 3D object detection.