- The paper presents CMT, a transformer-based model that integrates image and LiDAR data as tokens, eliminating the need for explicit view transformations.
- It employs a Coordinates Encoding Module and position-guided queries, reaching 74.1% NDS on the nuScenes test set, a state-of-the-art result.
- Robustness is improved via masked-modal training, which preserves reliable performance even when a sensor modality fails.
Cross Modal Transformer: Towards Fast and Robust 3D Object Detection
The paper "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection" introduces the Cross Modal Transformer (CMT), a novel approach to 3D multi-modal detection. The research aims to create a streamlined end-to-end framework that integrates camera and LiDAR data without relying on explicit view transformations, thereby optimizing both accuracy and speed in object detection.
Key Contributions and Methodology
CMT distinguishes itself by treating both image and point cloud inputs as tokens in a transformer-based architecture. This approach contrasts with traditional methods that often require complex operations such as explicit view transformations or feature alignment between modalities. Instead, CMT employs a simple yet effective Coordinates Encoding Module (CEM) to inject 3D positional information into these tokens, so that image and point cloud tokens interact in a shared 3D positional space without explicit alignment.
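The sketch below illustrates one plausible way such a coordinates-encoding step could be realized: image tokens are tagged with 3D points sampled along each pixel's camera ray, and LiDAR BEV tokens with their BEV cell centers, both mapped to embeddings by small MLPs. Module names, shapes, and the exact encoding are assumptions for illustration, not the paper's reference implementation.

```python
# Hypothetical sketch of a coordinates-encoding step in the spirit of CMT's CEM.
import torch
import torch.nn as nn

class CoordsEncoding(nn.Module):
    def __init__(self, embed_dim=256, num_depths=64):
        super().__init__()
        self.num_depths = num_depths
        # Camera tokens: encode `num_depths` 3D points sampled along each pixel's ray.
        self.cam_mlp = nn.Sequential(
            nn.Linear(num_depths * 3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        # LiDAR BEV tokens: encode the 2D BEV cell center of each token.
        self.pts_mlp = nn.Sequential(
            nn.Linear(2, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def encode_camera(self, cam_tokens, rays_xyz):
        # cam_tokens: (B, N, C) flattened image features
        # rays_xyz:   (B, N, num_depths, 3) 3D points along each pixel ray, in the ego frame
        pe = self.cam_mlp(rays_xyz.flatten(-2))   # (B, N, C) positional embedding
        return cam_tokens + pe

    def encode_lidar(self, pts_tokens, bev_xy):
        # pts_tokens: (B, M, C) flattened BEV features
        # bev_xy:     (B, M, 2) normalized BEV cell centers in the ego frame
        pe = self.pts_mlp(bev_xy)                 # (B, M, C) positional embedding
        return pts_tokens + pe
```

Because both modalities receive embeddings derived from the same 3D coordinate frame, the decoder can relate a camera token and a LiDAR token that describe the same region of space without any explicit projection between views.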
The framework uses position-guided object queries, inspired by DETR (DEtection TRansformer), that interact directly with the multi-modal tokens during transformer decoding and predict 3D bounding boxes end to end. With this design, CMT reaches 74.1% NDS on the nuScenes test set, a state-of-the-art (SoTA) result, while maintaining competitive inference speed.
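The following sketch shows how position-guided queries and a DETR-style decoder might be wired together: learnable 3D anchor points are encoded into query embeddings, which then attend over the concatenated camera and LiDAR tokens. The anchor-point parameterization, head dimensions, and use of `nn.TransformerDecoder` are illustrative assumptions rather than the authors' exact architecture.

```python
# Illustrative sketch of position-guided queries feeding a DETR-style decoder.
import torch
import torch.nn as nn

class PositionGuidedDecoder(nn.Module):
    def __init__(self, embed_dim=256, num_queries=900, num_classes=10, num_layers=6):
        super().__init__()
        # Learnable 3D anchor points (normalized x, y, z) that "position-guide" each query.
        self.anchors = nn.Parameter(torch.rand(num_queries, 3))
        self.query_encoder = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.box_head = nn.Linear(embed_dim, 10)  # center (3), size (3), yaw sin/cos (2), velocity (2)

    def forward(self, cam_tokens, pts_tokens):
        # cam_tokens: (B, N, C), pts_tokens: (B, M, C) -- tokens already position-encoded
        memory = torch.cat([cam_tokens, pts_tokens], dim=1)        # (B, N+M, C)
        queries = self.query_encoder(self.anchors)                 # (Q, C)
        queries = queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(queries, memory)                         # (B, Q, C)
        return self.cls_head(hs), self.box_head(hs)
```

In this arrangement each query carries an explicit 3D prior, so cross-attention can focus on tokens from either modality that are positionally consistent with that prior, which is the intuition behind "position-guided" decoding.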
Numerical Results and Robustness
A significant aspect is the robustness of CMT under sensor failure, such as missing LiDAR data. In these scenarios CMT degrades gracefully, retaining accuracy comparable to leading vision-only 3D detectors. This robustness is reinforced by a masked-modal training strategy that randomly drops one modality during training, forcing the model to learn to detect from a single modality alone.
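A minimal sketch of such a masked-modal training step is given below: with some probability, one modality's tokens are suppressed for the current iteration so the remaining modality must carry the prediction. The drop probabilities and the zeroing mechanism are assumptions for illustration; the paper's exact masking scheme may differ.

```python
# Hypothetical masked-modal training step: suppress at most one modality per iteration.
import torch

def mask_modalities(cam_tokens, pts_tokens, p_drop_cam=0.25, p_drop_lidar=0.25):
    """Randomly drop one modality's tokens so the model trains single-modality robustness."""
    r = torch.rand(1).item()
    if r < p_drop_cam:
        cam_tokens = torch.zeros_like(cam_tokens)    # LiDAR-only step
    elif r < p_drop_cam + p_drop_lidar:
        pts_tokens = torch.zeros_like(pts_tokens)    # camera-only step
    return cam_tokens, pts_tokens
```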
Furthermore, the paper presents comparative studies showing that CMT consistently outperforms previous SoTA models, including BEVFusion and TransFusion, across its evaluation settings.
Implications for Future Research
The implications of this research are notable in both practical and theoretical contexts. Practically, CMT offers a simplified yet effective model for real-time applications in autonomous systems, particularly in scenarios where sensor reliability cannot be guaranteed. Theoretically, it provides a foundation for future exploration in end-to-end multi-modal integration using transformers, fostering potential advancements in 3D object detection systems.
Conclusion
In summary, the paper presents a significant advancement in 3D object detection methodologies by leveraging transformers to unify image and point cloud data effectively. CMT not only achieves impressive performance metrics but also enhances system robustness, setting a strong precedent for future research in similar domains. As AI continues to evolve, such integrative approaches will likely form the backbone of increasingly complex and multifaceted detection systems.