Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection (2303.11926v2)

Published 21 Mar 2023 in cs.CV

Abstract: In this paper, we propose a long-sequence modeling framework, named StreamPETR, for multi-view 3D object detection. Built upon the sparse query design in the PETR series, we systematically develop an object-centric temporal mechanism. The model is performed in an online manner and the long-term historical information is propagated through object queries frame by frame. Besides, we introduce a motion-aware layer normalization to model the movement of the objects. StreamPETR achieves significant performance improvements only with negligible computation cost, compared to the single-frame baseline. On the standard nuScenes benchmark, it is the first online multi-view method that achieves comparable performance (67.6% NDS & 65.3% AMOTA) with lidar-based methods. The lightweight version realizes 45.0% mAP and 31.7 FPS, outperforming the state-of-the-art method (SOLOFusion) by 2.3% mAP and 1.8x faster FPS. Code has been available at https://github.com/exiawsh/StreamPETR.git.

Citations (145)

Summary

The paper introduces StreamPETR, which uses object-centric temporal modeling to efficiently propagate object queries and integrate long-term historical data.
It introduces a motion-aware layer normalization (MLN) to model object movement, improving detection accuracy while adding little computation.
Experiments on the nuScenes benchmark show strong performance with 67.6% NDS, 65.3% AMOTA, and a lightweight version achieving 45.0% mAP at 31.7 FPS.

Object-Centric Temporal Modeling for 3D Object Detection

The paper presents StreamPETR, an object-centric temporal modeling framework for multi-view 3D object detection. It builds on the sparse query design of the PETR series and adds long-sequence temporal modeling at negligible extra computation cost relative to the single-frame baseline.

Framework Overview

StreamPETR runs online: long-term historical information is propagated through object queries, frame by frame. The core change is an object-centric temporal mechanism in place of bird's-eye-view (BEV) or perspective-view temporal modeling. Carrying temporal information in the queries helps when detecting occluded objects and following moving targets.

Methodological Innovations

The paper introduces a motion-aware layer normalization (MLN) to model object movement. MLN helps decouple the motion of the ego vehicle from that of surrounding objects, improving accuracy without a substantial computational burden.
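One common way to make a normalization layer depend on side information is to predict its scale and shift from that information. The sketch below assumes MLN takes this conditional form, which this summary does not spell out. The function names, the two-element `motion` vector, and the weight matrices are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           inp: Sequence[float]) -> List[float]:
    """Plain affine map: one output per weight row."""
    return [sum(w * i for w, i in zip(row, inp)) + b
            for row, b in zip(weights, bias)]


def motion_aware_layer_norm(query, motion, w_gamma, b_gamma, w_beta, b_beta):
    """Conditional layer norm (assumed form): scale and shift come from `motion`.

    `query`  : feature vector of one object query.
    `motion` : hypothetical motion descriptor (e.g. velocity, time gap).
    """
    normed = layer_norm(query)
    gamma = linear(w_gamma, b_gamma, motion)  # per-channel scale
    beta = linear(w_beta, b_beta, motion)     # per-channel shift
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]


if __name__ == "__main__":
    query = [1.0, 2.0, 3.0, 4.0]
    motion = [0.5, -0.2]  # made-up values for the demo
    w_gamma = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.0], [0.0, 0.2]]
    w_beta = [[0.0, 0.3], [0.3, 0.0], [0.0, 0.1], [0.1, 0.0]]
    out = motion_aware_layer_norm(query, motion,
                                  w_gamma, [1.0] * 4, w_beta, [0.0] * 4)
    print(out)
```

With all-zero weights, a scale bias of 1, and a shift bias of 0, the function reduces to ordinary layer normalization, so the motion input only modulates the normalized features.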

Key elements of the proposed method include:

Object Queries: These serve as the hidden states for temporal propagation, allowing for efficient modeling of moving objects.
Memory Queue: A strategically designed queue facilitates the recurrent update of object queries, ensuring sustained temporal interaction.
Propagation Transformer: This component includes temporal and spatial interaction mechanisms, further refined by the MLN to address motion dynamics effectively.
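The recurrent update described above can be pictured as a first-in, first-out buffer of object queries. The following is a minimal sketch under stated assumptions, not the authors' implementation: `ObjectQuery`, `QueryMemory`, `max_frames`, and `top_k` are hypothetical names, and keeping the highest-scoring queries from each frame is an assumed selection rule.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class ObjectQuery:
    embedding: List[float]  # hidden state carried across frames
    score: float            # detection confidence for this query
    frame: int              # index of the frame that produced it


class QueryMemory:
    """FIFO memory holding the best queries of the most recent frames."""

    def __init__(self, max_frames: int, top_k: int) -> None:
        self.top_k = top_k
        # deque(maxlen=...) silently drops the oldest frame when full
        self.frames: Deque[List[ObjectQuery]] = deque(maxlen=max_frames)

    def push(self, queries: List[ObjectQuery]) -> None:
        """Store the top_k highest-scoring queries of the current frame."""
        kept = sorted(queries, key=lambda q: q.score, reverse=True)[: self.top_k]
        self.frames.append(kept)

    def context(self) -> List[ObjectQuery]:
        """Flatten stored queries, oldest frame first, for temporal attention."""
        return [q for frame in self.frames for q in frame]


if __name__ == "__main__":
    memory = QueryMemory(max_frames=2, top_k=2)
    for t in range(4):  # simulate an online stream of four frames
        detections = [ObjectQuery([float(t)], s, t) for s in (0.1, 0.9, 0.5)]
        history = memory.context()  # what the current frame would attend to
        memory.push(detections)
        print(f"frame {t}: {len(history)} historical queries available")
```

Because the buffer has a fixed size, the per-frame cost of attending to history stays constant no matter how long the sequence runs, which matches the online, frame-by-frame propagation the summary describes.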

Experimental Results

StreamPETR is evaluated on the nuScenes benchmark. It is the first online multi-view method to achieve performance comparable to lidar-based methods, with an NDS of 67.6% and an AMOTA of 65.3%. A lightweight version reaches 45.0% mAP at 31.7 FPS, outperforming the state-of-the-art SOLOFusion by 2.3% mAP while running 1.8x faster.

Implications and Future Directions

StreamPETR represents a significant stride in 3D object detection, particularly for applications in autonomous driving. The object-centric perspective, combined with efficient temporal interaction mechanisms, reduces computational loads, thus offering a scalable and robust solution for real-time applications.

The paper also leaves open questions for camera-based perception research. Modeling fine-grained motion dynamics and integrating temporal data without sacrificing speed remain areas for exploration. Further work could test other architectures to see whether these findings generalize across datasets and scenarios.

Conclusion

StreamPETR highlights the efficacy of object-centric temporal modeling in 3D object detection, providing a pragmatic balance between accuracy and computational efficiency. This work substantially contributes to advancing the capabilities of camera-based perception systems in dynamic environments. As this domain evolves, the insights from this research could inspire further innovation in AI-driven detection frameworks.

GitHub

GitHub - exiawsh/StreamPETR: [ICCV 2023] StreamPETR: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection (558 stars)