- The paper introduces a unified framework that integrates temporal modeling and data-adaptive 3D position encoding to improve multi-camera 3D perception.
- The paper demonstrates that task-specific queries in a transformer decoder help achieve state-of-the-art performance in 3D object detection, BEV segmentation, and lane detection.
- The paper evaluates PETRv2's robustness under camera extrinsics noise, camera miss, and camera time delay, demonstrating resilient performance for autonomous driving applications.
PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images
The paper introduces PETRv2, a unified framework for 3D perception from multi-view images. PETRv2 builds upon PETR, extending it with temporal modeling and multi-task learning, including BEV segmentation and 3D lane detection.
Key Features and Methodology
- Temporal Modeling: PETRv2 incorporates temporal modeling by aligning 3D position embeddings (3D PE) across frames. The 3D coordinates generated for the previous frame are transformed into the current frame's coordinate system via the relative ego pose, which improves spatial localization and motion (velocity) estimation (see the coordinate-alignment sketch after this list).
- Multi-task Learning: PETRv2 tackles multi-task learning with task-specific queries: detection queries, segmentation queries, and lane queries, each initialized from anchor points in the space where its task is defined (e.g., segmentation queries from fixed anchor points in BEV space). All queries interact with the multi-view image features in a transformer decoder, whose outputs feed task-specific prediction heads (see the shared-decoder sketch below).
- Feature-guided Position Encoder (FPE): The FPE makes 3D PE generation data-adaptive: attention weights derived from the 2D image features reweight the position embedding, injecting vision-driven priors (e.g., implicit depth cues) into the embeddings (see the FPE sketch below).
- Robustness Analysis: The model is evaluated under camera extrinsics noise, camera miss, and camera time delay. This analysis probes the framework's reliability under real-world sensor errors and imperfections (see the noise-injection sketch below).
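To make the 3D coordinates alignment concrete, the sketch below (in PyTorch, with illustrative names, not the authors' code) transforms points sampled in the previous frame's ego coordinates into the current frame's coordinates using the two frames' ego poses:

```python
import torch

def align_previous_points(points_prev, ego2global_prev, ego2global_curr):
    """Transform 3D points from the previous frame's ego coordinates
    into the current frame's ego coordinates via the relative ego pose.

    points_prev:  (..., 3) 3D coordinates in the previous ego frame
    ego2global_*: (4, 4) homogeneous ego-pose matrices
    """
    # Relative pose: previous ego frame -> current ego frame
    prev2curr = torch.linalg.inv(ego2global_curr) @ ego2global_prev

    # Lift to homogeneous coordinates and apply the transform
    ones = torch.ones(*points_prev.shape[:-1], 1, dtype=points_prev.dtype)
    homo = torch.cat([points_prev, ones], dim=-1)   # (..., 4)
    aligned = homo @ prev2curr.T                    # (..., 4)
    return aligned[..., :3]
```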
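The multi-task decoding pattern can be sketched as follows: task-specific query sets are concatenated, decoded jointly against the multi-view image tokens, and split back out for their respective heads. Query counts, dimensions, and the use of PyTorch's stock `nn.TransformerDecoder` are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

embed_dim, num_det, num_seg, num_lane = 256, 900, 256, 100

# Task-specific learnable queries (counts are illustrative)
det_q = nn.Parameter(torch.randn(num_det, embed_dim))
seg_q = nn.Parameter(torch.randn(num_seg, embed_dim))
lane_q = nn.Parameter(torch.randn(num_lane, embed_dim))

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True),
    num_layers=6,
)

# Flattened multi-view image features (6 cameras, 16x44 feature map here)
img_tokens = torch.randn(1, 6 * 16 * 44, embed_dim)

queries = torch.cat([det_q, seg_q, lane_q]).unsqueeze(0)
out = decoder(queries, img_tokens)  # one decoder pass for all tasks

# Split updated queries and route them to task-specific heads
det_out, seg_out, lane_out = out.split([num_det, num_seg, num_lane], dim=1)
```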
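A minimal sketch of the feature-guided idea follows: the 2D image features produce sigmoid attention weights that gate an embedding derived from the 3D coordinates. The layer shapes and two-layer projections are assumptions; the paper's exact encoder may differ.

```python
import torch
import torch.nn as nn

class FeatureGuidedPE(nn.Module):
    """Data-adaptive 3D position encoding: 2D features gate the 3D PE."""

    def __init__(self, feat_dim=256, coord_dim=192, embed_dim=256):
        super().__init__()
        # Projects per-pixel 3D frustum coordinates to an embedding
        self.coord_mlp = nn.Sequential(
            nn.Linear(coord_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Vision-driven gates computed from the 2D image features
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.Sigmoid(),
        )

    def forward(self, img_feat, coords_3d):
        # img_feat: (N, HW, feat_dim); coords_3d: (N, HW, coord_dim)
        pe = self.coord_mlp(coords_3d)   # coordinate-only 3D PE
        weights = self.gate(img_feat)    # learned attention weights in (0, 1)
        return pe * weights              # data-adaptive 3D PE
```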
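The extrinsics-noise test can be approximated by composing each camera extrinsic with a small random rotation before evaluation. The sketch below perturbs about a single axis for brevity; the paper's exact noise model (axes, magnitudes) may differ.

```python
import torch

def perturb_extrinsic(extrinsic, max_deg=2.0):
    """Compose a camera extrinsic (4x4) with a small random z-axis
    rotation to simulate calibration noise (magnitude is illustrative)."""
    angle = torch.deg2rad((torch.rand(()) * 2 - 1) * max_deg)
    c, s = torch.cos(angle), torch.sin(angle)
    noise = torch.eye(4)
    noise[0, 0], noise[0, 1] = c, -s
    noise[1, 0], noise[1, 1] = s, c
    return noise @ extrinsic
```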
Performance Evaluation
PETRv2 demonstrates state-of-the-art performance in 3D object detection, BEV segmentation, and 3D lane detection, as evaluated on the nuScenes and OpenLane datasets. Notably, the incorporation of temporal modeling significantly reduces velocity error, an aspect critical for applications in autonomous driving.
- 3D Object Detection: Improves NDS (nuScenes Detection Score) and mAP over the PETR baseline through temporal alignment.
- BEV Segmentation: Performs strongly on segmentation, with notable gains in the drivable-area, lane, and vehicle categories.
- 3D Lane Detection: Surpasses previous models in F1-score and category-specific accuracy, indicating robust lane recognition capabilities.
Implications and Future Work
The paper positions PETRv2 as a robust, unified framework that pushes the boundaries of 3D perception systems. Its implications are significant for autonomous driving, where comprehensive and accurate perception of the environment is paramount.
Future work could explore large-scale pretraining, integration of additional 3D vision tasks, and multi-modal fusion strategies to further strengthen autonomous systems. PETRv2's flexible architecture also suggests adaptability to applications beyond autonomous driving, laying a foundation for future 3D perception research.