- The paper introduces V2X-ViT, a unified Vision Transformer framework that integrates heterogeneous V2X data for improved 3D object detection.
- It employs a novel heterogeneous multi-agent self-attention (HMSA) module and multi-scale window attention to adaptively fuse multi-agent signals and handle spatial misalignment.
- The work presents the V2XSet dataset for evaluation under realistic noise and demonstrates significant performance gains, enhancing AV safety and efficiency.
Overview of "V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer"
The paper presents a novel framework for enhancing the perception capabilities of autonomous vehicles (AVs) through Vehicle-to-Everything (V2X) communication, built on a Vision Transformer. The proposed framework, called V2X-ViT, improves 3D object detection by effectively integrating information from multiple on-road agents, including both vehicles and roadside infrastructure. This integration addresses several challenges inherent in V2X systems: asynchronous information sharing, pose localization errors, and heterogeneity among V2X components.
Technical Contributions
- Heterogeneous Multi-Agent Self-Attention Module (HMSA): This module handles the heterogeneity of V2X systems by explicitly modeling the relationships between different types of agents. It enables adaptive information fusion through type-dependent attention parameters that distinguish vehicle-to-vehicle, vehicle-to-infrastructure, infrastructure-to-vehicle, and infrastructure-to-infrastructure interactions (a minimal sketch follows this list).
- Multi-Scale Window Attention (MSwin): To address spatial misalignment and localization errors, MSwin captures local and global spatial feature interactions by attending within windows of several sizes in parallel: small windows preserve fine-grained detail, while large windows tolerate the coarse displacements common in V2X systems (see the second sketch below).
- Unified Transformer Architecture: The integration of HMSA and MSwin into a single Vision Transformer framework allows end-to-end learning, efficiently handling the complexities and noise associated with cooperative perception tasks.
- Dataset Creation (V2XSet): A new large-scale simulation dataset, V2XSet, was constructed with CARLA and OpenCDA; it includes realistic noise models such as pose inaccuracies and time delays (see the noise-injection sketch below). The dataset supports broad evaluation of cooperative perception systems across varied conditions and road types.
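The mechanics of HMSA are easiest to see in code. Below is a minimal PyTorch sketch of type-aware attention over per-agent feature vectors, assuming two agent types (vehicle and infrastructure). The class name, parameter shapes, and relation-matrix scoring are illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroMultiAgentAttention(nn.Module):
    """Attention whose parameters depend on sender/receiver agent types."""

    def __init__(self, dim: int, num_types: int = 2):
        super().__init__()
        self.num_types = num_types
        # Node-type-dependent projections: one Q/K/V per agent type.
        self.q = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_types)])
        self.k = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_types)])
        self.v = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_types)])
        # Edge-type-dependent relation matrices, one per (sender, receiver)
        # pair: v->v, v->i, i->v, i->i. Initialized to identity.
        self.rel = nn.Parameter(
            torch.stack([torch.eye(dim) for _ in range(num_types ** 2)]))

    def forward(self, feats: torch.Tensor, types: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim), one feature vector per agent; types: (N,) in {0, 1}.
        N, d = feats.shape
        q = torch.stack([self.q[int(t)](feats[i]) for i, t in enumerate(types)])
        k = torch.stack([self.k[int(t)](feats[i]) for i, t in enumerate(types)])
        v = torch.stack([self.v[int(t)](feats[i]) for i, t in enumerate(types)])
        # Score each (receiver i, sender j) pair through its relation matrix,
        # so infrastructure->vehicle messages are weighted differently from
        # vehicle->vehicle ones.
        scores = feats.new_empty(N, N)
        for i in range(N):
            for j in range(N):
                pair = int(types[j]) * self.num_types + int(types[i])
                scores[i, j] = q[i] @ self.rel[pair] @ k[j] / d ** 0.5
        attn = F.softmax(scores, dim=-1)
        return attn @ v  # (N, dim): fused per-agent features
```

A single-head, loop-based formulation like this is deliberately slow, but it makes the key idea visible: every projection and score is conditioned on agent type rather than shared across all agents.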
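MSwin can be sketched in the same spirit. The code below is a hedged simplification: it attends within non-overlapping square windows at several scales over a bird's-eye-view (BEV) feature map, assumes H and W are divisible by every window size, and fuses branches by averaging rather than the paper's split-head scheme.

```python
import torch
import torch.nn as nn

def window_attention(x: torch.Tensor, attn: nn.MultiheadAttention,
                     win: int) -> torch.Tensor:
    """Self-attention within non-overlapping win x win windows; x: (B, H, W, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)
    out, _ = attn(x, x, x)  # attend among the win*win cells of each window
    out = out.view(B, H // win, W // win, win, win, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

class MultiScaleWindowAttention(nn.Module):
    """Parallel window attention at several scales, fused by averaging."""

    def __init__(self, dim: int, window_sizes=(4, 8, 16), heads: int = 4):
        super().__init__()
        self.window_sizes = window_sizes
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in window_sizes])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Small windows capture fine local structure; large windows give the
        # attention enough range to re-associate features displaced by pose
        # error, which is what buys robustness to spatial misalignment.
        branches = [window_attention(x, attn, w)
                    for attn, w in zip(self.attns, self.window_sizes)]
        return torch.stack(branches).mean(dim=0)
```

With `window_sizes=(4, 8, 16)` on a 64x64 BEV map, the largest branch spans a 16x16 neighborhood, enough to absorb several cells of positional drift that would defeat a single small-window design.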
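Finally, to make the V2XSet noise settings concrete, here is a small Python sketch of the two perturbations mentioned above: Gaussian error on shared poses and an asynchronous transmission delay. The standard deviations and delay are placeholder values, not the dataset's actual configuration.

```python
import random

def add_pose_noise(pose, pos_std=0.2, yaw_std=0.2):
    """Perturb a shared pose (x, y, yaw_deg) with Gaussian localization error."""
    x, y, yaw = pose
    return (x + random.gauss(0.0, pos_std),
            y + random.gauss(0.0, pos_std),
            yaw + random.gauss(0.0, yaw_std))

def delayed_frame(frames, t, delay_ms=100, frame_ms=100):
    """Return the frame the ego vehicle would actually hold at time step t,
    i.e. the newest frame whose transmission delay has already elapsed."""
    lag = delay_ms // frame_ms
    return frames[max(0, t - lag)]

# Example: the ego fuses a collaborator's pose and frame as received,
# not as they currently are.
noisy_pose = add_pose_noise((12.0, -3.5, 90.0))
stale = delayed_frame(["f0", "f1", "f2", "f3"], t=3)  # -> "f2" with 100 ms lag
```

Evaluating detectors against such perturbed inputs, rather than perfectly synchronized ground truth, is what separates the noisy setting from the perfect setting in the experiments below.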
Quantitative and Qualitative Results
The experimental results demonstrate that V2X-ViT sets a new standard in V2X perception, with strong 3D object detection performance under both perfect and noisy conditions. V2X-ViT improved average precision (AP) by 21.2% over single-agent approaches and outperformed state-of-the-art intermediate fusion methods by at least 7.3%. Its robustness to localization errors and time delays was also validated: under adverse conditions, its performance degraded significantly less than that of traditional early and late fusion methods.
Implications and Future Directions
Practically, the adoption of V2X-ViT could significantly enhance AV safety and efficiency, given its superior ability to accurately detect obstacles by integrating more comprehensive environmental data from various sources. Theoretically, V2X-ViT contributes to the understanding of heterogeneity in cooperative perception networks and offers a scalable solution for real-world deployment.
Looking ahead, V2X-ViT could be extended to support multi-sensor data fusion, combining vision, LiDAR, radar, and other sensory inputs into a more holistic perception model. Future work could also examine its performance and generalization in real-world decentralized systems, optimize bandwidth usage, and secure data sharing against privacy and adversarial threats.
Such enhancements will be critical as the field progresses towards fully autonomous driving systems capable of navigating complex and dynamic real-world environments. The focus on overcoming practical deployment challenges while maintaining high performance positions V2X-ViT as a significant step forward in the development of robust cooperative perception frameworks.