End-to-End V2X Cooperation Challenge

Updated 5 March 2026

End-to-end V2X cooperation is a framework unifying sensor fusion, communication constraints, and multi-agent planning for robust vehicle control.
It addresses challenges such as heterogeneous sensors, NLOS effects, dynamic agent interactions, and stringent latency and bandwidth requirements.
Current approaches leverage multi-modal fusion, relay selection, and differentiable architectures to reduce collision rates and improve detection accuracy.

End-to-end V2X cooperation refers to the challenge of jointly designing perception, communication, and planning systems across multiple agents (vehicles and infrastructure) so that information can flow efficiently, robustly, and with strict latency/reliability guarantees, all the way from raw sensor signals to vehicle control. This challenge subsumes classic modular problems—communication reliability, cooperative sensor fusion, agent selection, multi-agent planning—and integrates them in a fully differentiable or jointly optimized framework, exposing new theoretical, methodological, and system-level obstacles. The difficulty is compounded by heterogeneity in sensors/modalities, partial observability (e.g., NLOS), dynamic agents, bandwidth constraints, and varying downstream task requirements. Rigorous approaches to the end-to-end V2X cooperation problem now span formulation of new metrics, network and system architectures, joint optimization strategies, and real-world benchmarks.

1. Problem Definition and System Scope

End-to-end V2X cooperation is operationalized as the problem of producing optimal, safe, and robust decisions (e.g., a future trajectory $\mathbf{Y}$ or low-level control $\mathbf{A}_t$) for an ego vehicle, conditioned on multi-agent, multi-modal observations communicated over V2X links under real-world constraints. Concretely, the functional mapping can be formalized as $\mathbf{Y} = f(\{\mathbf{X}_{\rm ego},\, \mathbf{X}_{\rm infra},\, \mathbf{X}_{\rm veh},\, \mathbf{E}\};\, \theta)$, where $\mathbf{X}_{\rm ego}$, $\mathbf{X}_{\rm infra}$, and $\mathbf{X}_{\rm veh}$ denote the ego-vehicle, infrastructure, and other-vehicle sensor inputs (images, LiDAR, state), $\mathbf{E}$ includes optional semantic priors or prompts, and $\theta$ denotes the learned model parameters (You et al., 26 Jun 2025, Yu et al., 2024).

Frameworks for end-to-end V2X must specify:

Input modalities (multi-view images, LiDAR, HD maps, semantic prompts) and their synchronization across agents
Communication architecture and constraints (latency, bandwidth, error/loss, NLOS/blockage)
Fusion strategy (feature-level, instance-level, query-based, temporal), including delay handling and domain alignment
Downstream outputs (perception—3D object and semantic occupancy detection; prediction—future trajectories; planning—discrete/high-level or continuous trajectory/control sequence)
Loss functions and multi-task training, allowing gradient (or reinforcement) flow across the entire stack
Joint or modular optimization of fusion and planning, ideally with interpretable intermediate representations (Song et al., 12 Nov 2025, Liu et al., 2024).

2. Sensing, Communication, and NLOS Constraints

Sensor fusion in V2X cooperation is particularly challenged by NLOS (non-line-of-sight) effects at both the perception and the communication levels. Blind zones appear in Abstract Perception Matrices (APMs) as cells with low observed LiDAR return, reflecting occlusion by static infrastructure or dynamic obstacles (Li et al., 2023). V2X communication is itself degraded by such occluders: knife-edge diffraction models reveal that a single tall vehicle can induce an attenuation of 15–25 dB, sharply decreasing Packet Reception Rate (PRR) from ≈90% to below 40%, with a direct impact on the utility of shared data and the success of downstream cooperative fusion.
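As a rough illustration of the knife-edge model mentioned above, the following sketch evaluates single-edge diffraction loss with the widely used ITU-R P.526 approximation. The function names, the 5.9 GHz carrier, and the example geometry are assumptions chosen for illustration; they are not taken from Li et al. (2023).

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def fresnel_parameter(h_excess_m, d1_m, d2_m, freq_hz):
    """Fresnel-Kirchhoff parameter nu for an obstacle whose top lies
    h_excess_m above the direct TX-RX line, at distances d1_m and d2_m
    from the transmitter and receiver respectively."""
    wavelength = SPEED_OF_LIGHT / freq_hz
    return h_excess_m * math.sqrt(2.0 / wavelength * (1.0 / d1_m + 1.0 / d2_m))


def knife_edge_loss_db(nu):
    """Single knife-edge diffraction loss in dB (ITU-R P.526 approximation).

    The approximation is valid for nu > -0.78; below that the obstacle is
    far enough from the path that the loss is treated as 0 dB.
    """
    if nu <= -0.78:
        return 0.0
    return 6.9 + 20.0 * math.log10(math.sqrt((nu - 0.1) ** 2 + 1.0) + nu - 0.1)


# Hypothetical geometry: a vehicle roof 1 m above the line of sight,
# midway on a 20 m link at 5.9 GHz (assumed carrier frequency).
nu = fresnel_parameter(h_excess_m=1.0, d1_m=10.0, d2_m=10.0, freq_hz=5.9e9)
loss_db = knife_edge_loss_db(nu)
print(f"nu = {nu:.2f}, diffraction loss = {loss_db:.1f} dB")
```

Under these assumed values the loss comes out at about 22 dB, which sits inside the 15–25 dB range quoted above. Taller obstructions or shorter links push the loss higher.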

To mitigate these challenges, mobility-height hybrid relay selection (MoHeD) combines geometric (height, spatial) and dynamic (mobility similarity) features to minimize expected NLOS shadowing: $V_{\rm NLOS}(r) = \sum_{i=1}^n L_{\rm v-shadow, i} S_i$ with sensor and communication link selection rules based on real-time environment prediction (Li et al., 2023). Among possible relaying schemes, mobility-based relaying delivers the highest PRR and lowest collision rates in realistic traffic (Li et al., 2023).
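The cost $V_{\rm NLOS}(r)$ can be turned into a selection rule in a few lines. The sketch below is a minimal illustration, not the MoHeD implementation. It assumes that $S_i$ is a per-obstacle weight in [0, 1] (for example, how likely obstacle $i$ is to block the link over the prediction horizon) and that each candidate relay comes with a precomputed list of obstructing vehicles. The `Obstruction` and `select_relay` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Obstruction:
    shadow_loss_db: float  # L_{v-shadow,i}: shadowing loss caused by obstacle i
    weight: float          # S_i: ASSUMED to be a blocking weight in [0, 1]


def nlos_cost(obstructions: List[Obstruction]) -> float:
    """V_NLOS(r) = sum_i L_{v-shadow,i} * S_i for one candidate relay r."""
    return sum(o.shadow_loss_db * o.weight for o in obstructions)


def select_relay(candidates: Dict[str, List[Obstruction]]) -> Tuple[str, float]:
    """Return the relay id with the smallest expected NLOS cost, and that cost."""
    if not candidates:
        raise ValueError("no relay candidates")
    best = min(candidates, key=lambda relay_id: nlos_cost(candidates[relay_id]))
    return best, nlos_cost(candidates[best])


# Illustrative values only: a low sedan sees two likely blockers,
# a tall truck sees one unlikely blocker.
candidates = {
    "sedan": [Obstruction(20.0, 0.8), Obstruction(15.0, 0.5)],
    "truck": [Obstruction(20.0, 0.1)],
}
relay_id, cost = select_relay(candidates)
print(relay_id, round(cost, 2))
```

A real system would have to derive the per-obstacle losses and weights from predicted geometry and mobility. Here they are simply given as inputs.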

3. Fusion and Matching Methodologies

Central to end-to-end V2X cooperation is the matching and fusion of perceptions across agents under bandwidth constraints. One approach is the Abstract Perception Matrix Matching (APMM) algorithm:

Identify blind-zone patches $B_{\rm ego}$ in the ego APM
For each candidate sharing node, spatially align APMs and compute coverage benefit
Select sharing node maximizing added coverage, subject to a benefit threshold

Complexity scales with the APM matrix size and the sliding-window size, and matching typically completes on the order of 10 ms (Li et al., 2023).
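The three matching steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the APMM algorithm as published. An APM is modeled as a grid of per-cell LiDAR return counts, blind cells are those below an assumed `min_return` threshold, and each candidate's alignment is taken as a known integer cell offset rather than searched with a sliding window. All names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Grid = List[List[float]]   # rows x cols of per-cell LiDAR return counts
Cell = Tuple[int, int]


def blind_cells(apm: Grid, min_return: float) -> Set[Cell]:
    """Step 1: cells of the ego APM whose return count is below the threshold."""
    return {(r, c) for r, row in enumerate(apm)
            for c, value in enumerate(row) if value < min_return}


def coverage_benefit(blind: Set[Cell], cand_apm: Grid,
                     offset: Cell, min_return: float) -> int:
    """Step 2: number of ego blind cells that the candidate observes.

    offset = (dr, dc) means candidate cell (r - dr, c - dc) overlaps ego
    cell (r, c). Cells outside the candidate grid add nothing.
    """
    dr, dc = offset
    rows, cols = len(cand_apm), len(cand_apm[0])
    gain = 0
    for r, c in blind:
        cr, cc = r - dr, c - dc
        if 0 <= cr < rows and 0 <= cc < cols and cand_apm[cr][cc] >= min_return:
            gain += 1
    return gain


def select_sharing_node(ego_apm: Grid,
                        candidates: Dict[str, Tuple[Grid, Cell]],
                        min_return: float = 1.0,
                        min_benefit: int = 1) -> Optional[str]:
    """Step 3: candidate with the largest added coverage, or None if no
    candidate reaches the benefit threshold."""
    blind = blind_cells(ego_apm, min_return)
    best_id, best_gain = None, 0
    for node_id, (apm, offset) in candidates.items():
        gain = coverage_benefit(blind, apm, offset, min_return)
        if gain >= min_benefit and gain > best_gain:
            best_id, best_gain = node_id, gain
    return best_id


ego = [[5, 5, 0],
       [5, 0, 0],
       [5, 5, 5]]
candidates = {
    "A": ([[0, 0, 3], [0, 0, 0], [0, 0, 0]], (0, 0)),  # covers one blind cell
    "B": ([[4, 4], [4, 4]], (0, 1)),                   # covers all three
}
print(select_sharing_node(ego, candidates))
```

The benefit threshold keeps the ego vehicle from requesting data that would fill only a negligible part of its blind zone, which matters when every transmission competes for limited bandwidth.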

More generally, fusion strategies are multi-level and may involve:

Perception-level fusion: instance (object or map) or dense (e.g., occupancy) queries, temporally and spatially aligned, attention-weighted and matched via assignment (e.g., Hungarian algorithm).
Prediction-level fusion: shared multimodal trajectory queries, with weighted aggregation reflecting confidence or bandwidth allocation.
Planning-level fusion: cross-modal co-attention, domain adaptation, and end-to-end differentiable decoders (Song et al., 12 Nov 2025, Yu et al., 2024, Yin et al., 17 Sep 2025).
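To make the assignment step in perception-level fusion concrete, the sketch below pairs object centers reported by two agents so that the total distance is minimal. For readability it uses an exhaustive search over permutations, which is only practical for a handful of instances. The Hungarian algorithm finds an assignment of the same minimal cost in polynomial time. The `match_instances` name, the 2D centers, and the `max_dist` gate are assumptions of this illustration.

```python
import itertools
import math
from typing import List, Tuple

Point = Tuple[float, float]


def match_instances(ego: List[Point], other: List[Point],
                    max_dist: float) -> List[Tuple[int, int]]:
    """One-to-one matching (ego_index, other_index) with minimal total distance.

    Pairs farther apart than max_dist are dropped after the assignment,
    so they stay unmatched instead of being fused.
    """
    if len(ego) > len(other):
        # Permute the longer list: solve the mirrored problem, then flip pairs.
        return sorted((i, j) for j, i in match_instances(other, ego, max_dist))
    best_cost, best_perm = math.inf, ()
    for perm in itertools.permutations(range(len(other)), len(ego)):
        cost = sum(math.dist(ego[i], other[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(i, j) for i, j in enumerate(best_perm)
            if math.dist(ego[i], other[j]) <= max_dist]


# Object centers already transformed into a common frame (assumed).
ego_objects = [(0.0, 0.0), (10.0, 0.0)]
infra_objects = [(10.5, 0.2), (0.3, -0.1), (50.0, 50.0)]
print(match_instances(ego_objects, infra_objects, max_dist=2.0))
```

Unmatched instances, such as the infrastructure detection at (50, 50), are the ones that extend the ego vehicle's field of view. Matched pairs are candidates for attention-weighted fusion.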

Mixture-of-Experts (MoE) architectures further enhance representational diversity in BEV encoders and decoders, yielding significant improvements in perception (mAP +39.7%), prediction (minADE –7.2%), and planning (L2 error –33.2%) (Song et al., 12 Nov 2025).

4. Communication System Design: Technologies and Standards

The end-to-end V2X challenge is fundamentally tied to communication system design. Use-case taxonomies (cooperative awareness, cooperative sensing, cooperative maneuver, VRU, traffic efficiency, teleoperation) impose sharply varying demands: required end-to-end latency spans <3 ms (platooning) to >1 s (fleet updates), with corresponding reliability ($P_{\rm succ}$) and per-vehicle data rate ($R_b$) requirements (Boban et al., 2017).

No single legacy radio access technology (RAT) meets the strictest combined requirements of low latency, high reliability, and high data rate. The next-generation 5G V2X system incorporates:

Sub-6 GHz NR for ultra-reliable low-latency control and event messaging
mmWave (FR2) for high-throughput sensor sharing in LOS, short-range regimes (1 Gbps+), but subject to fading/blockage/NLOS
Visible light V2X for very short-range, high-precision positioning
Multi-access orchestration and edge computing (MEC) for system-wide planning and jamming-resilient data routing (Boban et al., 2017, Zugno et al., 2019).

Heterogeneous orchestration enables dynamic load steering across RATs, joint scheduling (e.g., beam-formed mmWave for trucks, fallback to DSRC in NLOS), and co-simulation of physical and application-layer metrics, as demonstrated in CARLA+Veins+AutoCastSim pipelines (Li et al., 2023).
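A toy version of load steering across RATs might look like the sketch below. The selection policy (choose the feasible RAT with the smallest sufficient data rate, so that high-capacity links stay free) is an assumption made for illustration. Every number in the profiles is a placeholder rather than a measured or standardized value, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RatProfile:
    name: str
    latency_ms: float
    reliability: float   # packet success probability
    rate_mbps: float
    needs_los: bool      # True if the link is unusable without line of sight


@dataclass(frozen=True)
class Requirement:
    max_latency_ms: float
    min_reliability: float
    min_rate_mbps: float


def pick_rat(rats: List[RatProfile], req: Requirement,
             has_los: bool) -> Optional[RatProfile]:
    """Return the feasible RAT with the smallest sufficient rate, or None."""
    feasible = [
        r for r in rats
        if (has_los or not r.needs_los)
        and r.latency_ms <= req.max_latency_ms
        and r.reliability >= req.min_reliability
        and r.rate_mbps >= req.min_rate_mbps
    ]
    return min(feasible, key=lambda r: r.rate_mbps) if feasible else None


# PLACEHOLDER profiles for illustration only.
RATS = [
    RatProfile("DSRC", latency_ms=20.0, reliability=0.95, rate_mbps=6.0, needs_los=False),
    RatProfile("sub-6 NR", latency_ms=5.0, reliability=0.999, rate_mbps=50.0, needs_los=False),
    RatProfile("mmWave", latency_ms=5.0, reliability=0.99, rate_mbps=1000.0, needs_los=True),
]

raw_sensor_sharing = Requirement(max_latency_ms=50.0, min_reliability=0.9, min_rate_mbps=200.0)
compressed_sharing = Requirement(max_latency_ms=50.0, min_reliability=0.9, min_rate_mbps=5.0)

print(pick_rat(RATS, raw_sensor_sharing, has_los=True))    # mmWave profile
print(pick_rat(RATS, raw_sensor_sharing, has_los=False))   # None: nothing feasible
print(pick_rat(RATS, compressed_sharing, has_los=False))   # DSRC profile
```

When nothing is feasible, the application has to relax its requirement, for instance by sending compressed features instead of raw sensor data. This mirrors the fallback behavior described above.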

5. Benchmarks, Evaluation Protocols, and Experimental Findings

Systematic evaluation is supported by large-scale, realistic simulation and real-world datasets:

V2Xverse: closed-loop urban driving, CARLA-based with realistic sensors, V2X channel models (latency, loss, pose errors), reproducible safety-critical events (Liu et al., 2024).
DAIR-V2X, V2X-Seq-SPD: multi-agent, multi-modal sequences with 3D object boxes, occupancy, and ground-truth future trajectories (Yu et al., 2024, Yang et al., 26 Dec 2025).

Challenge design distinguishes between cooperative temporal perception (mAP, AMOTA) and cooperative planning (L2 error, collision, off-road rate). For reference in DAIR-V2X-based competitions:

The UniV2X framework reduces the planning collision rate from 0.89% (NoFusion) to 0.49% and raises perception mAP from 0.165 to 0.295, using hybrid sparse-dense transmission at 8.09×10⁵ B/s (Yu et al., 2024).
MAP achieves an additional 16.6% reduction in L2 error and a 56.2% drop in off-road rate relative to UniV2X, primarily via plan-guided map feature fusion and dynamic weighting (Yin et al., 17 Sep 2025).
V2X-VLM, leveraging vision-language models (VLMs) and contrastive alignment, surpasses prior methods with a 1.22 m L2 error and a 0.01% collision rate, at 1.24×10⁷ B/s (You et al., 2024).
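The planning metrics quoted above can be sketched in a few lines. Exact definitions vary between benchmarks (for example, whether L2 error is averaged over the horizon or reported at fixed time steps), so this horizon-averaged version is an assumption and not the official DAIR-V2X evaluation code. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Waypoint = Tuple[float, float]


def l2_error(planned: Sequence[Waypoint], truth: Sequence[Waypoint]) -> float:
    """Mean Euclidean distance (m) between planned and ground-truth waypoints."""
    if len(planned) != len(truth) or not planned:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(planned, truth)) / len(planned)


def rate_percent(flags: Sequence[bool]) -> float:
    """Percentage of evaluated samples flagged (e.g. collision or off-road)."""
    if not flags:
        raise ValueError("no samples")
    return 100.0 * sum(flags) / len(flags)


def relative_change_percent(baseline: float, value: float) -> float:
    """Signed relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


planned = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(l2_error(planned, truth))                # (0 + 1 + 2) / 3 = 1.0
print(rate_percent([False, False, True, False]))  # 25.0

# Relative view of the UniV2X numbers quoted above.
print(round(relative_change_percent(0.89, 0.49), 1))    # collision rate
print(round(relative_change_percent(0.165, 0.295), 1))  # perception mAP
```

In relative terms, the UniV2X figures correspond to roughly a 45% drop in collision rate and a 79% gain in mAP over the NoFusion baseline.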

Advanced architectures (LET-VIC, XET-V2X) further demonstrate the value of temporal and calibration-adaptive fusion for 3D tracking and detection, with mAP and AMOTA increases exceeding 15% over previous baselines (Yang et al., 2024, Yang et al., 26 Dec 2025).

6. System Robustness, Adaptation, and Open Research Directions

Robustness to bandwidth variation, delay, NLOS, and agent heterogeneity is a defining constraint of the end-to-end V2X challenge. Recent work employs:

Bandwidth-aware, task-adaptive fusion: optimize fusion weights under an explicit bandwidth (B/s) budget (Hao et al., 29 Jul 2025)
Delay-robust fusion: temporal attention, feature drift compensation via QueryFlowNet/OccFlowNet (Yu et al., 2024, Yang et al., 26 Dec 2025)
Instance-level sparse query transmission with graph-based cross-agent association, minimizing bandwidth while retaining state-of-the-art accuracy (Zhong et al., 25 Jul 2025)
Closed-form communication scheduling maximizing the utility-weighted selection of feature regions under dynamic bandwidth (Liu et al., 2024)
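Utility-weighted selection under a bandwidth budget can be illustrated with a simple greedy heuristic that ranks feature regions by utility per byte. This is a stand-in chosen for clarity, not the closed-form scheduling of Liu et al. (2024). The `FeatureRegion` fields, the utility scores, and the per-frame budget are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureRegion:
    region_id: str
    utility: float    # assumed task-driven importance score (higher is better)
    size_bytes: int   # transmission cost of the region's features, must be > 0


def select_regions(regions: List[FeatureRegion],
                   budget_bytes: int) -> List[FeatureRegion]:
    """Greedy selection by utility density until the byte budget is used up.

    Regions with non-positive utility are never sent. Regions that do not
    fit in the remaining budget are skipped, and smaller ones may still fit.
    """
    ranked = sorted(regions, key=lambda r: r.utility / r.size_bytes, reverse=True)
    chosen, used = [], 0
    for region in ranked:
        if region.utility > 0 and used + region.size_bytes <= budget_bytes:
            chosen.append(region)
            used += region.size_bytes
    return chosen


regions = [
    FeatureRegion("a", utility=9.0, size_bytes=300),
    FeatureRegion("b", utility=4.0, size_bytes=100),
    FeatureRegion("c", utility=5.0, size_bytes=400),
    FeatureRegion("d", utility=0.0, size_bytes=50),
]
selected = select_regions(regions, budget_bytes=450)
print([r.region_id for r in selected])  # ['b', 'a']
```

A per-frame budget follows from a link rate by dividing by the frame rate. For a hypothetical 10 Hz pipeline, 8.09×10⁵ B/s leaves about 8.09×10⁴ bytes per frame. Re-running the selection with each new budget gives a simple form of adaptation to dynamic bandwidth.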

Open research problems include:

Realistic simulation of variable latency, channel error, and prioritization
Adaptive codes and hierarchical feature transmission
Meta-learning for agent heterogeneity and rapid adaptation
Uncertainty quantification and interpretable cross-agent attribution
Integration of rare/long-tail scenario synthesis (e.g., via prompt-driven VLMs) for robustness (You et al., 26 Jun 2025)
Closed-loop, multi-agent training with reinforcement objectives that reflect safety and route completion (Liu et al., 2024)

End-to-end V2X cooperation thus sits at the intersection of communication theory, large-scale perception and planning, and safety-critical system evaluation, with state-of-the-art systems now approaching sub-1% collision rates, high mAP/AMOTA, and operation under tight bandwidth, delay, and NLOS regimes.

7. Summary Table: State-of-the-Art System Comparison

Approach	Perception (mAP/AMOTA)	Planning L2 Error (m)	Collision Rate (%)	Bandwidth (B/s)	Key Innovation	Reference
UniV2X	0.295 / 0.239	3.43 (4.5s horiz.)	0.49	8.09×10⁵	Hybrid sparse/dense	(Yu et al., 2024)
UniMM-V2X	0.422 / 0.427	1.49	0.12	9.32×10⁵	MoE multi-level	(Song et al., 12 Nov 2025)
MAP	-	2.56	0.96	-	Map-assisted plan	(Yin et al., 17 Sep 2025)
V2X-VLM	-	1.22	0.01	1.24×10⁷	VLM, contrastive	(You et al., 2024)
LET-VIC	0.606 / 0.640	-	-	-	LiDAR, temporal attn	(Yang et al., 2024)
CoopTrack	0.390 / 0.328	-	-	5.64×10⁴	Sparse instance, GNN	(Zhong et al., 25 Jul 2025)
CoDriving	-	-	-	(adaptive)	Driving-utility comm	(Liu et al., 2024)

This summary reflects the current state of the art in end-to-end V2X system design, evaluation, and adaptation, as evidenced in recent challenge outcomes and leading research on arXiv.