
V2X-Seq-SPD Dataset for Urban Cooperative Driving

Updated 2 January 2026
  • V2X-Seq-SPD is a large-scale, real-world sequential perception dataset that integrates synchronized multi-agent sensor data from both vehicles and infrastructure for cooperative urban autonomous driving.
  • It provides dense temporal coverage with approximately 160,000 sequences and nearly 1 million 3D bounding-box annotations, supporting tasks like 3D detection, tracking, forecasting, and closed-loop planning.
  • The dataset achieves a notable reduction in missed detections through cooperative multi-view fusion and establishes unified benchmarks for both perception and planning challenges in urban intersections.

The V2X-Seq-SPD dataset is a large-scale, real-world, sequential perception dataset specifically designed for advancing vehicle-infrastructure cooperative (VIC) sensing and planning in autonomous urban driving. Released by Yu et al. and later adopted as the foundational data for the End-to-End V2X Cooperation Challenge (MEIS CVPR 2025), V2X-Seq-SPD provides temporally aligned, multi-agent sensor data (including vehicle and infrastructure streams) for research in 3D detection, tracking, forecasting, and closed-loop multi-agent planning. This dataset is unique in its combination of diverse sensors, continuous urban sequences, cooperative annotation, and explicit benchmark definitions, establishing it as the only publicly available resource supporting both cooperative perception and planning in a unified framework (Yu et al., 2023, Hao et al., 29 Jul 2025).

1. Dataset Composition and Scope

V2X-Seq-SPD captures synchronized multi-modality data at 28 urban intersections in Beijing, selected to represent a spectrum of intersection types (four-way, T-junctions, and complex crossings). Data were collected over 672 hours of real-world test-vehicle operation, and the recordings were fragmented into overlapping segments, resulting in approximately 80,000 vehicle-view and 80,000 infrastructure-view sequences. Each sequence spans 10 seconds and overlaps its predecessor by 5 seconds, yielding dense temporal coverage and facilitating multi-agent sequence modeling. Of these, 50,000 were selected as “cooperative-view” scenarios, emphasizing scenes where at least one target agent is within 20 meters of the ego vehicle. The full dataset encompasses on the order of 1 million 3D bounding-box annotations across eight object classes under varied lighting and occlusion conditions (Yu et al., 2023, Hao et al., 29 Jul 2025).
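
The 10-second / 5-second-overlap segmentation described above can be illustrated with a minimal sliding-window sketch; the 10 Hz frame timestamps and the helper name below are illustrative assumptions, not the dataset's actual release tooling.

```python
# Minimal sketch of splitting a continuous recording into 10 s windows with
# 5 s overlap, as described above. Timestamps and parameters are illustrative.

def segment_sequences(timestamps, window_s=10.0, stride_s=5.0):
    """Split a sorted list of frame timestamps (seconds) into overlapping windows."""
    sequences = []
    if not timestamps:
        return sequences
    t, end = timestamps[0], timestamps[-1]
    while t + window_s <= end:
        sequences.append([ts for ts in timestamps if t <= ts < t + window_s])
        t += stride_s
    return sequences

# Example: 60 s of 10 Hz frames -> 11 overlapping 10 s sequences.
frames = [i * 0.1 for i in range(601)]
print(len(segment_sequences(frames)))  # 11
```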

2. Sensor Modalities, Synchronization, and Data Formats

Sensor deployments include:

  • Infrastructure: At each intersection, 4–6 pairs of 300-beam spinning LiDARs (0.1°–0.2° angular resolution) and co-located high-resolution RGB cameras.
  • Vehicle: The ego vehicle is equipped with a 40-beam roof-top LiDAR and six wide-FOV cameras.
  • Additional Data: Vehicle ego-state (GNSS/INS-based position, orientation, velocity), traffic-light state per lane at the stop bar, and navigation commands.
  • Sensor Rates: LiDAR: 10 Hz; Cameras: typically 20–30 Hz (in the challenge dataset, camera and LiDAR both at 10 Hz).

All sensors on a given node (vehicle or RSU) are synchronized via hardware trigger, ensuring sub-millisecond temporal alignment. Extrinsic calibrations for all sensor pairs are supplied as homogeneous transforms to a common world coordinate frame:

$$T_{c \rightarrow w} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}, \quad R \in \mathrm{SO}(3),\ t \in \mathbb{R}^3$$

Raw data and annotations are organized in per-intersection, per-sequence directories, with modalities including PNG images, binary LiDAR files (XYZI), and calibration JSON/YAML files (Yu et al., 2023, Hao et al., 29 Jul 2025).
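
As a rough illustration of how such files can be consumed, the sketch below reads a binary XYZI point cloud and applies a homogeneous transform $T_{c \rightarrow w}$ to map it into the world frame. The calibration JSON keys ("rotation", "translation") and file-layout details are assumptions for illustration, not the official loader.

```python
# Sketch: load a float32 XYZI point cloud and transform it to the world frame
# with a 4x4 homogeneous transform. Calibration keys here are assumed.
import json
import numpy as np

def load_xyzi(path):
    """Read an N x 4 float32 point cloud (x, y, z, intensity)."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

def load_transform(calib_json):
    """Build T_{c->w} from a calibration file assumed to hold a 3x3 R and 3x1 t."""
    with open(calib_json) as f:
        calib = json.load(f)
    T = np.eye(4)
    T[:3, :3] = np.asarray(calib["rotation"])
    T[:3, 3] = np.asarray(calib["translation"]).ravel()
    return T

def to_world(points_xyzi, T_cw):
    """Transform sensor-frame points into the common world frame, keeping intensity."""
    xyz1 = np.hstack([points_xyzi[:, :3], np.ones((len(points_xyzi), 1))])
    world = (T_cw @ xyz1.T).T[:, :3]
    return np.hstack([world, points_xyzi[:, 3:4]])
```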

3. Annotation Schema and Quality Assurance

Annotations follow a unified, KITTI-style 3D bounding-box parameterization $[x, y, z, l, w, h, \theta]$, where $(x, y, z)$ is the box center in world coordinates, $(l, w, h)$ are the object dimensions, and $\theta$ is the yaw angle about the z-axis. Each object carries a persistent trajectory ID (across views and time) and a class label (one of eight categories: car, bus, truck, pedestrian, cyclist, motorcycle, etc.). For each vehicle, a 5-second future trajectory (sampled at 10 Hz) is annotated to support prediction and planning studies. Traffic-signal states and lane-level HD map features (lane centerlines, crosswalk polygons) are provided per frame.
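
For reference, a box in this parameterization can be expanded into its eight corners as in the following sketch; the corner ordering and the yaw-about-z convention are assumptions consistent with the description above, not verified against an official devkit.

```python
# Sketch: expand [x, y, z, l, w, h, theta] into 8 corner points (world frame).
import numpy as np

def box_corners(x, y, z, l, w, h, theta):
    """Return an 8 x 3 array of box corners for a center-parameterized box."""
    dx, dy, dz = l / 2.0, w / 2.0, h / 2.0
    # Axis-aligned corners around the origin, then rotate by yaw and translate.
    corners = np.array([[ dx,  dy,  dz], [ dx,  dy, -dz],
                        [ dx, -dy,  dz], [ dx, -dy, -dz],
                        [-dx,  dy,  dz], [-dx,  dy, -dz],
                        [-dx, -dy,  dz], [-dx, -dy, -dz]])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # yaw about +z
    return corners @ R.T + np.array([x, y, z])
```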

Quality assurance includes second-pass human verification of occluded or truncated object labels, and cross-annotator agreement exceeding 95% on object centroids and dimensions for a statistical subset (Yu et al., 2023, Hao et al., 29 Jul 2025).

4. Benchmark Tasks and Evaluation Metrics

V2X-Seq-SPD defines benchmarks for both perception and planning:

  • VIC3D Tracking: Joint 3D multi-agent tracking across infrastructure and vehicle views.

    • Metrics: Multiple Object Tracking Accuracy (MOTA) and IDF1:

    $$\text{MOTA} = 1 - \frac{\text{FP} + \text{FN} + \text{ID switches}}{\text{GT}}$$

    $$\text{IDF1} = \frac{2 \cdot \text{IDTP}}{2 \cdot \text{IDTP} + \text{IDFP} + \text{IDFN}}$$

  • Online-VIC Forecasting: Predicting future agent trajectories after an observation window.

    • Metrics: Average Displacement Error (ADE) and Final Displacement Error (FDE):

    $$\text{ADE} = \frac{1}{T_\text{pred}} \sum_{t=1}^{T_\text{pred}} \| \hat{p}_t - p_t \|_2$$

    $$\text{FDE} = \| \hat{p}_{T_\text{pred}} - p_{T_\text{pred}} \|_2$$

  • Offline-VIC Forecasting: Fuses full past infrastructure and vehicle views for forecasting; metrics as above.
  • End-to-End Planning (Challenge): Evaluated by L2 waypoint error, collision rate (fraction of colliding trajectories), and off-road rate (fraction of predicted trajectory points outside drivable lanes). The overall planning score is a weighted combination: $0.5 \cdot \text{NormL2} + 0.25 \cdot \text{NormCollision} + 0.25 \cdot \text{NormOffRoad}$ (see the metric sketch after this list).
  • Temporal Perception (Challenge): mAP @ IoU = 0.5 and average multi-object tracking accuracy (AMOTA), combined as $0.5 \cdot \text{mAP} + 0.5 \cdot \text{AMOTA}$.
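
The displacement metrics and the weighted planning score above can be computed with a short sketch like the following; the input shapes and the assumption that the L2, collision, and off-road terms are already normalized to [0, 1] are illustrative, not the challenge's official evaluation code.

```python
# Sketch of ADE/FDE and the weighted planning-score combination defined above.
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T_pred, 2) arrays of future waypoints in meters."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return dists.mean(), dists[-1]  # ADE (mean error), FDE (endpoint error)

def planning_score(norm_l2, norm_collision, norm_offroad):
    """Weighted combination used for the end-to-end planning track (inputs assumed normalized)."""
    return 0.5 * norm_l2 + 0.25 * norm_collision + 0.25 * norm_offroad

# Example: a 5 s horizon at 10 Hz gives T_pred = 50 waypoints.
pred = np.cumsum(np.full((50, 2), 0.10), axis=0)
gt = np.cumsum(np.full((50, 2), 0.11), axis=0)
ade, fde = ade_fde(pred, gt)
```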

Baseline and top-performing solutions are established in the challenge, including UniV2X (sparse BEV fusion and transformer-based planning) and novel cooperative perception models (e.g., SparseCoop, which introduces anchor-aided queries and denoising for notable tracking improvements) (Yu et al., 2023, Hao et al., 29 Jul 2025).

5. Key Technical Features and Use Cases

A primary motivation for V2X-Seq-SPD is to address fundamental limitations in single-agent perception due to occlusion and limited range. Infrastructure-mounted LiDARs and cameras provide extended field of view and line-of-sight advantages, allowing for "see around the corner" sensing, while vehicle-mounted sensors capture proximal detail. Cooperative multi-view fusion reduces missed detections by over 40% relative to any single viewpoint alone. Additionally, static overhead infrastructure sensors capture fast-approaching vehicles further upstream, enabling longer-horizon trajectory forecasting. Cooperative annotation ensures trajectories are consistent even in complex occlusion scenarios (e.g., cyclists hidden behind buses, but visible via infrastructure views), supporting robust tracking benchmarks (Yu et al., 2023).

6. Comparisons, Strengths, and Limitations

Compared to other V2X datasets (e.g., DAIR-V2X, TUMTraf, V2V4Real, V2X-Sim), V2X-Seq-SPD exclusively comprises real-world traffic, with synchronized multi-agent data supporting both perception and closed-loop planning benchmarks in the same urban driving context. Its strengths include high-fidelity LiDAR and camera data across heterogeneous nodes (vehicles and RSUs), precise full-scene calibration, dense cooperative annotation, explicit future trajectory labels, and unified splits supporting reproducible benchmarking.

The dataset’s principal limitations are its focus on clear daytime conditions (with weather and nighttime data left for future extensions), the absence of radar modalities, and no inclusion of explicit wireless channel impairments (with communication constraints only simulated in higher-level experiments). All benchmark splits and file structures are standardized through the UniV2X framework, facilitating broad adoption and comparative research (Hao et al., 29 Jul 2025).

7. Research Opportunities and Future Directions

V2X-Seq-SPD enables novel research directions, including multi-agent 3D fusion architectures, robust calibration drift detection and self-calibration, closed-loop perception-planning loops incorporating traffic signal reasoning, and domain adaptation from simulation to reality. The dataset’s unique combination of continuous, real-world sensor data, cooperative multi-agent annotation, and joint perception-planning benchmarks positions it as a critical asset for the development and evaluation of scalable, reliable V2X-cooperative urban autonomous driving systems (Yu et al., 2023, Hao et al., 29 Jul 2025).
