- The paper introduces a comprehensive aerial-ground dataset for collaborative 3D perception, featuring 120K LiDAR frames and 1.6M annotated boxes across 400 scenes.
- It defines dual benchmarks—AGC-V2V for ground-only fusion and AGC-VUC for multi-agent collaboration—demonstrating performance gains with UAV integration.
- The data collection pipeline relies on precise sensor calibration and synchronization, and the benchmarks preserve real-world challenges such as occlusion and communication delays in complex driving conditions.
Overview of AGC-Drive: Enabling Real-World Aerial-Ground Collaboration for 3D Perception in Driving
AGC-Drive introduces a comprehensive real-world dataset for collaborative 3D perception in complex driving scenarios, focusing specifically on the integration of aerial (UAV) and ground vehicle sensing. By explicitly addressing the scarcity of datasets featuring multi-agent, multimodal, and aerial-ground collaboration with high-fidelity annotations, AGC-Drive establishes a new foundation for empirical research on collaborative autonomous perception systems.
Dataset Design and Data Collection
AGC-Drive is organized around a multi-agent sensor platform comprising two vehicles and one UAV, each equipped with high-resolution LiDAR and camera systems. The ground vehicles utilize five multi-focal cameras and 128-beam LiDARs, while the UAV operates a 32-beam LiDAR and a downward-facing camera. The vehicle and UAV platforms are time-synchronized via GPS/IMU integration, with careful spatial alignment achieved through multi-modal calibration and post-hoc point cloud registration, facilitating accurate, frame-aligned global perception.
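As a concrete illustration of frame-level alignment across agents, the minimal sketch below pairs LiDAR sweeps from two agents by nearest GPS timestamp; the tolerance value and the timestamp lists are assumptions for illustration, not AGC-Drive's released synchronization tooling.

```python
from bisect import bisect_left

def match_frames(ego_stamps, other_stamps, tol=0.05):
    """Pair each ego LiDAR timestamp with the nearest timestamp from another
    agent (vehicle or UAV), dropping pairs whose offset exceeds `tol` seconds.
    Timestamps are assumed to be GPS-synchronized floats."""
    other_sorted = sorted(other_stamps)
    pairs = []
    for t in ego_stamps:
        i = bisect_left(other_sorted, t)
        # Candidates: the closest timestamps on either side of t.
        candidates = other_sorted[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda s: abs(s - t))
        if abs(best - t) <= tol:
            pairs.append((t, best))
    return pairs

# Hypothetical per-agent timestamp lists (seconds).
vehicle_ts = [0.00, 0.10, 0.20, 0.30]
uav_ts = [0.02, 0.13, 0.21, 0.33]
print(match_frames(vehicle_ts, uav_ts))  # [(0.0, 0.02), (0.1, 0.13), ...]
```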
Key dataset properties include:
- Scale and Scope: ~120K LiDAR frames and 440K images across 400 scenes, each with ~100 frames and full 3D box annotations (1.6M boxes total; 13 object classes).
- Scene Diversity: Coverage of 14 scenario types, spanning urban, highway, and rural environments, including challenging cases such as roundabouts, tunnels, ramps, and events like vehicle cut-ins/outs.
- Dynamic Content: Nearly 20% of the data features high-interaction dynamics, directly supporting research on perception under complex traffic maneuvers.
- Occlusion Labelling: Each 3D box is annotated with visibility/occlusion levels, providing granular supervision for occlusion-aware perception modeling (see the sketch after this list).
- Open-Source Tooling: Accompanying toolkits for calibration verification, multi-agent visualization, and collaborative annotation are publicly released.
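Building on the occlusion labels above, here is a minimal sketch of grouping annotated boxes by occlusion level for occlusion-aware training or evaluation; the JSON layout and field names ("boxes", "occlusion") are hypothetical placeholders, not AGC-Drive's actual annotation schema.

```python
import json
from collections import defaultdict

def group_boxes_by_occlusion(label_path):
    """Group a frame's 3D boxes by their occlusion level.
    The file layout and field names here are hypothetical."""
    with open(label_path) as f:
        frame = json.load(f)
    groups = defaultdict(list)
    for box in frame["boxes"]:
        # e.g. 0 = fully visible, 1 = partially occluded, 2 = heavily occluded
        groups[box["occlusion"]].append(box)
    return groups

# Example: evaluate only on boxes that are at least partially visible.
# groups = group_boxes_by_occlusion("scene_0001/frame_0042.json")
# eval_boxes = groups[0] + groups[1]
```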
Benchmark Structure and Tasks
The paper defines two primary benchmarks:
- AGC-V2V: Multi-vehicle cooperative perception without UAV involvement, serving as a real-world baseline for ground-level cooperative detection.
- AGC-VUC: Multi-agent collaborative perception with both vehicles and UAV, enabling evaluation of the UAV’s top-down perspective as a complement to ground sensing.
Both benchmarks adopt the OPV2V schema for data and annotation formatting, and evaluation reports average precision at IoU thresholds of 0.5 and 0.7 (AP@0.5 and AP@0.7), along with a Δ_UAV metric that quantifies the improvement attributable to UAV participation.
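To make the Δ_UAV metric concrete, the sketch below computes it as the per-method difference between AP on AGC-VUC and AP on AGC-V2V; the method names and scores are illustrative placeholders, not the paper's reported numbers.

```python
def delta_uav(ap_v2v: dict[str, float], ap_vuc: dict[str, float]) -> dict[str, float]:
    """Improvement in AP (percentage points) attributable to adding the UAV,
    computed per method as AP on AGC-VUC minus AP on AGC-V2V."""
    return {m: round(ap_vuc[m] - ap_v2v[m], 1) for m in ap_v2v if m in ap_vuc}

# Illustrative placeholder values only.
ap_v2v = {"method_a": 30.0, "method_b": 50.0}
ap_vuc = {"method_a": 32.5, "method_b": 53.0}
print(delta_uav(ap_v2v, ap_vuc))  # {'method_a': 2.5, 'method_b': 3.0}
```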
Experimental Evaluation and Results
Six representative 3D perception frameworks are benchmarked, all using PointPillars as the detection backbone:
- Lower-bound (late fusion, detection sharing)
- Upper-bound (early fusion, raw point cloud sharing)
- V2VNet (intermediate feature fusion)
- CoBEVT (sparse transformer-based BEV feature fusion)
- Where2comm (communication-efficient confidence map sharing)
- V2X-ViT (vision transformer-based BEV feature fusion)
AGC-V2V results show a large gap between early fusion (Upper-bound: 58.2% AP@0.5, 43.1% AP@0.7) and both intermediate and late fusion methods (best intermediate: Where2comm, at 34.8% and 22.5%, respectively), reflecting the real-world impact of imperfect pose estimation and communication delays on collaborative fusion.
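To make the contrast between these fusion strategies concrete, the sketch below shows early fusion (merging registered raw point clouds before a single detection pass, in the spirit of the Upper-bound) versus late fusion (detecting per agent and merging only the resulting boxes, in the spirit of the Lower-bound). The function signatures and the detector/merge callables are placeholders, not the benchmark's actual implementation.

```python
import numpy as np

def to_shared_frame(points, pose):
    """Transform an (N, 3) point cloud into a shared frame with a 4x4 pose."""
    homo = np.c_[points, np.ones(len(points))]   # homogeneous coordinates
    return (homo @ pose.T)[:, :3]

def early_fusion(point_clouds, poses, detector):
    """Upper-bound-style early fusion: merge registered raw point clouds from
    all agents, then run a single detector on the combined cloud."""
    merged = np.vstack([to_shared_frame(p, T) for p, T in zip(point_clouds, poses)])
    return detector(merged)

def late_fusion(point_clouds, poses, detector, merge_boxes):
    """Lower-bound-style late fusion: each agent detects independently (here on
    points already expressed in the shared frame), and only the resulting boxes
    are pooled and merged, e.g. via non-maximum suppression."""
    per_agent = [detector(to_shared_frame(p, T)) for p, T in zip(point_clouds, poses)]
    return merge_boxes([b for boxes in per_agent for b in boxes])
```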
AGC-VUC results show that incorporating the UAV improves performance across most fusion strategies (with Δ_UAV up to +3.3% AP@0.5 for Upper-bound early fusion). Notably, the improvement holds even for communication-efficient and transformer-based methods, while the Lower-bound's regression (−1.0%) highlights late fusion's sensitivity to error propagation. Qualitative analysis confirms that the aerial perspective is especially beneficial for occluded and distant objects, supporting the hypothesis that UAV data meaningfully supplements ground-based perception in complex scenes.
Implementation Considerations
AGC-Drive provides the community with both the raw data and an integrated toolkit:
- Calibration and Synchronization: GPS/IMU-based initial alignment, refined via ICP for multi-agent LiDAR registration, followed by extrinsic camera-LiDAR calibration via PnP (a minimal registration sketch follows this list).
- Privacy Protection: All sensitive information, including GPS traces and human faces, is sanitized or blurred to support open data sharing.
- Computational Requirements: Each baseline model was trained for 40 epochs on 8 Nvidia L40 GPUs, with each run taking about 6 hours, keeping experiments feasible for both academic and industrial settings.
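As a rough sketch of the GPS/IMU-then-ICP registration step mentioned above, the snippet below refines an initial inter-agent transform with Open3D's point-to-point ICP; the file paths, correspondence threshold, and identity fallback are illustrative assumptions rather than the released toolkit.

```python
import numpy as np
import open3d as o3d

def refine_registration(source_path, target_path, init_T=None, max_corr_dist=0.5):
    """Refine a GPS/IMU-derived initial transform between two agents' LiDAR
    sweeps with point-to-point ICP. Paths and threshold are illustrative."""
    source = o3d.io.read_point_cloud(source_path)   # e.g. UAV sweep
    target = o3d.io.read_point_cloud(target_path)   # e.g. ego-vehicle sweep
    if init_T is None:
        init_T = np.eye(4)                          # no prior: start from identity
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, init_T,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation                    # refined source-to-target extrinsic

# Hypothetical usage: load a GPS/IMU alignment and refine it with ICP.
# T_init = np.load("calib/uav_to_vehicle_gps_imu.npy")   # 4x4, hypothetical path
# T_refined = refine_registration("uav.pcd", "vehicle.pcd", T_init)
```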
Limitations
A principal limitation is the sparseness of the airborne LiDAR data: while it aids scene-level awareness, it provides limited fine-grained object detail, especially at ground level. The UAV LiDAR's restricted vertical field of view and the blind zones directly beneath the aircraft remain technical bottlenecks, and the authors suggest upgrading the UAV sensor payload in future iterations to address this shortcoming.
Additionally, AGC-Drive deliberately retains realistic timing and registration errors to reflect the operational challenges of real-world collaborative systems. This design choice makes the benchmark results conservative estimates relative to idealized simulation-based datasets, and it highlights where future algorithms must become robust to synchronization and alignment imperfections.
Implications and Future Directions
AGC-Drive constitutes a significant empirical advance towards real-world evaluation of aerial-ground collaborative perception, opening up lines of inquiry in:
- Robustness to asynchronous and misaligned multi-agent data fusion,
- Occlusion handling and long-range detection in complex, high-interaction scenarios,
- Communication-efficient feature and object sharing strategies,
- Multi-modal annotation for increasingly sophisticated perception frameworks beyond object detection (e.g., tracking, prediction, joint intention estimation).
Future development may extend towards higher-density UAV sensing, broader environmental conditions (night, adverse weather), larger fleets (multi-UAV multi-vehicle), and expanded annotation for additional perception and action tasks.
The dataset is likely to become a touchstone for research in collaborative autonomous driving perception, providing a public, high-quality, and realistically constrained testbed for new algorithms that must handle the nuanced requirements of real-world multi-agent perception and planning.