Air-Ground Cross-View Perception
- Air–ground cross-view perception is a multimodal integration framework that fuses data from ground and aerial sensors to improve detection, localization, and mapping.
- It leverages diverse sensor configurations, calibration techniques, and fusion methods to achieve robust performance in dynamic, complex environments.
- Key applications include autonomous driving, GNSS-denied localization, and digital twin development, while challenges remain in scalability, real-time operation, and privacy preservation.
Air–ground cross-view perception refers to a class of multimodal sensing, representation, and inference methodologies that fuse or align observations from ground-based (ego-vehicle, roadside, UGV) and aerial (UAV, satellite, drone) sensors. This paradigm aims to overcome the limitations of single-perspective egocentric vision and to enable robust, scalable, and context-aware scene understanding, localization, and navigation across diverse operational environments. Key applications include V2X collaborative perception for autonomous driving, GNSS-denied geo-localization, cross-domain semantic mapping, and embodied multi-agent intelligence.
1. Sensor Configurations & Calibration Paradigms
Air–ground cross-view perception systems rely on heterogeneous sensor suites whose spatial and temporal alignment is fundamental to cross-domain integration. In a representative full V2X configuration, vehicle platforms carry a 360° 64-channel LiDAR (20 Hz, vertical FOV +10°/−38°), six surround RGB cameras (1280×720, 110° FOV each), and full 6-DOF GNSS/IMU pose tracking. Roadside units (RSUs) employ comparable LiDAR/camera setups but lack an IMU, while drones mount a 64-channel LiDAR with 360° coverage (vertical FOV −30°/−90°) and a single downward-facing 110° camera with GNSS/IMU (Gao et al., 24 Jun 2025).
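All agent frames are referenced to a global world frame W, with rigid-body extrinsics between any pair of sensors i and j computed as:

$$T_{i \to j} = T_{j \to W}^{-1}\, T_{i \to W}, \qquad T_{i \to W} = \begin{bmatrix} R_i & t_i \\ \mathbf{0}^\top & 1 \end{bmatrix},$$

where $R_i \in SO(3)$, $t_i \in \mathbb{R}^3$.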
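The frame bookkeeping described above can be sketched in a few lines. This is a minimal illustration, assuming yaw-only rotations and made-up poses; the helper names (`make_pose`, `invert_pose`, `relative_extrinsic`, `transform_point`) are hypothetical and do not come from any of the cited datasets.

```python
import math

def make_pose(yaw_rad, tx, ty, tz):
    """4x4 homogeneous agent-to-world pose (assumption: rotation about z only)."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_pose(T):
    """Rigid-body inverse: [R t; 0 1]^-1 = [R^T, -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_extrinsic(T_src_to_world, T_dst_to_world):
    """Transform that maps points from the source frame into the destination frame."""
    return matmul(invert_pose(T_dst_to_world), T_src_to_world)

def transform_point(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))

# Illustrative poses (not from any dataset): a vehicle on the ground and a
# drone hovering 50 m above it, rotated 90 degrees in yaw.
T_vehicle = make_pose(0.0, 10.0, 0.0, 0.0)
T_drone = make_pose(math.pi / 2, 10.0, 0.0, 50.0)
T_vehicle_to_drone = relative_extrinsic(T_vehicle, T_drone)
point_in_drone = transform_point(T_vehicle_to_drone, (1.0, 0.0, 0.0))
```

Routing every pairwise transform through the world frame means each agent only has to publish its own pose; no pairwise calibration table is needed.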
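Intrinsic camera calibration follows a standard pinhole model:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix},$$

with focal lengths $(f_x, f_y)$, principal point $(c_x, c_y)$, and projective scale $s = Z_c$. Extrinsic calibration between camera and LiDAR, and between vehicles and drones, uses checkerboard targets and minimizes reprojection error. Planar homographies between bird’s-eye (aerial) and perspective frames are required for label and region warping.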
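As a hedged illustration of the pinhole projection and homography warping mentioned above, the sketch below derives a focal length from the 1280×720, 110° FOV camera specification. It assumes square pixels, a centered principal point, no lens distortion, and that the 110° figure is the horizontal FOV. The function names and the example homography are hypothetical.

```python
import math

def focal_from_fov(image_width_px, horizontal_fov_deg):
    """Focal length in pixels for an ideal pinhole camera with the given horizontal FOV."""
    return (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)

def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame (z forward) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0.0:
        return None  # point is behind the image plane
    return (fx * x / z + cx, fy * y / z + cy)

def apply_homography(H, uv):
    """Warp a pixel with a 3x3 planar homography, normalizing by the third coordinate."""
    u, v = uv
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return (x / w, y / w)

# Intrinsics implied by a 1280x720 image with a 110 degree horizontal FOV
# (assumption: principal point at the image center).
fx = fy = focal_from_fov(1280, 110.0)
cx, cy = 640.0, 360.0

center = project_pinhole((0.0, 0.0, 10.0), fx, fy, cx, cy)

# Hypothetical homography: scale by 2 and shift by (1, 3) pixels.
H = [[2.0, 0.0, 1.0], [0.0, 2.0, 3.0], [0.0, 0.0, 1.0]]
warped = apply_homography(H, (1.0, 1.0))
```

In practice the homography would be estimated from ground-plane correspondences between the aerial and perspective views; here it is hard-coded only to show the warping step.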
Precise spatio-temporal synchronization is handled at the middleware or simulation layer. CARLA-Air uses process-level integration, in which the ground and AirSim stacks step in lock-step under a unified physics renderer (Zeng et al., 30 Mar 2026), while TranSimHub exposes a globally accessible world clock (Wang et al., 17 Oct 2025).
2. Datasets, Benchmarks, and Evaluation Protocols
The development of robust air–ground cross-view perception has accelerated due to several high-fidelity datasets:
- AirV2X-Perception: 6.73 hours of synchronized multi-agent LiDAR, camera, GNSS/IMU data across rural/urban, day/night, and diverse weather (clear, rain, fog); up to 15 agents per scene; supports 3D object detection, BEV semantic segmentation, depth estimation, and tracking (Gao et al., 24 Jun 2025).
- Griffin: 30,000 frames, 205 dynamic scenes, UAV altitudes 20–60 m, multi-weather, occlusion-aware 3D annotations, and communication-focused benchmarks (Wang et al., 10 Mar 2025).
- CVFM: 32,509 ground–aerial pairs with dense pixel-level correspondences; establishes fine-grained image matching benchmarks (Xia et al., 14 Aug 2025).
Evaluation metrics include 3D detection AP at multiple IoU thresholds (e.g., AP30, AP50), segmentation mean Intersection-over-Union (mIoU), AP for occluded and far-object cases, communication cost in bytes per second (BPS), latency tolerance (performance at 0–400 ms delay), and robustness to UAV altitude or attitude perturbations (Gao et al., 24 Jun 2025, Wang et al., 10 Mar 2025). Comparative evaluation also covers localization error (mean/median), recall@K for retrieval, and the precision of semantic transfer or cross-modal object association.
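Two of the metrics listed above, mIoU and recall@K, are simple enough to sketch directly. The following is a minimal reference computation on toy inputs, assuming flat label lists and exactly one correct gallery item per query. It is not the evaluation code of any cited benchmark, and the function names are illustrative.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label sets
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def recall_at_k(ranked_ids, true_ids, k):
    """Fraction of queries whose correct gallery id is among the top-k retrievals."""
    hits = sum(1 for ranked, true in zip(ranked_ids, true_ids) if true in ranked[:k])
    return hits / len(true_ids)

# Toy segmentation labels (four pixels, two classes).
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=2)  # (1/2 + 2/3) / 2

# Toy retrieval: two ground queries, each ranked against three aerial tiles.
ranked = [[3, 1, 2], [2, 3, 1]]
truth = [1, 1]
r_at_2 = recall_at_k(ranked, truth, k=2)
```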
3. Cross-View Fusion and Localization Methodologies
3.1. Collaborative Detection & Tracking
AirV2X-Perception (Gao et al., 24 Jun 2025) and Griffin (Wang et al., 10 Mar 2025) define a range of LiDAR-based fusion algorithms:
| Method | Fusion Type | Key Loss / Mechanism | VRAM (GB) | AP30 (%) |
|---|---|---|---|---|
| When2com | Communication graph | Learned α_{ij} weight fusion | 8.2 | 23.0 |
| Where2comm | Spatial conf. upsample | Top-K region exchange | 10.2 | 44.8 |
| CoBEVT | Sparse Transformer BEV | L_det + μ L_seg | 38.1 | 42.9 |
| V2XViT | Vision Transformer | Multi-head attention | 43.5 | 46.4 |
| HEAL | Heterogeneous Align. | L_det + μ L_seg + L_align | 12.4 | 49.2 * |
| STAMP | Adapter-reverter | Model-agnostic fusion | 10.1 | 47.9 |
(*best overall) (Gao et al., 24 Jun 2025)
Ablation studies consistently confirm that adding aerial agents yields gains of 3–8 AP points over “vehicle+infra” settings and 15–30 points over vehicle-only settings. The effect is strongest at night: for example, HEAL’s AP30 on night scenes climbs from ~32.8% to 49.2% when drones are added.
Griffin’s AGILE algorithm (Wang et al., 10 Mar 2025) enacts instance-level intermediate fusion of BEV queries and supports cross-agent temporal tracking via feature-aligned queries. This design achieves substantial AP and AMOTA gains relative to single-view or naive fusion baselines, while maintaining communication efficiency.
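The "Top-K region exchange" idea listed for Where2comm in the table can be illustrated with a toy sketch: an aerial agent transmits only its most confident BEV cells, and the ego vehicle merges them into its own feature map. This is a simplified stand-in under stated assumptions, not the published method. In particular, element-wise max fusion replaces the learned attention fusion, and all names and values are hypothetical.

```python
def select_top_k_cells(confidence, k):
    """Return (row, col) indices of the k highest-confidence BEV cells."""
    cells = [(confidence[r][c], r, c)
             for r in range(len(confidence))
             for c in range(len(confidence[0]))]
    cells.sort(key=lambda item: -item[0])  # highest confidence first
    return [(r, c) for _, r, c in cells[:k]]

def sparse_message(features, confidence, k):
    """Build the sparse message a sender would transmit: cell index -> feature vector."""
    return {(r, c): features[r][c] for r, c in select_top_k_cells(confidence, k)}

def fuse_max(ego_features, message):
    """Merge received cells into the ego BEV map by element-wise max (assumption)."""
    fused = [[list(cell) for cell in row] for row in ego_features]
    for (r, c), feat in message.items():
        fused[r][c] = [max(a, b) for a, b in zip(fused[r][c], feat)]
    return fused

# Toy 2x2 BEV grid with one feature channel per cell.
drone_conf = [[0.1, 0.9], [0.8, 0.2]]
drone_feat = [[[1.0], [5.0]], [[7.0], [1.0]]]
ego_feat = [[[2.0], [2.0]], [[2.0], [2.0]]]

message = sparse_message(drone_feat, drone_conf, k=2)
fused = fuse_max(ego_feat, message)
```

Only two of the four cells cross the link in this example, which is the source of the bandwidth saving. The cost is that low-confidence regions of the sender's view never reach the receiver.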
3.2. Cross-View Localization
Contemporary cross-view localization approaches span retrieval-based, BEV-synthesis, and 3-DoF pose estimation regimes:
- Fine-grained pose estimation: FG² (Fine-Grained Feature Matching) constructs explicit BEV point correspondences from learned vertical lifting, applies deformable attention per ground pixel, and registers the resultant 2D planes via Procrustes alignment; delivers 25–28% lower localization error versus state-of-the-art (Xia et al., 24 Mar 2025).
- BEV synthesis and windowed matching: W2W-BEV learns a window-to-window context-aware correspondence between pixel-lifted BEV features and aerial images, robust under limited-FoV and unknown orientation, achieving 17–18% higher recall than prior state-of-the-art at 90° FoV (Cheng et al., 2024).
- Surface-model-based matching: CVFM incorporates a learnable visible-region surface model and a SimRefiner neural matching module, yielding substantial improvements in both precise correspondence (e.g., 24% top-30 matches within 15 px) and localization accuracy (Xia et al., 14 Aug 2025).
- Orientation-aware Siamese representations: Dynamic Similarity Matching aligns ground–aerial feature volumes via polar transform and computes orientation offset by maximizing a 1D cross-correlation, with outsized gains for unknown orientation or narrow-FoV queries (Shi et al., 2020, Xia et al., 2023).
- Semantic particle filtering: Systems such as those of Miller et al. (2022) and Dixit et al. (2020) treat aerial–ground matches as odometric “sensor” measurements and fuse them with UGV dead reckoning in a probabilistic framework, achieving consistent sub-meter accuracy in online robotic deployments.
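Two geometric primitives recur in the approaches listed above: rigid Procrustes alignment of matched 2D points, and orientation recovery by maximizing a 1D circular cross-correlation over azimuth bins. The sketch below shows textbook versions of both on toy data. It assumes unweighted correspondences and integer bin shifts, so it should be read as an illustration of the underlying math rather than the FG² or Dynamic Similarity Matching implementations.

```python
import math

def procrustes_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst (no scale)."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        px, py, qx, qy = px - sx, py - sy, qx - dx, qy - dy
        dot += px * qx + py * qy      # cosine coefficient of the alignment score
        cross += px * qy - py * qx    # sine coefficient of the alignment score
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the destination centroid.
    return theta, (dx - (c * sx - s * sy), dy - (s * sx + c * sy))

def best_circular_shift(aerial_desc, ground_desc):
    """Shift (in azimuth bins) of the aerial descriptor that best matches the ground one."""
    n = len(aerial_desc)

    def score(shift):
        return sum(aerial_desc[(i + shift) % n] * ground_desc[i] for i in range(n))

    return max(range(n), key=score)

# Toy correspondences: dst is src rotated by 30 degrees and translated by (2, -1).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
angle = math.radians(30.0)
dst = [(math.cos(angle) * x - math.sin(angle) * y + 2.0,
        math.sin(angle) * x + math.cos(angle) * y - 1.0) for x, y in src]
theta, translation = procrustes_2d(src, dst)

# Toy azimuth descriptors with 8 bins (45 degrees per bin).
aerial = [0, 1, 3, 1, 0, 0, 0, 0]
ground = [3, 1, 0, 0, 0, 0, 0, 1]
shift = best_circular_shift(aerial, ground)
yaw_offset_deg = shift * 360.0 / len(aerial)
```

The closed-form angle avoids an SVD in the 2D case. In 3D, or with per-point weights, the usual SVD-based solution would be needed instead.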
4. Semantic and Geometric Alignment
Cross-view semantic adaptation remains a central challenge due to drastic scene layout and occlusion disparities. CROVIA (Truong et al., 2023) addresses this via a Geometry-Constraint Cross-View (GeiCo) loss that enforces semantic and appearance similarity in a domain-conditioned bijective latent space, without requiring paired training images, and achieves significant mIoU improvements on SYNTHIA/UAVID and GTA5/UAVID benchmarks.
Semantic embedding is leveraged in map-level fusion, with UAV orthomaps providing elevation and class grids, and UGVs constructing dense LiDAR- or image-based semantic occupancy grids for probabilistic or keypoint-based fusion (Miller et al., 2022, Wang et al., 2023).
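One simple form of the probabilistic map-level fusion described above is a per-cell Bayesian update of class probabilities. The sketch below assumes that aerial and ground observations are conditionally independent given the true class, with a uniform prior. The cited systems may use different fusion rules, and the names and values here are illustrative.

```python
def fuse_class_probs(p_aerial, p_ground, eps=1e-9):
    """Fuse two per-class probability vectors for one grid cell (independence assumed)."""
    product = [max(a, eps) * max(g, eps) for a, g in zip(p_aerial, p_ground)]
    total = sum(product)
    return [value / total for value in product]

def fuse_grids(aerial_grid, ground_grid):
    """Apply the per-cell fusion over two aligned semantic grids of equal shape."""
    return [[fuse_class_probs(a, g) for a, g in zip(row_a, row_g)]
            for row_a, row_g in zip(aerial_grid, ground_grid)]

# Toy 1x2 grid with two classes, e.g. (road, vegetation).
aerial_grid = [[[0.6, 0.4], [0.5, 0.5]]]   # second cell: the UAV is uninformative
ground_grid = [[[0.7, 0.3], [0.2, 0.8]]]
fused_grid = fuse_grids(aerial_grid, ground_grid)
```

Agreeing observations sharpen the estimate (0.6 and 0.7 fuse to about 0.78), while an uninformative view leaves the other agent's belief unchanged. The grids must already be registered in a common frame for the cell-wise product to be meaningful.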
5. Infrastructure, Simulation, and Benchmarking Platforms
Simulation frameworks with strict air–ground synchrony and modality support are critical for generating robust, reproducible datasets:
- CARLA-Air (Zeng et al., 30 Mar 2026): Integrates urban driving (CARLA) and realistic multi-rotor flight (AirSim) in one Unreal Engine process, exposing up to 18 synchronized sensor modalities through Python and ROS2 APIs. Enables perfectly registered cross-view annotation and benchmarking for detection, segmentation, and domain adaptation studies.
- TranSimHub (Wang et al., 17 Oct 2025): Modular three-layer (environment, simulation/control, integration) platform with Gym/ROS interfaces, causal scene editor for counterfactual experimentation, and multi-modal cross-view rendering. Provides ground-truth and annotation pipelines for 2D/3D detection, segmentation, and re-identification tasks.
- OmniVLN (Liu et al., 18 Mar 2026): Implements omnidirectional 3D mapping and hierarchical scene graphs (dynamic scene graphs) for embodied navigation, merging panoramic LiDAR and vision and providing efficient token-prompt interfaces for LLM-based instruction following across both air and ground robots.
6. Communication, Privacy, and Edge Computing
In scalable deployments, cross-view perception must operate within computation, latency, and confidentiality constraints.
- 6G SAGIN-enabled split inference: Feature partitioning across satellites, UAVs, vehicles, RSUs, and edge servers enables joint optimization of latency, energy, and privacy. A DRL-based actor–critic policy balances the offloading of intermediate features (partition K) with privacy leakage (e.g., SSIM under inversion attacks falls from ~0.84 at shallow to ~0.02 at deep splits), achieving Recall@top1 = 86.88% and AP = 61.35% with 4 UAV+4 vehicle images in the University-1652 setting (Hao et al., 12 Mar 2026).
- Communication-efficient instance fusion: AGILE (Wang et al., 10 Mar 2025) demonstrates that attention-based instance-level fusion can greatly reduce bytes-per-second compared to full early fusion, offering competitive AP performance with improved scalability.
- Opportunistic distributed communication: Field-robust air–ground semantic mapping frameworks (e.g., Miller et al., 2022) communicate through ad-hoc mesh networks and “gossip” databases, with robots acting as data mules, which keeps the system viable in infrastructure-sparse environments.
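The latency versus privacy trade-off behind split inference can be made concrete with a toy cost model. The sketch below replaces the DRL actor–critic policy with a plain exhaustive search over candidate split points, purely to show the objective being balanced. All profile numbers are illustrative assumptions: only the endpoint leakage values echo the SSIM figures quoted above, and the function and field names are hypothetical.

```python
def choose_split(profiles, bandwidth_bps, w_latency=1.0, w_privacy=1.0):
    """Return (index, cost) of the split point with the lowest weighted cost.

    Each profile describes one candidate split: on-device compute time,
    edge compute time, size of the transmitted feature, and a leakage
    score in [0, 1] (e.g. SSIM of an inversion-attack reconstruction).
    """
    best = None
    for k, p in enumerate(profiles):
        transmit_s = 8.0 * p["feature_bytes"] / bandwidth_bps
        latency = p["device_s"] + transmit_s + p["edge_s"]
        cost = w_latency * latency + w_privacy * p["leakage"]
        if best is None or cost < best[1]:
            best = (k, cost)
    return best

# Hypothetical profiles for shallow, middle, and deep splits of one network.
profiles = [
    {"device_s": 0.01, "edge_s": 0.09, "feature_bytes": 1_000_000, "leakage": 0.84},
    {"device_s": 0.04, "edge_s": 0.06, "feature_bytes": 250_000, "leakage": 0.30},
    {"device_s": 0.50, "edge_s": 0.01, "feature_bytes": 50_000, "leakage": 0.02},
]

balanced = choose_split(profiles, bandwidth_bps=20e6)
privacy_first = choose_split(profiles, bandwidth_bps=20e6, w_privacy=3.0)
```

With equal weights the middle split wins under these numbers. Tripling the privacy weight moves the choice to the deep split despite its higher on-device latency. A learned policy becomes useful when bandwidth and load vary over time, which this static search ignores.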
7. Practical Impact and Open Challenges
Advances in air–ground cross-view perception reduce occlusion sensitivity and extend spatial coverage for object detection, and they support robust localization in GNSS-denied or rural zones, cross-agent cooperative navigation, and semantic mapping. Integrating UAV (or satellite) bird’s-eye observations closes the visibility gaps of ground agents, especially in adverse weather, low-light, or obstructed settings (Gao et al., 24 Jun 2025, Wang et al., 10 Mar 2025). Cross-view methods are increasingly crucial for digital twin construction, embodied navigation, and agentic AI leveraging distributed infrastructure (Hao et al., 12 Mar 2026, Sharma et al., 7 Feb 2026).
Major outstanding challenges include scaling cross-view localization from city-block to map-level extent, adapting to dynamic scenes, transferring sim-trained models to operational domains, and fully utilizing multi-modal inputs (LiDAR, RGB, inertial, semantic) under real-time and privacy constraints. Future research avenues include foundation model distillation for efficient on-edge inference, joint space–air–ground world modeling (Sharma et al., 7 Feb 2026), geometry-regularized generative synthesis for scene augmentation, privacy-preserving inference under regulatory constraints, and the co-evolution of hardware and simulation ecosystems to broaden the available data, annotation, and evaluation coverage (Xu et al., 26 Oct 2025).