
UAVBench: UAV Benchmark Suites

Updated 21 November 2025
  • UAVBench is a standardized collection of benchmarks, datasets, and evaluation protocols designed to assess UAV sensing, intelligence, and autonomy tasks.
  • It addresses domain-specific challenges in video compression, spatial reasoning, tracking, and spectral sensing, with evaluation grounded in metrics such as BD-Rate, mAP, and IoU.
  • The benchmarks drive methodological advances by enabling reproducible comparisons and fostering robust AI model development for UAV applications.

Unmanned Aerial Vehicle Benchmarks (UAVBench) represent a diverse set of standardized benchmark suites, datasets, and evaluation protocols spanning core facets of UAV sensing, intelligence, and autonomy research. These resources enable rigorous, reproducible comparison of algorithms and systems for critical UAV tasks, including perception, video compression, spatial reasoning, embodied intelligence, visual localization, vision-language navigation, multi-agent decision-making, and more. UAVBench initiatives have been pivotal in driving forward both methodological advances and the development of robust, generalizable AI models for UAV platforms.

1. UAVBench in Video Compression

The UAVBench benchmark for video compression (Jia et al., 2023) addresses the unique statistical and geometric challenges posed by UAV-captured video, such as high-altitude motion blur, non-standard perspectives, and frequent fish-eye lens artifacts. The dataset consists of 14 video sequences of 100 frames each, covering indoor/outdoor scenes, diverse altitudes, day/night conditions, and varying crowd densities, extracted from the VisDrone-SOT, VisDrone-MOT, Corridor, and UAVDT_S datasets. Resolutions range from 640×352 to 2720×1520 at 24–30 Hz. Preprocessing ensures RGB linearity and spatial dimensions compatible with convolutional encoders.
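
Learned codecs typically require input dimensions divisible by the network's downsampling factor. The sketch below shows one common convention, replicate-padding each frame to a multiple of 64; the exact factor is an illustrative assumption, not a documented requirement of the benchmark.

```python
import numpy as np

def pad_to_multiple(frame: np.ndarray, multiple: int = 64) -> np.ndarray:
    """Replicate-pad an H x W x 3 RGB frame so both spatial dimensions are
    divisible by `multiple`, a common requirement of learned video codecs."""
    h, w = frame.shape[:2]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    return np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
```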

Compression performance is benchmarked under standardized coding configurations:

  • HEVC-SCC (HM-16.20+SCC-8.8, Main-RExt, Low-Delay-P, QP 30–42) for conventional hybrid block-based coding.
  • Learned codecs: OpenDVC (single-stage, MSE-trained, optical-flow-based) and MPAI-EEV (two-stage residual, in-loop restoration, improved motion compensation), both trained on Vimeo-90K without UAV-specific fine-tuning.

Rate–distortion (R-D) analysis employs bits-per-pixel (BPP), RGB-PSNR, and BD-Rate as core metrics. MPAI-EEV delivers ≈20% BD-Rate savings over OpenDVC and ≈15% over HEVC in outdoor scenes, but underperforms HEVC by +115% BD-Rate on indoor fish-eye footage, highlighting domain shift and prediction breakdown due to radial distortion.
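
For reference, the following is a minimal numpy sketch of the standard Bjøntegaard delta-rate (BD-Rate) computation used for such comparisons. It is not code from the benchmark itself, and the per-QP rate/PSNR arrays passed in are hypothetical measurements.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%): average bitrate change of `test` relative
    to `anchor` at equal PSNR, via cubic fits of log-rate against PSNR.
    Each argument is a sequence of per-QP R-D points (at least four)."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)   # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)   # mean log-rate difference
    return (np.exp(avg_diff) - 1.0) * 100.0  # percent rate change
```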

Key open problems identified:

  • Domain adaptation to drone image statistics.
  • Fish-eye–aware motion compensation.
  • Online, scene-adaptive model updating.
  • UAV-specific joint feature–preserving coding for downstream tasks (e.g., tracking).

2. UAVBench for Spatial Intelligence in Navigation and VLMs

The UAVBench/SpatialSky-Bench (Zhang et al., 17 Nov 2025) establishes a foundation for evaluating UAV-centric spatial intelligence, focusing on vision-language models (VLMs). It comprises two main categories: Environmental Perception (bounding box localization, color, distance/height estimation, spatial relationships, free-space, pointing) and Scene Understanding (captioning, functional reasoning, object counting, landing safety analysis), spanning 13 subtasks. The underlying dataset contains 1 million samples generated from UAV images augmented with LiDAR-derived depth, semantic masks, and pose metadata.

Evaluation protocols employ task-specific accuracy, IoU for bounding boxes (threshold 0.5), L₁ error for distances and heights, and BLEU plus GPT-4o open-ended scoring for captions and landing analysis. Benchmarking reveals that both closed- and open-source VLMs exhibit severe deficits in spatial localization (mIoU < 5%), depth reasoning, and aerial-specific captioning, while the specialized Sky-VLM model (based on Qwen2.5-VL-7B, with supervised pretraining and RL fine-tuning) outperforms GPT-5 by +139.6% in mean aggregate score.

Structured output formats (<box>, <point> tags), multi-modal supervision, and reinforcement learning over spatial tasks are highlighted as key ingredients for robust UAV spatial-VLM development.
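
As an illustration of how such structured outputs can be graded, the sketch below parses a hypothetical <box>x1,y1,x2,y2</box> span from a model response and applies the IoU-0.5 criterion. The comma-separated coordinate format and helper names are assumptions for illustration, not the benchmark's released parser.

```python
import re

BOX_RE = re.compile(r"<box>\s*([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\s*</box>")

def parse_box(text: str):
    """Extract the first <box>x1,y1,x2,y2</box> span from a VLM response.
    The comma-separated xyxy format is an assumption for illustration."""
    m = BOX_RE.search(text)
    return tuple(float(v) for v in m.groups()) if m else None

def iou(a, b):
    """Intersection-over-union of two xyxy boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def localization_correct(pred_text, gt_box, thr=0.5):
    """Count a localization answer as correct when IoU >= 0.5."""
    box = parse_box(pred_text)
    return box is not None and iou(box, gt_box) >= thr
```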

3. UAVBench for Tracking, Identification, and Human Understanding

Diverse UAVBench initiatives support multi-modal, multi-object tracking (MOT), UAV re-identification (ReID), and onboard human behavior understanding at scale.

  • WebUAV-3M (Zhang et al., 2022): 4,500 UAV video sequences with 3.3M annotated frames, 223 target object categories, and multi-modal (vision, language, audio) annotations. Algorithms are evaluated using OPE, cAUC, mAcc, and a scenario-based protocol (UTUSC) targeting low light, occlusion, small targets, fast motion, distortions, and adversarial attacks. Top SOTA trackers include AlphaRefine and KeepTrack, but performance markedly degrades under adverse scenarios and adversarial perturbations (a sketch of the underlying success-plot scoring follows this list).
  • UAV-ReID (Organisciak et al., 2021): Offers benchmarks for UAV re-identification with 61 unique UAVs and two settings: Temporally-Near and Big-to-Small (simulating short-term and extreme scale changes). Vision Transformers outperform CNNs under large scale variance (ViT-Base achieves mAP=46.5% in Big-to-Small), highlighting the need for global context aggregation in cross-altitude re-identification.
  • UAV-Human (Li et al., 2021): 67,428 multi-modal video sequences covering action recognition, pose estimation (17 keypoints), person ReID, and attribute recognition, with synchronized RGB, fisheye, and night-vision streams. GT-I3D models with guided transformer modules yield improved action recognition on heavily distorted fisheye sequences.
  • Anti-UAV (Jiang et al., 2021): Multi-modal (RGB/thermal) tracking benchmark with >580,000 boxes in 318 video pairs. The Dual-Flow Semantic Consistency (DFSC) loss enables enhanced class/instance-level modulation. Challenges include tiny, distant targets, frequent out-of-view periods, thermal crossover, and large modality gaps. DFSC yields ≈0.5–1.0% gains in mSA over standard fine-tuning.
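
For context, here is a minimal sketch of the plain success-plot AUC that underlies OPE-style tracking evaluation. WebUAV-3M's cAUC adds a correction that is not reproduced here, so this is illustrative rather than the official scoring code.

```python
import numpy as np

def success_auc(ious: np.ndarray, thresholds=np.linspace(0.0, 1.0, 21)) -> float:
    """OPE-style success score for one sequence: the fraction of frames whose
    predicted-vs-ground-truth box IoU exceeds each threshold, averaged over
    thresholds (area under the success plot). Averaging these per-sequence
    scores over a dataset gives the benchmark-level number."""
    curve = np.array([(ious >= t).mean() for t in thresholds])
    return float(curve.mean())
```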

4. UAVBench for Visual Localization and Reasoning

For GNSS-denied environments, the UAVBench/AnyVisLoc (Ye et al., 12 Mar 2025) suite provides systematic evaluation of absolute visual localization (AVL) using 18,000 UAV images at low altitudes (30–300 m) and variable pitch (20°–90°). Reference maps cover both high-resolution photogrammetry (GSD ≈ 0.07 m) and satellite imagery (GSD ≈ 0.197 m). The unified pipeline integrates global retrieval (CAMP, NetVLAD, etc.), geometric matching (Roma, SuperPoint+LightGlue), 3D lifting via digital surface models (DSM), and P3P+RANSAC pose estimation.

Performance is reported as A@5m/10m/20m accuracy; the best configuration (CAMP+Roma Top-5 reranking) achieves 74.1% A@5m with photogrammetric maps, but only 18.5% with satellite. The PDM@K metric quantifies retrieval–localization coupling, penalizing spatial mismatch proportionally to impact on pose accuracy.
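
The following is a minimal sketch of the A@5m/10m/20m computation as described above, assuming predicted and ground-truth positions are available in a common metric frame; it is illustrative, not the suite's own evaluation script.

```python
import numpy as np

def localization_accuracy(pred_xyz: np.ndarray, gt_xyz: np.ndarray,
                          radii=(5.0, 10.0, 20.0)) -> dict:
    """A@5m/10m/20m: fraction of queries whose estimated position lies within
    each radius (metres) of ground truth. Both arrays are N x 3 positions in
    the same metric coordinate frame."""
    err = np.linalg.norm(pred_xyz - gt_xyz, axis=1)
    return {f"A@{int(r)}m": float((err <= r).mean()) for r in radii}
```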

Critical factors include viewpoint/pitch, altitude, map modality, and prior information noise, exposing the brittle nature of standard retrieval methods under aerial-specific observation geometries.

5. UAVBench for Embodied Intelligence and Agentic AI

Agent-centric UAVBench initiatives advance the evaluation of embodied intelligence, interactive planning, and chain-of-task reasoning in hybrid real–virtual settings.

  • BEDI (Guo et al., 23 May 2025): Formalizes UAV autonomy as a dynamic perception–decision–action MDP, decomposing missions into measurable loops. Five core subskills (semantic perception, spatial perception, motion control, tool utilization, task planning) are each benchmarked with custom static and dynamic tasks using multi-modal Unreal Engine 4/AirSim environments and open HTTP APIs. Evaluation includes discrete accuracy, clock-face direction estimation, and open-ended GPT-4o–scored planning.
  • AeroVerse (Yao et al., 28 Aug 2024): Supplies pretraining corpora (AerialAgent-Ego10k, CyberAgent-Ego500k) and five instruction-tuning datasets for 2D/3D vision-language models, with GPT-4-based SkyAgent-Eval. Downstream tasks include multi-view scene awareness, spatial reasoning, navigational exploration, plan generation, and closed-loop action in photorealistic large-scale urban simulators. Results show that planning and motion decision-making remain unsolved; 3D-LLMs generalize poorly to outdoor settings, and 2D-VLMs require multi-view context to avoid hallucination.
  • LLM-Driven Scenario Reasoning: UAVBench/UAVBench_MCQ (Ferrag et al., 14 Nov 2025) provides 50,000 LLM-generated, safety-validated flight scenarios in JSON, with a parallel MCQ suite probing ten cognitive styles (e.g., aerodynamics, navigation, policy, multi-agent coordination, ethics). Risk labeling and simulation constraints are embedded. LLMs achieve ≈90% on perception/physics domains but <80% on multi-agent/energy tasks; long-range ethical and resource-constrained decision-making remain persistent gaps (an illustrative MCQ-scoring sketch follows this list).
  • Vision-Language Navigation: OpenUAV (Wang et al., 9 Oct 2024) underpins a 12k-trajectory benchmark for UAV vision-language navigation with full 6-DoF physics and multi-view input. Hierarchical MLLM-based waypoint and fine path predictors outperform classical baselines (SR=16.1% vs. <9% for CMA), but human pilots perform substantially better. Challenges persist in long-horizon planning and unseen scene generalization.
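
To make the MCQ protocol concrete, the sketch below grades one scenario item and aggregates accuracy per cognitive style. The JSON field name ("answer") and the exact-match grading are hypothetical, since the released schema is not reproduced here.

```python
import json

def score_mcq(scenario_json: str, model_answer: str) -> bool:
    """Grade one multiple-choice item drawn from a JSON flight scenario.
    Assumes (hypothetically) that the item stores the correct option letter
    under an "answer" key; grading is exact match on that letter."""
    item = json.loads(scenario_json)
    return model_answer.strip().upper() == item["answer"].strip().upper()

def per_style_accuracy(results) -> dict:
    """Aggregate accuracy per cognitive style.
    `results` is an iterable of (style, is_correct) pairs."""
    totals, correct = {}, {}
    for style, ok in results:
        totals[style] = totals.get(style, 0) + 1
        correct[style] = correct.get(style, 0) + int(ok)
    return {s: correct[s] / totals[s] for s in totals}
```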

6. UAVBench for Spectral Remote Sensing and Detection

UAVBench (Lekhak et al., 3 Oct 2025) extends to spectral intelligence with a large, open VNIR hyperspectral dataset for landmine/UXO detection. Data are acquired with a Headwall Nano-Hyperspec sensor (~0.07 m GSD, 398–1002 nm) over a 143-object test field, fully georeferenced via RTK/GCPs. Radiometric calibration is enforced by a two-point empirical line method against SVC spectroradiometer panel measurements. Validation shows RMSE < 1% and SAM of 1°–6°, supporting research in spectral target and anomaly detection as well as multi-modal fusion (HSI+EMI). Benchmark protocols recommend reporting RMSE/SAM against all 143 ground-truth objects, along with careful documentation of data splits for reproducibility.
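
The following is a minimal sketch of the two validation metrics named above (RMSE between calibrated and panel-reference reflectance, and the Spectral Angle Mapper in degrees), assuming both spectra are sampled on the same band grid; it is illustrative, not the dataset's own validation code.

```python
import numpy as np

def rmse(measured: np.ndarray, reference: np.ndarray) -> float:
    """Root-mean-square error between calibrated and reference reflectance
    spectra (both 1-D arrays on the same band grid)."""
    return float(np.sqrt(np.mean((measured - reference) ** 2)))

def spectral_angle_deg(measured: np.ndarray, reference: np.ndarray) -> float:
    """Spectral Angle Mapper (SAM) between two spectra, in degrees."""
    cos = np.dot(measured, reference) / (
        np.linalg.norm(measured) * np.linalg.norm(reference))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```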

7. Significance and Open Problems

UAVBench frameworks collectively address major gaps in evaluation for:

  • UAV-specific artifacts (motion blur, fish-eye, scale/extreme angle challenges).
  • UAV spatial reasoning (3D localization, landing safety, dynamic task decomposition).
  • Embodied agent and VLM alignment under resource constraints and uncertainties.
  • Generalization across aerial image domains, tasks, and sensor modalities.
  • Integration with physically grounded simulation, risk quantification, scenario diversity, and multi-modal data streams.

Major open directions include domain-adaptive and transfer learning for compression and perception, robust spatially grounded VLMs, first-person egocentric agent models, memory-enhanced and hierarchical planners, and harmonized multi-sensor (RGB, HSI, EMI, LiDAR) fusion for detection and decision making. Benchmarks such as UAVBench now provide the concrete measurement tools and data to drive progress in these domains.
