MM-UAVBench: Multimodal UAV Benchmark Suite
- MM-UAVBench is a comprehensive suite of benchmarks and datasets assessing UAV multimodal perception, cognition, planning, tracking, and registration.
- It covers diverse tasks such as multimodal image registration, 3D perception from synthetic multi-sensor data, and robust tracking under challenging conditions.
- Evaluations use metrics like mAP, MOTA, and IoU to deliver actionable insights for improving UAV autonomy and multi-agent coordination.
MM-UAVBench encompasses a family of benchmarks and datasets targeting the multimodal perception, cognition, planning, registration, tracking, and embodied intelligence of Unmanned Aerial Vehicles (UAVs) under complex, real-world and simulated conditions. It includes comprehensive collections for evaluating state-of-the-art models and algorithms on key challenges unique to low-altitude aerial platforms, such as multimodal fusion, small-object detection, multi-agent collaboration, navigation, and robust multi-view reasoning. Several distinct benchmarks bear the MM-UAVBench name or serve closely related roles, notably those in perception/cognition/planning for LLMs, multimodal image registration, 3D perception, multi-UAV task assignment, and tracking. Each addresses a critical methodological and application gap in UAV research.
1. Benchmark Taxonomy and Objectives
MM-UAVBench benchmarks are designed to systematically measure fundamental capabilities of models in UAV-centric multimodal scenarios, organized as follows:
- Perception, Cognition, Planning for Multimodal LLMs: Evaluation of MLLMs on ~5,700 Q&A pairs covering perception (scene, object, state classification; OCR; counting), cognition (cross-object reasoning, intent prediction, damage assessment, event tracing), and planning (swarm or air-ground collaborative).
- Multimodal Image Registration: Pixel-level alignment and fusion of visible and IR UAV imagery under varying flight attitudes, weather, and illumination.
- 3D Multi-UAV Perception: Simulation-based evaluation of model performance on detection, tracking, pose estimation, and short-term trajectory forecasting, enabled by rich synthetic multi-sensor streams (RGB, IR, LiDAR, radar, DVS).
- Multi-UAV Task Assignment: Benchmarks for combinatorial assignment and path planning under constrained team orienteering, implemented with metaheuristics.
- Robust Multi-Modal Tracking and Detection: Real and synthetic datasets for visual or thermal detection/tracking, cross-modality analysis, and multi-modal fusion.
These resources collectively address deficits in prior benchmarks: lack of multimodal coverage, absence of unified cognitive/planning evaluation, insufficient small-target or multi-agent data, and limited robustness to environmental variability (Dai et al., 29 Dec 2025, Bin et al., 28 Jul 2025, Zou et al., 27 Nov 2025, Zhao et al., 8 Mar 2025).
2. Core Datasets and Annotation Protocols
A spectrum of data sources and annotation paradigms is employed for MM-UAVBench:
- Low-Altitude Perception/Cognition/Planning (Dai et al., 29 Dec 2025):
- 1,549 UAV video clips plus 2,873 images drawn from diverse public datasets (VisDrone, ERA, AIDER, UAVid, etc.).
- 5,702 MCQs spanning 19 subtasks; 7,496 bounding boxes covering humans, objects, and regions (averaging 0.2–4.5% of image area).
- Human-in-the-loop annotation with hard-distractor synthesis; for counting/damage tasks, distractors are generated by perturbing the numeric ground truth. A schematic sample record is sketched below.
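As a concrete illustration of how such a sample might be represented, the record below uses hypothetical field names, not the benchmark's released schema:

```python
# Hypothetical MCQ sample record; field names and values are illustrative only.
sample = {
    "clip_id": "visdrone_000123",              # source clip or image identifier
    "subtask": "perception/counting",          # one of the 19 subtasks
    "question": "How many vehicles are visible in the parking area?",
    "options": {"A": "3", "B": "5", "C": "7", "D": "9"},  # hard distractors
    "answer": "B",
    "bboxes": [                                # grounding boxes, normalized xywh
        {"label": "vehicle", "box": [0.41, 0.52, 0.03, 0.02]},
    ],
}
```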
- Multimodal Registration (Bin et al., 28 Jul 2025):
- 7,969 triplets of raw visible (1920×1080), infrared (640×512), and registered visible images (warped into the IR frame).
- Six attribute labels per sample: altitude (80–300 m), camera angle (0–75°), time of day (dawn to night), weather (sun, rain, fog), illumination, and scene type (11 categories).
- Four-stage annotation: keyframe selection, manual synchronization, coarse warping, and fine automatic refinement (SIFT/ORB + RANSAC, mutual information, optical flow); a minimal coarse-alignment sketch follows.
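The coarse-warping stage can be pictured with a minimal OpenCV sketch (ORB features plus a RANSAC homography). File names and parameters are placeholders, and the benchmark's actual pipeline adds manual synchronization and mutual-information/optical-flow refinement on top of this step:

```python
import cv2
import numpy as np

# Load a visible/IR keyframe pair (paths are placeholders).
vis = cv2.imread("visible_1920x1080.png", cv2.IMREAD_GRAYSCALE)
ir = cv2.imread("infrared_640x512.png", cv2.IMREAD_GRAYSCALE)

# Detect and describe keypoints in both modalities.
orb = cv2.ORB_create(nfeatures=4000)
kp_v, des_v = orb.detectAndCompute(vis, None)
kp_i, des_i = orb.detectAndCompute(ir, None)

# Match binary ORB descriptors with Hamming distance, keep the best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_v, des_i), key=lambda m: m.distance)[:500]

# Estimate a homography with RANSAC and warp the visible image into the IR frame.
src = np.float32([kp_v[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_i[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
registered_vis = cv2.warpPerspective(vis, H, (ir.shape[1], ir.shape[0]))
```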
- Synthetic 3D Multi-Modal (Zou et al., 27 Nov 2025):
- 400,000 frames from Unreal Engine/Carla simulations across urban, suburban, forest, coastal scenes.
- 5 modalities: RGB, IR, LiDAR (256-ch), radar, DVS.
- Per-frame annotations: 2D/3D boxes, 6-DoF pose (translation, quaternion/Euler), instance IDs.
- Train/Val/Test splits (70/15/15%), up to 7 UAVs per scene; ≈2 million 3D boxes in total. A schematic annotation record is sketched below.
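A schematic per-frame annotation container under the description above; the dataclasses are illustrative and do not reproduce the dataset's released file format:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UAVAnnotation:
    """Illustrative per-object annotation for one synthetic frame."""
    instance_id: int                                   # persistent ID across frames
    box_2d: Tuple[float, float, float, float]          # x1, y1, x2, y2 in pixels
    box_3d: Tuple[float, ...]                          # center (x, y, z), size (l, w, h), yaw
    translation: Tuple[float, float, float]            # 6-DoF pose: position in meters
    rotation_quat: Tuple[float, float, float, float]   # 6-DoF pose: orientation (w, x, y, z)

@dataclass
class FrameRecord:
    frame_id: int
    scene: str                     # e.g., "urban", "suburban", "forest", "coastal"
    modalities: List[str]          # subset of {"RGB", "IR", "LiDAR", "radar", "DVS"}
    objects: List[UAVAnnotation]
```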
- Task Assignment (Xiao et al., 2020):
- Randomized instances on directed graphs (up to 15 UAVs, 90 targets), each with service times, rewards, and travel-time constraints; an illustrative instance generator is sketched below.
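A minimal sketch of how such an instance could be sampled; the sizes, speeds, and distributions are illustrative assumptions rather than the benchmark's exact generation protocol:

```python
import numpy as np

def random_instance(n_uavs=15, n_targets=90, time_budget=300.0, seed=0):
    """Sample an illustrative multi-UAV team-orienteering instance."""
    rng = np.random.default_rng(seed)
    coords = rng.uniform(0, 1000, size=(n_targets + 1, 2))   # index 0 = depot
    rewards = rng.uniform(1, 10, size=n_targets + 1)
    rewards[0] = 0.0                                          # no reward at the depot
    service = rng.uniform(5, 20, size=n_targets + 1)          # service times (s)
    service[0] = 0.0
    # Travel time proportional to Euclidean distance (symmetric here for
    # simplicity, although the benchmark instances live on directed graphs).
    speed = 10.0                                              # assumed cruise speed, m/s
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    travel = dists / speed
    return dict(rewards=rewards, service=service, travel=travel,
                n_uavs=n_uavs, time_budget=time_budget)
```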
- Tracking/Detection (Jiang et al., 2021, Zhang et al., 2022, Isaac-Medina et al., 2021):
- Up to 3.3M annotated frames over 4,500 videos, 223 classes, multimodal (RGB-IR), per-frame and per-video attributes.
- Semi-automatic annotation (SiamRPN++ with human correction; SATA averages 2.99 s/bbox).
- Multi-modal, multi-class, long-sequence, adversarial or scenario-constraint splits.
3. Evaluation Tasks and Metrics
MM-UAVBench covers a comprehensive suite of tasks:
- MLLM General Intelligence (Dai et al., 29 Dec 2025):
- Perception: scene/object classification, orientation, environment state, OCR, counting.
- Cognition: object/scene/event reasoning (backtracking, intent, damage, flow prediction, event tracing, temporal ordering).
- Planning: UAV-to-UAV coordination, air-ground route planning, multi-agent coverage (a minimal MCQ-accuracy scoring sketch follows this list).
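Since the benchmark is multiple-choice, scoring reduces to per-subtask accuracy; a minimal sketch, assuming records shaped like the hypothetical sample in Section 2:

```python
from collections import defaultdict

def score_by_subtask(records, predictions):
    """Per-subtask MCQ accuracy. `records` carry 'subtask' and 'answer' fields
    (hypothetical names); `predictions` maps sample index -> chosen option letter."""
    correct, total = defaultdict(int), defaultdict(int)
    for i, rec in enumerate(records):
        total[rec["subtask"]] += 1
        correct[rec["subtask"]] += int(predictions.get(i) == rec["answer"])
    return {task: correct[task] / total[task] for task in total}
```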
- Registration (Bin et al., 28 Jul 2025):
- Pixel-wise alignment error: $E = \frac{1}{N}\sum_{i=1}^{N}\lVert \hat{p}_i - p_i \rVert_2$, the mean Euclidean distance (in pixels) between warped and ground-truth correspondence points.
- Mask overlap: $\mathrm{IoU} = \frac{|M_{\mathrm{pred}} \cap M_{\mathrm{gt}}|}{|M_{\mathrm{pred}} \cup M_{\mathrm{gt}}|}$ between warped and reference masks.
- Success rate at a pixel threshold, e.g., the fraction of pairs with $E \le 1$ px. A minimal implementation of these metrics is sketched below.
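A minimal NumPy implementation of the three registration metrics as defined above, assuming predicted/ground-truth correspondence points and binary overlap masks are available:

```python
import numpy as np

def alignment_error(pred_pts, gt_pts):
    """Mean Euclidean distance (pixels) between warped and ground-truth points."""
    return float(np.linalg.norm(pred_pts - gt_pts, axis=-1).mean())

def mask_iou(pred_mask, gt_mask):
    """Intersection-over-union of two binary overlap masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union)

def success_rate(errors, threshold=1.0):
    """Fraction of image pairs whose alignment error falls below the threshold."""
    errors = np.asarray(errors)
    return float((errors <= threshold).mean())
```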
- 3D Perception (Zou et al., 27 Nov 2025):
- Detection: mAP computed on 3D IoU at fixed thresholds.
- Pose: translation error (m), rotation error (°), and size error.
- Multi-object tracking: MOTA, MOTP, HOTA, IDF1.
- Trajectory forecasting: ADE, FDE (a minimal implementation is sketched below).
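ADE and FDE have standard definitions; a minimal sketch for a single forecast horizon, assuming (T, 3) arrays of future positions in meters:

```python
import numpy as np

def ade_fde(pred_traj, gt_traj):
    """Average and final displacement error for one forecast horizon.
    Both inputs are (T, 3) arrays of predicted/ground-truth future positions."""
    dists = np.linalg.norm(pred_traj - gt_traj, axis=-1)
    return float(dists.mean()), float(dists[-1])
```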
- Task Assignment (Xiao et al., 2020):
- Total reward and computation time; solution encoding with flow- and time-consistency constraints. A reward/feasibility scoring sketch follows this item.
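A sketch of reward scoring with a time-budget feasibility check, reusing the illustrative instance format from Section 2; the encoding here is a simple per-UAV target sequence and is not necessarily the benchmark's exact solution representation:

```python
def evaluate_solution(routes, instance):
    """Total reward of `routes` (one target sequence per UAV, depot implicit),
    or -inf if any route exceeds the time budget or a target is revisited."""
    rewards, service = instance["rewards"], instance["service"]
    travel, budget = instance["travel"], instance["time_budget"]
    visited, total_reward = set(), 0.0
    for route in routes:
        t, prev = 0.0, 0                      # start at the depot (index 0)
        for target in route:
            if target in visited:
                return float("-inf")          # each target served at most once
            t += travel[prev, target] + service[target]
            total_reward += rewards[target]
            visited.add(target)
            prev = target
        t += travel[prev, 0]                  # return to the depot
        if t > budget:
            return float("-inf")              # time-consistency violated
    return total_reward
```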
- Tracking/Detection (Isaac-Medina et al., 2021, Zhang et al., 2022, Jiang et al., 2021):
- Detection: COCO-style mAP at fixed IoU thresholds, success plots, and center-error precision.
- Tracking: MOTA, state accuracy (SA), mean accuracy (mAcc), cAUC (success AUC computed with complete IoU), and normalized precision; a sketch of the success/precision protocol follows this item group.
- Scenario constraints: occlusion, low-light, small targets, distortions, adversarial attacks.
- Cross-modality: IR→RGB, RGB→IR transfer analysis.
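The success-plot and center-error-precision protocol can be sketched as follows; ordinary box IoU is used here, whereas cAUC additionally relies on complete IoU, which is omitted for brevity:

```python
import numpy as np

def box_iou(a, b):
    """IoU of axis-aligned (x1, y1, x2, y2) boxes; a and b have shape (N, 4)."""
    x1 = np.maximum(a[:, 0], b[:, 0])
    y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 2], b[:, 2])
    y2 = np.minimum(a[:, 3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Area under the success plot: mean fraction of frames whose IoU
    exceeds each overlap threshold."""
    ious = box_iou(pred_boxes, gt_boxes)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def center_precision(pred_boxes, gt_boxes, pixel_threshold=20.0):
    """Fraction of frames whose predicted box center lies within
    `pixel_threshold` pixels of the ground-truth center."""
    pc = (pred_boxes[:, :2] + pred_boxes[:, 2:]) / 2.0
    gc = (gt_boxes[:, :2] + gt_boxes[:, 2:]) / 2.0
    return float((np.linalg.norm(pc - gc, axis=1) <= pixel_threshold).mean())
```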
4. Baseline Methods and Key Results
The MM-UAVBench suite provides extensive baseline evaluations:
- MLLM Benchmarks (Dai et al., 29 Dec 2025):
- Human performance: 80.4% average; the best proprietary and open-source models reach 54.6% and 55.4%, respectively.
- Perception: 50–82%; cognition: 30–89%; planning: 22–51%.
- Bottlenecks: fine-grained counting (20–36%), small-object reasoning, multi-view fusion (where adding views yields a negative accuracy delta), and egocentric planning.
- Registration (Bin et al., 28 Jul 2025):
- Empirical: E < 1 px, IoU > 0.95 for aligned pairs.
- Example algorithms (illustrative): SuperGlue yields |E|=1.1±0.3 px, IoU=0.96, 91% success@1px.
- 3D Perception (Zou et al., 27 Nov 2025):
- LGFusionNet: rotation error reduced from 17.84° to 10.57°, position error from 4.92 m to 3.64 m, and 3D detection mAP improved from 17.38 to 26.72.
- Trajectory: ADE/FDE at 1 s, 3 s, and 5 s forecast horizons consistently improve over Kalman-filter and LSTM baselines.
- Task Assignment (Xiao et al., 2020):
- Metaheuristic baselines are compared by total reward and computation time on instances of up to 15 UAVs and 90 targets (see Sections 2 and 3).
- Tracking/Detection (Jiang et al., 2021, Zhang et al., 2022, Isaac-Medina et al., 2021):
- Detection: mAP up to 0.986 (Anti-UAV RGB, SSD512), cross-modality IR→RGB mAP of 0.828 (Faster RCNN), RGB→IR weaker (0.644).
- Tracking: Tracktor best MOTA (up to 0.987), KeepTrack/AlphaRefine lead cAUC/mAcc for robust tracking.
- Language/audio cues: current vision-language trackers gain roughly 2–4% AUC/cAUC but still trail pure-vision trackers by ~10%.
- Transformer-based trackers (e.g., TransT) are the most robust under long occlusion; adversarial attacks degrade performance most severely.
5. Methodological Innovations
- Question generation pipeline (UrbanVideo-Bench/MLLMs) (Zhao et al., 8 Mar 2025, Dai et al., 29 Dec 2025):
- Chain-of-thought (CoT) prompting with Gemini-1.5 Flash.
- Multi-stage curation: narration generation, structured object/motion extraction, MCQ drafting, blind filtering, and human refinement (~800 h); a pipeline skeleton is sketched below.
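A skeleton of the curation flow; the `generate` callable is a hypothetical stand-in for any LLM client, and the prompts compress the papers' multi-stage descriptions:

```python
def curate_mcqs(video_clip, generate):
    """Sketch of multi-stage MCQ curation: narration -> structured extraction ->
    MCQ drafting; blind filtering and human refinement happen downstream."""
    narration = generate(f"Describe the UAV video step by step:\n{video_clip}")
    structure = generate(
        "Extract objects, motions, and events as JSON from this narration:\n"
        + narration)
    draft_mcqs = generate(
        "Draft multiple-choice questions with hard distractors from:\n" + structure)
    return draft_mcqs   # passed on to blind filtering and human refinement
```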
- Semantic Modulation for Single-Class Tracking (DFSC) (Jiang et al., 2021):
- Dual-flow consistency: cross-sequence class-level modulation and same-sequence instance-level refinement in a two-stage Siamese RPN+RCNN framework.
- Training-only overhead; inference matches original tracker speed.
- Multimodal Fusion (Zou et al., 27 Nov 2025, Isaac-Medina et al., 2021):
- LGFusionNet (LiDAR-guided cross-branch alignment, KNN aggregation); a schematic of the KNN aggregation idea is sketched below.
- Radar/depth projection fusion with 2D/3D detection heads.
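The KNN aggregation idea behind LiDAR-guided fusion can be pictured schematically; this is a hedged sketch of the general mechanism, not the published LGFusionNet architecture, and the tensor shapes, k value, and mean pooling are assumptions:

```python
import torch

def knn_aggregate(lidar_xyz, lidar_feat, query_xyz, k=8):
    """For each query point (e.g., an image feature location lifted to 3D),
    average the features of its k nearest LiDAR points.
    lidar_xyz: (N, 3), lidar_feat: (N, C), query_xyz: (M, 3) -> (M, C)."""
    dists = torch.cdist(query_xyz, lidar_xyz)            # (M, N) pairwise distances
    idx = dists.topk(k, largest=False).indices           # (M, k) nearest neighbours
    neighbours = lidar_feat[idx]                          # (M, k, C) gathered features
    return neighbours.mean(dim=1)                         # simple mean aggregation

# Illustrative usage with random tensors.
agg = knn_aggregate(torch.rand(4096, 3), torch.rand(4096, 64), torch.rand(256, 3))
```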
- Scenario-constraint Annotation (Zhang et al., 2022):
- Seven frame-wise difficulty indicators (low-light, occlusion, small target, adversarial, etc.).
- UTUSC subtest protocol for stress-testing tracker generalization; a filtering sketch follows.
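Scenario-constraint subtests amount to filtering sequences or frames by their difficulty indicators; a minimal sketch with illustrative attribute names (the actual indicator set and UTUSC protocol are defined by the benchmark):

```python
def build_subtest(frames, required_attrs=("low_light", "small_target")):
    """Select frames whose per-frame difficulty indicators include all
    required attributes; each frame dict carries an 'attributes' set."""
    return [f for f in frames if set(required_attrs) <= set(f["attributes"])]

# Illustrative usage.
frames = [
    {"frame_id": 0, "attributes": {"low_light", "small_target"}},
    {"frame_id": 1, "attributes": {"occlusion"}},
]
subset = build_subtest(frames)          # -> keeps only frame 0
```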
6. Limitations and Future Directions
All MM-UAVBench resources recognize significant current limitations and chart explicit directions for extension:
- Data Scale: Most datasets (except tracking benchmarks) fall below the million-frame scale, potentially limiting deep network training.
- Modality diversity: Some benchmarks restricted to RGB/IR; missing LiDAR, radar, audio, etc.
- Metrics: Several do not include path efficiency, continuous spatial errors, or closed-loop control metrics.
- Task realism: Static MCQ benchmarks, absence of agent-environment feedback.
- Environmental variation: City diversity, weather, night/dusk conditions, and dynamic airspace constraints still limited.
- Model robustness: MLLMs exhibit spatial bias and fail on small objects and multi-view fusion (the “1+1<2” effect).
Future work targets larger-scale, richer-modality datasets; continuous-action RL and regression-style metrics; multi-agent coordination and dynamic obstacle avoidance; closed-loop evaluation infrastructure; stronger vision-language-audio fusion; and explicit geometric priors in MLLM training (Dai et al., 29 Dec 2025, Zhao et al., 8 Mar 2025, Bin et al., 28 Jul 2025, Zou et al., 27 Nov 2025, Zhang et al., 2022).
7. Impact and Significance in UAV AI Research
MM-UAVBench has established a baseline for quantitative, unified multimodal evaluation in UAV intelligence. By covering perception, registration, reasoning, planning, and robust multimodal tracking, it enables targeted diagnosis of failure modes—such as fine-grained counting, fusion deficits, and small-scale objects—across leading architectures, and fosters repeatable benchmarking for model innovation. The suite is foundational for the next generation of autonomous aerial systems, multimodal embodied agents, and robust multi-UAV coordination.
For original datasets, code, baselines, and further results, see (Dai et al., 29 Dec 2025, Zhao et al., 8 Mar 2025, Bin et al., 28 Jul 2025, Zou et al., 27 Nov 2025, Zhang et al., 2022, Jiang et al., 2021, Isaac-Medina et al., 2021, Xiao et al., 2020).