MM-UAVBench: Multimodal UAV Benchmark Suite

Updated 5 January 2026
  • MM-UAVBench is a comprehensive suite of benchmarks and datasets assessing UAV multimodal perception, cognition, planning, tracking, and registration.
  • It covers diverse tasks such as multimodal image registration, 3D perception from synthetic multi-sensor data, and robust tracking under challenging conditions.
  • Evaluations use metrics like mAP, MOTA, and IoU to deliver actionable insights for improving UAV autonomy and multi-agent coordination.

MM-UAVBench encompasses a family of benchmarks and datasets targeting the multimodal perception, cognition, planning, registration, tracking, and embodied intelligence of Unmanned Aerial Vehicles (UAVs) under complex real-world and simulated conditions. It includes comprehensive collections for evaluating state-of-the-art models and algorithms on challenges unique to low-altitude aerial platforms, such as multimodal fusion, small-object detection, multi-agent collaboration, navigation, and robust multi-view reasoning. Several distinct benchmarks bear the MM-UAVBench name or serve closely related roles, notably in perception/cognition/planning for multimodal LLMs, multimodal image registration, 3D perception, multi-UAV task assignment, and tracking. Each addresses a critical methodological and application gap in UAV research.

1. Benchmark Taxonomy and Objectives

MM-UAVBench benchmarks are designed to systematically measure fundamental capabilities of models in UAV-centric multimodal scenarios, organized as follows:

  • Perception, Cognition, Planning for Multimodal LLMs: Evaluation of MLLMs on ~5,700 Q&A pairs covering perception (scene, object, state classification; OCR; counting), cognition (cross-object reasoning, intent prediction, damage assessment, event tracing), and planning (swarm or air-ground collaborative).
  • Multimodal Image Registration: Pixel-level alignment and fusion of visible and IR UAV imagery under varying flight attitudes, weather, and illumination.
  • 3D Multi-UAV Perception: Simulation-based evaluation of model performance on detection, tracking, pose estimation, and short-term trajectory forecasting, enabled by rich synthetic multi-sensor streams (RGB, IR, LiDAR, radar, DVS).
  • Multi-UAV Task Assignment: Benchmarks for combinatorial assignment and path planning under constrained team orienteering, implemented with metaheuristics.
  • Robust Multi-Modal Tracking and Detection: Real and synthetic datasets for visual or thermal detection/tracking, cross-modality analysis, and multi-modal fusion.

These resources collectively address deficits in prior benchmarks: lack of multimodal coverage, absence of unified cognitive/planning evaluation, insufficient small-target or multi-agent data, and limited robustness to environmental variability (Dai et al., 29 Dec 2025, Bin et al., 28 Jul 2025, Zou et al., 27 Nov 2025, Zhao et al., 8 Mar 2025).

2. Core Datasets and Annotation Protocols

A spectrum of data sources and annotation paradigms is employed for MM-UAVBench:

  • Low-Altitude Perception/Cognition/Planning (Dai et al., 29 Dec 2025):
    • 1,549 UAV video clips + 2,873 images drawn from diverse public datasets (VisDrone, ERA, AIDER, UAVid, etc.).
    • 5,702 MCQs across 19 subtasks; 7,496 bounding boxes covering human, object, and region targets (averaging 0.2–4.5% of image area).
    • Human-in-the-loop annotation with hard-distractor synthesis; numeric ground-truth perturbation generates distractors for counting/damage tasks (a toy perturbation sketch follows this list).
  • Multimodal Registration (Bin et al., 28 Jul 2025):
    • 7,969 triplets of raw visible (1920×1080), infrared (640×512), and registered visible images (warped to the IR frame).
    • Six attribute labels per sample: altitude (80–300 m), camera angle (0–75°), time of day (dawn to night), weather (sun, rain, fog), illumination, and scene type (11 categories).
    • Four-stage annotation: keyframe selection, manual synchronization, coarse warping, and fine automatic refinement (SIFT/ORB+RANSAC, mutual information, optical flow); a coarse ORB+RANSAC sketch follows this list.
  • Synthetic 3D Multi-Modal (Zou et al., 27 Nov 2025):
    • 400,000 frames from Unreal Engine/Carla simulations across urban, suburban, forest, coastal scenes.
    • 5 modalities: RGB, IR, LiDAR (256-ch), radar, DVS.
    • Per-frame annotations: 2D/3D boxes, 6-DoF pose (translation, quaternion/Euler), instance IDs.
    • Train/Val/Test splits (70/15/15%), up to 7 UAVs per scene; ≈2 million 3D boxes total.
  • Task Assignment (Xiao et al., 2020):
    • Randomized instances on directed graphs (up to 15 UAVs, 90 targets), each with service times, rewards, and travel-time constraints.
  • Tracking/Detection (Jiang et al., 2021, Zhang et al., 2022, Isaac-Medina et al., 2021):
    • Up to 3.3M annotated frames over 4,500 videos, 223 classes, multimodal (RGB-IR), per-frame and per-video attributes.
    • Semi-automatic annotation (SiamRPN++ with human correction; SATA, 2.99 s/bbox).
    • Multi-modal, multi-class, long-sequence, adversarial or scenario-constraint splits.
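
The numeric ground-truth perturbation used to synthesize hard distractors for counting/damage MCQs can be illustrated with a short sketch. This is a minimal, assumed implementation: the offset magnitudes, the 15% scale factor, and the clipping rule are illustrative choices, not the benchmark's documented protocol.

```python
import random

def numeric_distractors(gt_count: int, n_options: int = 4, seed: int = 0) -> list[int]:
    """Build an MCQ option list by perturbing a numeric ground truth.

    The true count is always included; the remaining options are
    plausible near-miss values obtained by adding small signed offsets
    that scale with the magnitude of the answer.
    """
    rng = random.Random(seed)
    step = max(1, round(0.15 * gt_count))     # perturbation scale (illustrative choice)
    options = {gt_count}
    while len(options) < n_options:
        delta = rng.choice([-2, -1, 1, 2]) * step
        candidate = max(0, gt_count + delta)  # counts cannot go negative
        if candidate in options:
            step += 1                         # widen the offsets if we keep colliding
        else:
            options.add(candidate)
    return sorted(options)

# Example: a ground-truth count of 12 vehicles yields four options that include 12.
print(numeric_distractors(12))
```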
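
The coarse-warping stage of the four-stage registration annotation pipeline combines feature matching with RANSAC. The OpenCV sketch below shows that idea with ORB keypoints, Hamming matching, and homography estimation; the file names, ORB parameters, and choice of a planar homography model are assumptions for illustration, not the benchmark's exact configuration.

```python
import cv2
import numpy as np

# Load a visible/IR pair (paths are placeholders).
visible = cv2.imread("visible.png", cv2.IMREAD_GRAYSCALE)
infrared = cv2.imread("infrared.png", cv2.IMREAD_GRAYSCALE)

# Detect and describe ORB keypoints in both modalities.
orb = cv2.ORB_create(nfeatures=4000)
kp_v, des_v = orb.detectAndCompute(visible, None)
kp_ir, des_ir = orb.detectAndCompute(infrared, None)

# Brute-force Hamming matching with cross-check to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_v, des_ir), key=lambda m: m.distance)[:500]

# Estimate a homography from visible to IR coordinates with RANSAC.
src = np.float32([kp_v[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_ir[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)

# Warp the visible frame into the IR frame for coarse alignment; finer
# refinement (mutual information, optical flow) would follow this step.
h_ir, w_ir = infrared.shape
coarse_aligned = cv2.warpPerspective(visible, H, (w_ir, h_ir))
```

Cross-modal ORB matching is fragile, which is why the actual pipeline follows this coarse step with manual checks and automatic refinement.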

3. Evaluation Tasks and Metrics

MM-UAVBench covers a comprehensive suite of tasks:

  • MLLM General Intelligence (Dai et al., 29 Dec 2025):
    • Perception: scene/object classification, orientation, environment state, OCR, counting.
    • Cognition: object/scene/event reasoning (backtracking, intent, damage, flow prediction, event tracing, temporal ordering).
    • Planning: UAV-to-UAV coordination, air-ground route planning, multi-agent coverage.
  • Registration (Bin et al., 28 Jul 2025):
    • Pixel-wise alignment error: $E = \frac{1}{N}\sum_{i=1}^{N}\|p^v_i - T(p^{ir}_i)\|$.
    • Mask overlap: $\mathrm{IoU} = \frac{|\Omega^v \cap T(\Omega^{ir})|}{|\Omega^v \cup T(\Omega^{ir})|}$.
    • Success rate at $E < 1$ px (a metric sketch follows this list).
  • 3D Perception (Zou et al., 27 Nov 2025):
    • Detection: mAP at fixed 3D-IoU thresholds.
    • Pose: translation error $E_t = \|\hat{T} - T\|_2$, rotation error $E_r = 2\arccos(|\langle \hat{q}, q\rangle|)$, and size error (a pose/trajectory metric sketch follows this list).
    • Multi-object tracking: MOTA, MOTP, HOTA, IDF1.
    • Trajectory forecasting: ADE, FDE.
  • Task Assignment (Xiao et al., 2020):
    • Total reward, computational time; solution encoding $(\epsilon, \delta)$, flow and time consistency.
  • Tracking/Detection (Isaac-Medina et al., 2021, Zhang et al., 2022, Jiang et al., 2021):
    • Detection: COCO-style mAP, success plots, center-error precision.
    • Tracking: MOTA, state accuracy (SA), mean accuracy, cAUC (complete IoU), normalized precision.
    • Scenario constraints: occlusion, low-light, small targets, distortions, adversarial attacks.
    • Cross-modality: IR→RGB, RGB→IR transfer analysis.
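
A minimal sketch of the registration metrics defined above, assuming keypoint correspondences, an estimated transform T (here a 3×3 homography mapping IR pixels into the visible frame), and binary foreground masks are already available; the variable names and array layouts are illustrative.

```python
import numpy as np

def alignment_error(p_vis: np.ndarray, p_ir: np.ndarray, T: np.ndarray) -> float:
    """Mean pixel error E = (1/N) * sum_i ||p_vis_i - T(p_ir_i)||.

    p_vis, p_ir: (N, 2) arrays of corresponding visible/IR keypoints.
    T: 3x3 homography mapping IR coordinates into the visible frame.
    """
    ones = np.ones((p_ir.shape[0], 1))
    proj = (T @ np.hstack([p_ir, ones]).T).T  # project IR points homogeneously
    proj = proj[:, :2] / proj[:, 2:3]         # back to Cartesian pixel coordinates
    return float(np.linalg.norm(p_vis - proj, axis=1).mean())

def mask_iou(mask_vis: np.ndarray, mask_ir_warped: np.ndarray) -> float:
    """IoU between the visible-domain mask and the warped IR mask (boolean arrays)."""
    inter = np.logical_and(mask_vis, mask_ir_warped).sum()
    union = np.logical_or(mask_vis, mask_ir_warped).sum()
    return float(inter / union) if union else 0.0

def success_at_1px(per_pair_errors: list[float]) -> float:
    """Fraction of image pairs whose mean alignment error is below 1 px."""
    return float(np.mean([e < 1.0 for e in per_pair_errors]))
```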
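
The 6-DoF pose and trajectory-forecasting metrics can be written down just as directly. The sketch below assumes unit quaternions and trajectories stored as NumPy arrays of per-step positions; these conventions are assumptions rather than the benchmark's reference implementation.

```python
import numpy as np

def translation_error(t_hat: np.ndarray, t_gt: np.ndarray) -> float:
    """E_t = ||T_hat - T||_2, Euclidean distance between predicted and true positions."""
    return float(np.linalg.norm(t_hat - t_gt))

def rotation_error(q_hat: np.ndarray, q_gt: np.ndarray) -> float:
    """E_r = 2 * arccos(|<q_hat, q_gt>|) in radians, for unit quaternions."""
    dot = np.clip(abs(np.dot(q_hat, q_gt)), 0.0, 1.0)  # guard against rounding overflow
    return float(2.0 * np.arccos(dot))

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Average and final displacement error for a forecast trajectory.

    pred, gt: (T, D) arrays of predicted / ground-truth positions.
    ADE averages the per-step Euclidean error; FDE keeps only the last step.
    """
    per_step = np.linalg.norm(pred - gt, axis=1)
    return float(per_step.mean()), float(per_step[-1])
```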

4. Baseline Methods and Key Results

The MM-UAVBench suite provides extensive baseline evaluations:

  • MLLM Benchmarks (Dai et al., 29 Dec 2025):
    • Human performance: 80.4% avg; best proprietary/open-source models: 54.6% and 55.4% respectively.
    • Perception: 50–82%, cognition: 30–89%, planning: 22–51%.
    • Bottlenecks: fine-grained counting (20–36%), small-object reasoning, multi-view fusion (where fusing views can lower accuracy relative to a single view), and egocentric planning.
  • Registration (Bin et al., 28 Jul 2025):
    • Empirical: E < 1 px, IoU > 0.95 for aligned pairs.
    • Example algorithm (illustrative): SuperGlue yields E = 1.1 ± 0.3 px, IoU = 0.96, and 91% success at 1 px.
  • 3D Perception (Zou et al., 27 Nov 2025):
    • LGFusionNet: rotation error reduced from 17.84° to 10.57°, position error from 4.92 m to 3.64 m, and 3D detection mAP (at a fixed 3D-IoU threshold) improved from 17.38 to 26.72.
    • Trajectory: ADE/FDE for forecast horizons (1s, 3s, 5s) consistently improved over Kalman/LSTM baselines.
  • Task Assignment (Xiao et al., 2020):
    • GA is fastest; ACO attains the best reward on large instances but is slowest; PSO is intermediate (a toy assignment-scoring sketch follows this list).
  • Tracking/Detection (Jiang et al., 2021, Zhang et al., 2022, Isaac-Medina et al., 2021):
    • Detection: mAP up to 0.986 (Anti-UAV RGB, SSD512); cross-modality IR→RGB mAP of 0.828 (Faster R-CNN), with RGB→IR weaker (0.644).
    • Tracking: Tracktor best MOTA (up to 0.987), KeepTrack/AlphaRefine lead cAUC/mAcc for robust tracking.
    • Language/audio cues: current vision-language trackers gain roughly 2–4% AUC/cAUC from such cues but still trail pure-vision trackers by ~10%.
    • Transformer-based trackers (e.g., TransT) are the most robust under long occlusion; adversarial attacks degrade performance most severely.
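
To make the task-assignment comparison concrete, the toy sketch below scores one candidate multi-UAV assignment for total reward and checks its flow and time consistency. The instance representation (a depot at node 0 and a single shared travel-time budget) is an assumption for illustration; it does not reproduce the exact constrained team orienteering formulation or the GA/ACO/PSO metaheuristics themselves.

```python
import numpy as np

def evaluate_assignment(routes, travel_time, service_time, reward, t_max):
    """Score a candidate multi-UAV assignment for a team-orienteering-style instance.

    routes:       list of target-index sequences, one per UAV (node 0 is the depot).
    travel_time:  (n, n) matrix of pairwise travel times between nodes.
    service_time: per-node service durations.
    reward:       per-node rewards, collected at most once per target.
    t_max:        per-UAV travel-time budget.

    Returns (total_reward, feasible).
    """
    visited = set()
    total_reward, feasible = 0.0, True
    for route in routes:
        t, prev = 0.0, 0                          # each UAV starts at the depot
        for target in route:
            t += travel_time[prev, target] + service_time[target]
            if target in visited:                 # flow consistency: one visit per target
                feasible = False
            else:
                visited.add(target)
                total_reward += reward[target]
            prev = target
        t += travel_time[prev, 0]                 # return to the depot
        if t > t_max:                             # time consistency: respect the budget
            feasible = False
    return total_reward, feasible
```

A metaheuristic such as GA, ACO, or PSO would simply propose candidate route sets and rank them with a score of this kind.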

5. Methodological Innovations

  • Question generation pipeline (UrbanVideo-Bench/MLLMs) (Zhao et al., 8 Mar 2025, Dai et al., 29 Dec 2025):
    • Chain-of-thought (CoT) prompting with Gemini-1.5 Flash.
    • Multi-stage curation: narration generation, structured object/motion extraction, MCQ drafting, blind filtering, human refinement (~800h).
  • Semantic Modulation for Single-Class Tracking (DFSC) (Jiang et al., 2021):
    • Dual-flow consistency: cross-sequence class-level modulation and same-sequence instance-level refinement in a two-stage Siamese RPN+RCNN framework.
    • Training-only overhead; inference matches original tracker speed.
  • Multimodal Fusion (Zou et al., 27 Nov 2025, Isaac-Medina et al., 2021):
    • LGFusionNet (LiDAR-guided cross-branch alignment, KNN aggregation).
    • Radar/depth projection fusion with 2D/3D detection heads.
  • Scenario-constraint Annotation (Zhang et al., 2022):
    • Seven frame-wise difficulty indicators (low-light, occlusion, small target, adversarial, etc.).
    • UTUSC subtest protocol for stress-testing tracker generalization (a toy subtest-filtering sketch follows below).
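
The scenario-constraint idea can be illustrated by filtering sequences on per-frame difficulty flags, as in the toy sketch below; the attribute names and data layout are assumptions for illustration, not the dataset's actual schema or the UTUSC protocol itself.

```python
from collections import defaultdict

# Hypothetical per-frame attribute records, one dict per annotated frame.
frames = [
    {"seq": "uav_0001", "frame": 0, "low_light": True,  "occlusion": False, "small_target": True},
    {"seq": "uav_0001", "frame": 1, "low_light": True,  "occlusion": True,  "small_target": True},
    {"seq": "uav_0042", "frame": 0, "low_light": False, "occlusion": False, "small_target": False},
]

def build_subtest(frames, attribute, min_fraction=0.5):
    """Select sequences in which at least `min_fraction` of frames carry `attribute`.

    This mirrors the idea of stress-testing trackers on scenario-constrained
    splits (e.g. low-light or small-target subsets) without altering the data.
    """
    per_seq = defaultdict(list)
    for f in frames:
        per_seq[f["seq"]].append(bool(f[attribute]))
    return [seq for seq, flags in per_seq.items()
            if sum(flags) / len(flags) >= min_fraction]

# Example: sequences dominated by low-light frames.
print(build_subtest(frames, "low_light"))  # -> ['uav_0001']
```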

6. Limitations and Future Directions

All MM-UAVBench resources recognize significant current limitations and chart explicit directions for extension:

  • Data Scale: Most datasets (except tracking benchmarks) fall below the million-frame scale, potentially limiting deep network training.
  • Modality diversity: Some benchmarks restricted to RGB/IR; missing LiDAR, radar, audio, etc.
  • Metrics: Several do not include path efficiency, continuous spatial errors, or closed-loop control metrics.
  • Task realism: Static MCQ benchmarks, absence of agent-environment feedback.
  • Environmental variation: City diversity, weather, night/dusk conditions, and dynamic airspace constraints still limited.
  • Model robustness: MLLMs exhibit spatial bias, failures on small objects and multi-view fusion (“1+1<2” effect).

Future work targets larger-scale, richer-modality datasets; continuous-action RL and regression-style metrics; multi-agent coordination and dynamic obstacle avoidance; closed-loop evaluation infrastructure; stronger vision-language-audio fusion; and explicit geometric priors in MLLM training (Dai et al., 29 Dec 2025, Zhao et al., 8 Mar 2025, Bin et al., 28 Jul 2025, Zou et al., 27 Nov 2025, Zhang et al., 2022).

7. Impact and Significance in UAV AI Research

MM-UAVBench has established a baseline for quantitative, unified multimodal evaluation in UAV intelligence. By covering perception, registration, reasoning, planning, and robust multimodal tracking, it enables targeted diagnosis of failure modes—such as fine-grained counting, fusion deficits, and small-scale objects—across leading architectures, and fosters repeatable benchmarking for model innovation. The suite is foundational for the next generation of autonomous aerial systems, multimodal embodied agents, and robust multi-UAV coordination.

For original datasets, code, baselines, and further results, see (Dai et al., 29 Dec 2025, Zhao et al., 8 Mar 2025, Bin et al., 28 Jul 2025, Zou et al., 27 Nov 2025, Zhang et al., 2022, Jiang et al., 2021, Isaac-Medina et al., 2021, Xiao et al., 2020).
