UAVBench_MCQ: UAV Reasoning MCQ Benchmark
- UAVBench_MCQ is a reasoning-focused benchmark designed to assess UAV-centric cognitive and ethical decision-making through multiple-choice questions.
- It employs a structured LLM-driven scenario generation process that combines mission, airspace, weather, and UAV configuration axes with rigorous safety validation to produce realistic operational contexts.
- The benchmark integrates a diversified MCQ taxonomy across ten reasoning axes with reproducible performance metrics to evaluate agentic intelligence in autonomous aerial systems.
UAVBench_MCQ is a reasoning-focused multiple-choice question (MCQ) benchmark derived from the UAVBench dataset, which itself comprises 50,000 validated UAV flight scenarios generated via taxonomy-guided LLM prompting and multi-stage safety validation. This benchmark is specifically built for the interpretable and machine-checkable assessment of UAV-centric cognitive and ethical reasoning under realistic operational contexts. UAVBench_MCQ is organized around a rigorous scenario schema, a diversified MCQ taxonomy spanning ten major reasoning axes, and a set of reproducible performance metrics suitable for evaluating agentic intelligence in autonomous aerial systems (Ferrag et al., 14 Nov 2025).
1. Scenario Generation and Validation Framework
UAVBench_MCQ scenarios originate from a structured LLM-driven generation process over five scenario axes: mission category (𝒞_S), airspace type (𝒞_A), weather token (𝒞_E), UAV family (𝒞_U), and a random nonce. Each scenario is encoded in a unified JSON schema containing key fields such as mission objectives, environmental and airspace constraints, UAV configuration, spawn/waypoint definitions, entity lists (traffic, obstacles, swarms), fault injection parameters, control/action sets, and risk annotations.
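The five-axis sampling step can be illustrated with a short sketch. The axis values, field names, and prompt template below are illustrative assumptions for exposition, not the authors' released generation code:

```python
# Minimal sketch of taxonomy-guided scenario sampling over the five axes
# (mission C_S, airspace C_A, weather C_E, UAV family C_U, nonce).
# All axis values and the prompt wording are assumed for illustration.
import json
import random

MISSIONS = ["inspection", "delivery", "search_and_rescue", "mapping"]   # C_S (assumed values)
AIRSPACES = ["urban_low", "rural_open", "controlled_corridor"]          # C_A (assumed values)
WEATHER = ["clear", "gusty_wind", "fog", "icing"]                       # C_E (assumed values)
UAV_FAMILIES = ["quadrotor", "fixed_wing", "vtol_hybrid"]               # C_U (assumed values)

def sample_scenario_spec(rng: random.Random) -> dict:
    """Draw one point from the five scenario axes."""
    return {
        "mission": rng.choice(MISSIONS),
        "airspace": rng.choice(AIRSPACES),
        "weather": rng.choice(WEATHER),
        "uav_family": rng.choice(UAV_FAMILIES),
        "nonce": rng.randrange(2**32),  # randomization seed carried into the JSON schema
    }

def build_prompt(spec: dict) -> str:
    """Compose an LLM prompt requesting a scenario in the unified JSON schema."""
    return (
        "Generate a UAV flight scenario as JSON following the UAVBench schema.\n"
        f"Axes: {json.dumps(spec)}\n"
        "Include mission objectives, airspace/geofence limits, UAV configuration, "
        "waypoints, entities, fault injections, and safety constraints."
    )

rng = random.Random(271828)
print(build_prompt(sample_scenario_spec(rng)))
```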
Quality assurance is performed via a four-stage multi-pass pipeline:
- Schema and Type Checks: Enforces presence and type correctness for all mandatory fields.
- Operational Constraints: Ensures mission-type constraints match vehicle, airspace, and weather rules.
- Geometric Consistency: Validates that all waypoints reside within geofence and permitted altitude bounds.
- Safety Validation: Evaluates pairwise separation (d_ij ≥ d_min), time-to-collision (TTC_ij ≥ TTC_min), and calibrates fault timing and severity.
Each validated scenario is labeled with a discrete risk level (ρ ∈ {0,1,2,3}) and a domain-specific risk category (σ ∈ {Weather, Navigation, Energy, Collision-Avoidance}) identifying the dominant operational stressor.
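The safety-validation stage can be sketched as a pairwise check over entity states. The (position, velocity) representation below is an assumption for illustration; the d_min = 5.0 and TTC_min = 2.0 defaults match the schema example in the next section:

```python
# Minimal sketch of the safety-validation stage: d_ij >= d_min and TTC_ij >= TTC_min
# for every entity pair. The entity representation is an illustrative assumption.
import math

def pairwise_safe(p1, v1, p2, v2, d_min=5.0, ttc_min=2.0):
    """Check separation and time-to-collision for one entity pair."""
    rel_p = [a - b for a, b in zip(p1, p2)]
    rel_v = [a - b for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(c * c for c in rel_p))
    # Closing speed: rate at which the pair's separation is currently shrinking.
    closing_speed = -sum(p * v for p, v in zip(rel_p, rel_v)) / max(dist, 1e-9)
    ttc = dist / closing_speed if closing_speed > 0 else math.inf
    return dist >= d_min and ttc >= ttc_min

def scenario_is_safe(entities, d_min=5.0, ttc_min=2.0):
    """entities: list of (position, velocity) pairs in metres and metres per second."""
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            (pi, vi), (pj, vj) = entities[i], entities[j]
            if not pairwise_safe(pi, vi, pj, vj, d_min, ttc_min):
                return False
    return True

# Two UAVs 8 m apart converging head-on at 3 m/s each: separation passes,
# but TTC ≈ 1.3 s < 2.0 s, so the check fails and prints False.
print(scenario_is_safe([((0, 0, 30), (3, 0, 0)), ((8, 0, 30), (-3, 0, 0))]))
```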
2. Unified Scenario Schema and Data Organization
Each UAVBench_MCQ scenario instance adheres to a detailed JSON schema organizing all relevant operational data:
| Field | Description (Condensed) | Example |
|---|---|---|
| name | Scenario identifier | "inspect_bridge_042" |
| seed | Randomization seed | 271828 |
| sim | Simulation properties (dt, N, f_c) | {dt: 0.02, N: 1000, f_c: 20} |
| uav | Vehicle config: type, mass, battery, sensors, etc. | {type: "quadrotor", mass: 3.2, ...} |
| environment | Weather and EMI/jamming parameters | {weather: {...}, jamming_dBm: -∞} |
| airspace | Limits and geofence coordinates | {h_min: 10.0, h_max: 120.0, ...} |
| mission | Task type, waypoints, pattern, time budget | {type: "inspection", waypoints: [...]} |
| entities | Traffic, obstacles, swarm definitions | {traffic: [...], obstacles: [...]} |
| safety | Separation and collision constraints | {d_min: 5.0, TTC_min: 2.0} |
| faults | List of injected faults and severities | [{t0: 650, type: "icing", sev: 2}, ...] |
| risk | Risk level and category | {level: 1, category: "Weather"} |
This structure enables scenario diversity and risk stratification, and supports fine-grained reasoning across heterogeneous UAV operational domains.
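A condensed, hypothetical instance of this schema (written as a Python dict) might look as follows; all values and any sub-field names beyond those listed in the table above are illustrative assumptions, not entries from the released dataset:

```python
# Hypothetical scenario instance mirroring the unified schema; values are illustrative.
scenario = {
    "name": "inspect_bridge_042",
    "seed": 271828,
    "sim": {"dt": 0.02, "N": 1000, "f_c": 20},
    "uav": {"type": "quadrotor", "mass": 3.2,
            "battery_Wh": 120.0, "sensors": ["gps", "imu", "lidar"]},   # assumed sub-fields
    "environment": {"weather": {"wind_mps": 6.5, "visibility_m": 900},  # assumed sub-fields
                    "jamming_dBm": float("-inf")},
    "airspace": {"h_min": 10.0, "h_max": 120.0,
                 "geofence": [[47.10, 8.50], [47.10, 8.52], [47.12, 8.52], [47.12, 8.50]]},
    "mission": {"type": "inspection",
                "waypoints": [[47.11, 8.510, 35.0], [47.11, 8.515, 40.0]],
                "time_budget_s": 900},
    "entities": {"traffic": [], "obstacles": [{"kind": "pylon", "h": 55.0}], "swarm": []},
    "safety": {"d_min": 5.0, "TTC_min": 2.0},
    "faults": [{"t0": 650, "type": "icing", "sev": 2}],
    "risk": {"level": 1, "category": "Weather"},
}

print(scenario["safety"], scenario["risk"])
```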
3. MCQ Taxonomy and Question Styles
UAVBench_MCQ implements an extensive taxonomy with ten reasoning styles to probe distinct cognitive faculties of AI agents:
| ID | Style Name | Reasoning Focus |
|---|---|---|
| 1 | Aerodynamics & Physics | Flight mechanics, dynamics |
| 2 | Navigation & Path Planning | Waypoint selection, planning |
| 3 | Mission Policy & Compliance | Rule/guideline adherence |
| 4 | Environmental Sensing & Fusion | Multi-sensor integration |
| 5 | Multi-Agent Coordination | Swarm interaction, separation |
| 6 | Cyber-Physical Security | Jamming, spoofing resilience |
| 7 | Energy & Resource Management | Power, battery, fuel tradeoffs |
| 8 | Ethical & Safety Decisions | Risk, trade-offs, governance |
| 9 | Comparative System Reasoning | System/algorithm comparison |
| 10 | Hybrid Integrated Reasoning | Multi-style composition |
Each MCQ encodes scenario context, reasoning style, prompt, 4–7 distractor options, ground-truth answer, short explanation, and associated risk level, enabling interpretable and style-balanced evaluation.
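A hypothetical MCQ record illustrating these fields is sketched below; the key names and the example question are assumptions for exposition, not items from the released corpus:

```python
# Hypothetical MCQ record with the fields described above (illustrative only).
mcq_item = {
    "scenario": "inspect_bridge_042",
    "style_id": 7,  # Energy & Resource Management
    "question": ("Battery reserve drops to 22% with three waypoints remaining. "
                 "Which action best balances mission completion and a safe return?"),
    "options": [
        "A. Continue the full plan at current speed",
        "B. Skip the farthest waypoint and return via the shortest geofenced path",
        "C. Land immediately at the current position",
        "D. Increase speed to finish the plan faster",
        "E. Hover and await operator input indefinitely",
    ],
    "answer": "B",
    "explanation": ("Dropping the lowest-value waypoint preserves the energy margin "
                    "required for a safe return-to-home."),
    "risk_level": 1,
}
```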
4. Evaluation Metrics and Formalisms
UAVBench_MCQ quantifies model performance via several metrics:
- Overall Accuracy: Acc = N_correct / N, the fraction of correct answers over all N questions.
- Per-Style Accuracy: Acc_s, the fraction of correct responses within reasoning style s.
- Mean Accuracy: the unweighted average of Acc_s across all ten styles.
- Balanced Style Score (BSS): combines mean accuracy with the standard deviation σ of the per-style accuracies, using a small ε to avoid zero divisions, so that models with uneven performance across styles are penalized.
- Risk Quantification: a composite per-scenario risk score whose terms tally the maximum injected fault severity and cumulative environmental stressors (wind, visibility, swarm size).
This suite captures both task-level performance and robustness to domain and risk heterogeneity.
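A minimal evaluation sketch following these definitions is shown below; the exact BSS expression is not reproduced above, so the way the code combines mean accuracy with per-style spread is an assumed form, not the paper's definitive formula:

```python
# Minimal sketch of overall, per-style, and mean accuracy plus an assumed BSS form.
from collections import defaultdict
from statistics import mean, pstdev

def evaluate(records, eps=1e-6):
    """records: iterable of (style_id, predicted_answer, gold_answer) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for style, pred, gold in records:
        total[style] += 1
        correct[style] += int(pred == gold)

    per_style = {s: correct[s] / total[s] for s in total}   # Acc_s
    overall = sum(correct.values()) / sum(total.values())   # Acc over all questions
    mean_acc = mean(per_style.values())                     # unweighted mean over styles
    sigma = pstdev(per_style.values())                      # spread of per-style accuracies
    # Assumed BSS form: mean accuracy discounted by its relative spread (eps avoids zero division).
    bss = mean_acc / (1.0 + sigma / (mean_acc + eps))
    return {"overall": overall, "mean": mean_acc, "sigma": sigma,
            "bss": bss, "per_style": per_style}

# Toy usage: two styles, three questions each.
demo = [(1, "B", "B"), (1, "C", "B"), (1, "A", "A"),
        (2, "D", "D"), (2, "D", "D"), (2, "A", "C")]
print(evaluate(demo))
```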
5. Experimental Benchmarking and Results
Thirty-two state-of-the-art LLMs were evaluated zero-shot across the full 50,000 MCQ corpus (5,000 per style). Notable results include:
- Styles 1 & 4 (Perception/Physics): Qwen3 235B A22B reached 89.8% mean accuracy, ChatGPT-4o 85.5%, and GPT-5 Chat 85.3%. Sensor-fusion items (>95%) surpassed raw aerodynamics (~75–82%).
- Styles 2, 5, 7 (Planning/Coordination/Energy): Qwen3 235B A22B led at 76.5% avg, with planning outperforming multi-agent and energy.
- Styles 3, 6, 8 (Governance/Ethics/Security): Cyber-physical security nearly saturated (95–98%), with policy and ethics lagging (65–76%).
- Styles 9,10 (Systems/Integration): Comparative reasoning dominates (95–97%), hybrid integrated style lower (74–83%).
- Aggregate: Highest Balanced Style Score (BSS=0.74) for Qwen3 235B A22B, demonstrating both accuracy and low style variance.
Persistent weaknesses are observed in energy/resource management, multi-agent coordination, and ethical trade-offs, reflecting current LLM training gaps.
6. Limitations and Extensions
Limitations include:
- Static, Textual Modeling: MCQs evaluate snapshot reasoning, omitting temporal rollouts and continuous control dynamics.
- No Multimodal Inputs: Benchmark uses parametric schemas only (no imagery, LiDAR, or real sensor logs).
- Hazard Coverage: Existing faults are fixed; adversarial or dynamic hazards are absent.
Future extensions propose multimodal data (images, video), temporal reasoning benchmarks, expanded hazard and adversarial scenarios, RL baselines, human-in-the-loop validation, and real flight-log integration to bridge simulation and deployment.
7. Significance and Research Impact
UAVBench_MCQ represents the first large-scale, physically grounded, style-diversified MCQ suite for UAV agentic reasoning. By explicitly encoding scenario contexts, domain risks, and style-balanced logic, it facilitates rigorous, reproducible benchmarking for next-generation AI autonomy in aerial systems. The observed model performance trends provide actionable insight for research on robust, interpretable, and ethically sound UAV cognition, guiding future algorithmic, architectural, and policy development (Ferrag et al., 14 Nov 2025).