UAVBench_MCQ: UAV Reasoning MCQ Benchmark
- UAVBench_MCQ is a reasoning-focused benchmark designed to assess UAV-centric cognitive and ethical decision-making through multiple-choice questions.
- It employs a structured LLM-driven scenario generation process that combines mission, airspace, weather, and UAV configuration axes with rigorous safety validation to produce realistic operational contexts.
- The benchmark integrates a diversified MCQ taxonomy across ten reasoning axes with reproducible performance metrics to evaluate agentic intelligence in autonomous aerial systems.
UAVBench_MCQ is a reasoning-focused multiple-choice question (MCQ) benchmark derived from the UAVBench dataset, which itself comprises 50,000 validated UAV flight scenarios generated via taxonomy-guided LLM prompting and multi-stage safety validation. This benchmark is specifically built for the interpretable and machine-checkable assessment of UAV-centric cognitive and ethical reasoning under realistic operational contexts. UAVBench_MCQ is organized around a rigorous scenario schema, a diversified MCQ taxonomy spanning ten major reasoning axes, and a set of reproducible performance metrics suitable for evaluating agentic intelligence in autonomous aerial systems (Ferrag et al., 14 Nov 2025).
1. Scenario Generation and Validation Framework
UAVBench_MCQ scenarios originate from a structured LLM-driven generation process over five scenario axes: mission category (𝒞_S), airspace type (𝒞_A), weather token (𝒞_E), UAV family (𝒞_U), and a random nonce. Each scenario is encoded in a unified JSON schema containing key fields such as mission objectives, environmental and airspace constraints, UAV configuration, spawn/waypoint definitions, entity lists (traffic, obstacles, swarms), fault injection parameters, control/action sets, and risk annotations.
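The five-axis sampling step can be illustrated with a short sketch. The axis values, field names, and prompt template below are illustrative assumptions for exposition, not the authors' released generation code:

```python
# Minimal sketch of taxonomy-guided scenario sampling over the five axes
# (mission C_S, airspace C_A, weather C_E, UAV family C_U, nonce).
# All axis values and the prompt wording are assumed for illustration.
import json
import random

MISSIONS = ["inspection", "delivery", "search_and_rescue", "mapping"]   # C_S (assumed values)
AIRSPACES = ["urban_low", "rural_open", "controlled_corridor"]          # C_A (assumed values)
WEATHER = ["clear", "gusty_wind", "fog", "icing"]                       # C_E (assumed values)
UAV_FAMILIES = ["quadrotor", "fixed_wing", "vtol_hybrid"]               # C_U (assumed values)

def sample_scenario_spec(rng: random.Random) -> dict:
    """Draw one point from the five scenario axes."""
    return {
        "mission": rng.choice(MISSIONS),
        "airspace": rng.choice(AIRSPACES),
        "weather": rng.choice(WEATHER),
        "uav_family": rng.choice(UAV_FAMILIES),
        "nonce": rng.randrange(2**32),  # randomization seed carried into the JSON schema
    }

def build_prompt(spec: dict) -> str:
    """Compose an LLM prompt requesting a scenario in the unified JSON schema."""
    return (
        "Generate a UAV flight scenario as JSON following the UAVBench schema.\n"
        f"Axes: {json.dumps(spec)}\n"
        "Include mission objectives, airspace/geofence limits, UAV configuration, "
        "waypoints, entities, fault injections, and safety constraints."
    )

rng = random.Random(271828)
print(build_prompt(sample_scenario_spec(rng)))
```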
Quality assurance is performed via a four-stage multi-pass pipeline:
- Schema and Type Checks: Enforces presence and type correctness for all mandatory fields.
- Operational Constraints: Ensures mission-type constraints match vehicle, airspace, and weather rules.
- Geometric Consistency: Validates that all waypoints reside within geofence and permitted altitude bounds.
- Safety Validation: Evaluates pairwise separation (d_ij ≥ d_min), time-to-collision (TTC_ij ≥ TTC_min), and calibrates fault timing and severity.
Each validated scenario is labeled with a discrete risk level (ρ ∈ {0,1,2,3}) and a domain-specific risk category (σ ∈ {Weather, Navigation, Energy, Collision-Avoidance}) identifying the dominant operational stressor.
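The safety-validation stage can be sketched as a pairwise check over entity states. The (position, velocity) representation below is an assumption for illustration; the d_min = 5.0 and TTC_min = 2.0 defaults match the schema example in the next section:

```python
# Minimal sketch of the safety-validation stage: d_ij >= d_min and TTC_ij >= TTC_min
# for every entity pair. The entity representation is an illustrative assumption.
import math

def pairwise_safe(p1, v1, p2, v2, d_min=5.0, ttc_min=2.0):
    """Check separation and time-to-collision for one entity pair."""
    rel_p = [a - b for a, b in zip(p1, p2)]
    rel_v = [a - b for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(c * c for c in rel_p))
    # Closing speed: rate at which the pair's separation is currently shrinking.
    closing_speed = -sum(p * v for p, v in zip(rel_p, rel_v)) / max(dist, 1e-9)
    ttc = dist / closing_speed if closing_speed > 0 else math.inf
    return dist >= d_min and ttc >= ttc_min

def scenario_is_safe(entities, d_min=5.0, ttc_min=2.0):
    """entities: list of (position, velocity) pairs in metres and metres per second."""
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            (pi, vi), (pj, vj) = entities[i], entities[j]
            if not pairwise_safe(pi, vi, pj, vj, d_min, ttc_min):
                return False
    return True

# Two UAVs 8 m apart converging head-on at 3 m/s each: separation passes,
# but TTC ≈ 1.3 s < 2.0 s, so the check fails and prints False.
print(scenario_is_safe([((0, 0, 30), (3, 0, 0)), ((8, 0, 30), (-3, 0, 0))]))
```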
2. Unified Scenario Schema and Data Organization
Each UAVBench_MCQ scenario instance adheres to a detailed JSON schema organizing all relevant operational data:
| Field | Description (Condensed) | Example |
|---|---|---|
| name | Scenario identifier | "inspect_bridge_042" |
| seed | Randomization seed | 271828 |
| sim | Simulation properties (dt, N, f_c) | {dt: 0.02, N: 1000, f_c: 20} |
| uav | Vehicle config: type, mass, battery, sensors, etc. | {type: "quadrotor", mass: 3.2, ...} |
| environment | Weather and EMI/jamming parameters | {weather: {...}, jamming_dBm: -∞} |
| airspace | Limits and geofence coordinates | {h_min: 10.0, h_max: 120.0, ...} |
| mission | Task type, waypoints, pattern, time budget | {type: "inspection", waypoints: [...]} |
| entities | Traffic, obstacles, swarm definitions | {traffic: [...], obstacles: [...]} |
| safety | Separation and collision constraints | {d_min: 5.0, TTC_min: 2.0} |
| faults | List of injected faults and severities | [{t0: 650, type: "icing", sev: 2}, ...] |
| risk | Risk level and category | {level: 1, category: "Weather"} |
This structure enables scenario diversity and risk stratification, and supports fine-grained reasoning across heterogeneous UAV operational domains.
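A condensed, hypothetical instance of this schema (written as a Python dict) might look as follows; all values and any sub-field names beyond those listed in the table above are illustrative assumptions, not entries from the released dataset:

```python
# Hypothetical scenario instance mirroring the unified schema; values are illustrative.
scenario = {
    "name": "inspect_bridge_042",
    "seed": 271828,
    "sim": {"dt": 0.02, "N": 1000, "f_c": 20},
    "uav": {"type": "quadrotor", "mass": 3.2,
            "battery_Wh": 120.0, "sensors": ["gps", "imu", "lidar"]},   # assumed sub-fields
    "environment": {"weather": {"wind_mps": 6.5, "visibility_m": 900},  # assumed sub-fields
                    "jamming_dBm": float("-inf")},
    "airspace": {"h_min": 10.0, "h_max": 120.0,
                 "geofence": [[47.10, 8.50], [47.10, 8.52], [47.12, 8.52], [47.12, 8.50]]},
    "mission": {"type": "inspection",
                "waypoints": [[47.11, 8.510, 35.0], [47.11, 8.515, 40.0]],
                "time_budget_s": 900},
    "entities": {"traffic": [], "obstacles": [{"kind": "pylon", "h": 55.0}], "swarm": []},
    "safety": {"d_min": 5.0, "TTC_min": 2.0},
    "faults": [{"t0": 650, "type": "icing", "sev": 2}],
    "risk": {"level": 1, "category": "Weather"},
}

print(scenario["safety"], scenario["risk"])
```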
3. MCQ Taxonomy and Question Styles
UAVBench_MCQ implements an extensive taxonomy with ten reasoning styles to probe distinct cognitive faculties of AI agents:
| ID | Style Name | Reasoning Focus |
|---|---|---|
| 1 | Aerodynamics & Physics | Flight mechanics, dynamics |
| 2 | Navigation & Path Planning | Waypoint selection, planning |
| 3 | Mission Policy & Compliance | Rule/guideline adherence |
| 4 | Environmental Sensing & Fusion | Multi-sensor integration |
| 5 | Multi-Agent Coordination | Swarm interaction, separation |
| 6 | Cyber-Physical Security | Jamming, spoofing resilience |
| 7 | Energy & Resource Management | Power, battery, fuel tradeoffs |
| 8 | Ethical & Safety Decisions | Risk, trade-offs, governance |
| 9 | Comparative System Reasoning | System/algorithm comparison |
| 10 | Hybrid Integrated Reasoning | Multi-style composition |
Each MCQ encodes scenario context, reasoning style, prompt, 4–7 distractor options, ground-truth answer, short explanation, and associated risk level, enabling interpretable and style-balanced evaluation.
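A hypothetical MCQ record illustrating these fields is sketched below; the key names and the example question are assumptions for exposition, not items from the released corpus:

```python
# Hypothetical MCQ record with the fields described above (illustrative only).
mcq_item = {
    "scenario": "inspect_bridge_042",
    "style_id": 7,  # Energy & Resource Management
    "question": ("Battery reserve drops to 22% with three waypoints remaining. "
                 "Which action best balances mission completion and a safe return?"),
    "options": [
        "A. Continue the full plan at current speed",
        "B. Skip the farthest waypoint and return via the shortest geofenced path",
        "C. Land immediately at the current position",
        "D. Increase speed to finish the plan faster",
        "E. Hover and await operator input indefinitely",
    ],
    "answer": "B",
    "explanation": ("Dropping the lowest-value waypoint preserves the energy margin "
                    "required for a safe return-to-home."),
    "risk_level": 1,
}
```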
4. Evaluation Metrics and Formalisms
UAVBench_MCQ quantifies model performance via several metrics:
- Overall Accuracy: Acc = N_correct / N, the fraction of correct answers over all N questions.
- Per-Style Accuracy: Acc_s, the fraction of correct responses within reasoning style s.
- Mean Accuracy: the unweighted average of Acc_s across all ten styles.
- Balanced Style Score (BSS): combines mean accuracy with the standard deviation σ of the per-style accuracies, using a small ε to avoid zero divisions, so that models with uneven performance across styles are penalized.
- Risk Quantification: a composite per-scenario risk score whose terms tally the maximum injected fault severity and cumulative environmental stressors (wind, visibility, swarm size).
This suite captures both task-level performance and robustness to domain and risk heterogeneity.
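A minimal evaluation sketch following these definitions is shown below; the exact BSS expression is not reproduced above, so the way the code combines mean accuracy with per-style spread is an assumed form, not the paper's definitive formula:

```python
# Minimal sketch of overall, per-style, and mean accuracy plus an assumed BSS form.
from collections import defaultdict
from statistics import mean, pstdev

def evaluate(records, eps=1e-6):
    """records: iterable of (style_id, predicted_answer, gold_answer) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for style, pred, gold in records:
        total[style] += 1
        correct[style] += int(pred == gold)

    per_style = {s: correct[s] / total[s] for s in total}   # Acc_s
    overall = sum(correct.values()) / sum(total.values())   # Acc over all questions
    mean_acc = mean(per_style.values())                     # unweighted mean over styles
    sigma = pstdev(per_style.values())                      # spread of per-style accuracies
    # Assumed BSS form: mean accuracy discounted by its relative spread (eps avoids zero division).
    bss = mean_acc / (1.0 + sigma / (mean_acc + eps))
    return {"overall": overall, "mean": mean_acc, "sigma": sigma,
            "bss": bss, "per_style": per_style}

# Toy usage: two styles, three questions each.
demo = [(1, "B", "B"), (1, "C", "B"), (1, "A", "A"),
        (2, "D", "D"), (2, "D", "D"), (2, "A", "C")]
print(evaluate(demo))
```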
5. Experimental Benchmarking and Results
Thirty-two state-of-the-art LLMs were evaluated zero-shot across the full 50,000 MCQ corpus (5,000 per style). Notable results include:
- Styles 1 & 4 (Perception/Physics): Qwen3 235B A22B reached 89.8% mean accuracy, ChatGPT-4o 85.5%, and GPT-5 Chat 85.3%. Sensor-fusion items (>95%) surpassed raw aerodynamics (~75–82%).
- Styles 2, 5, 7 (Planning/Coordination/Energy): Qwen3 235B A22B led at 76.5% avg, with planning outperforming multi-agent and energy.
- Styles 3, 6, 8 (Governance/Ethics/Security): Cyber-physical security nearly saturated (95–98%), with policy and ethics lagging (65–76%).
- Styles 9,10 (Systems/Integration): Comparative reasoning dominates (95–97%), hybrid integrated style lower (74–83%).
- Aggregate: Highest Balanced Style Score (BSS=0.74) for Qwen3 235B A22B, demonstrating both accuracy and low style variance.
Persistent weaknesses are observed in energy/resource management, multi-agent coordination, and ethical trade-offs, reflecting current LLM training gaps.
6. Limitations and Extensions
Limitations include:
- Static, Textual Modeling: MCQs evaluate snapshot reasoning, omitting temporal rollouts and continuous control dynamics.
- No Multimodal Inputs: Benchmark uses parametric schemas only (no imagery, LiDAR, or real sensor logs).
- Hazard Coverage: Existing faults are fixed; adversarial or dynamic hazards are absent.
Future extensions propose multimodal data (images, video), temporal reasoning benchmarks, expanded hazard and adversarial scenarios, RL baselines, human-in-the-loop validation, and real flight-log integration to bridge simulation and deployment.
7. Significance and Research Impact
UAVBench_MCQ represents the first large-scale, physically grounded, style-diversified MCQ suite for UAV agentic reasoning. By explicitly encoding scenario contexts, domain risks, and style-balanced logic, it facilitates rigorous, reproducible benchmarking for next-generation AI autonomy in aerial systems. The observed model performance trends provide actionable insight for research on robust, interpretable, and ethically sound UAV cognition, guiding future algorithmic, architectural, and policy development (Ferrag et al., 14 Nov 2025).