AgentDrive-MCQ Benchmark

Updated 2 March 2026
  • AgentDrive-MCQ is a reasoning-centric multiple-choice benchmark that uses validated simulation scenarios to test LLM decision-making in autonomous driving.
  • It integrates five distinct reasoning styles—physics, policy, hybrid, interpretive, and comparative—with strict schema and quality controls.
  • The automated pipeline ensures reproducible scenario validation and MCQ generation, providing detailed metrics on safety compliance and quantitative reasoning.

AgentDrive-MCQ is a large-scale, reasoning-centric multiple-choice question (MCQ) benchmark built atop the AgentDrive autonomous driving simulation suite to assess the decision-making and reasoning capabilities of LLMs in safety-critical, structured domains. Spanning 100,000 MCQs across five compositional reasoning styles and covering diverse, simulation-grounded driving scenarios, AgentDrive-MCQ enables systematic large-scale evaluation of physics, policy, hybrid, interpretive, and comparative reasoning in agentic AI systems (Ferrag et al., 23 Jan 2026). The benchmark’s generation and evaluation pipeline integrates scenario validation, LLM-driven question synthesis, and explicit rationale generation, with stringent distributional and quality controls.

1. Design Objectives and Benchmark Scope

AgentDrive-MCQ is developed to address the lack of principled, large-scale, and safety-critical benchmarks for agentic AI systems, specifically regarding how LLMs handle structured reasoning in autonomous driving contexts. Unlike traditional scenario-only evaluation, AgentDrive-MCQ derives each question directly from validated simulation scenarios, so that each MCQ is grounded in a concrete, physically consistent, and rule-labeled driving episode (Ferrag et al., 23 Jan 2026). The dataset has a dual focus: (i) cognitive and quantitative reasoning, testing LLMs’ physics competence, policy awareness, and interpretive and comparative acumen; and (ii) robust, reproducible generation workflows with provenance and rigorous evaluation criteria (An, 21 Feb 2026).

2. Scenario Generation and MCQ Construction Pipeline

AgentDrive-MCQ builds its 100,000-question corpus via a three-stage, LLM-orchestrated workflow:

  1. Scenario Curation: Each input is a formally validated scenario JSON output by AgentDrive-Sim, annotated along seven factorized axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. These axes ensure entropy-maximized, approximately uniform coverage across scenario classes.
  2. Scenario Description Synthesis: An LLM is prompted to generate a concise, natural-language description (10–12 sentences, including explicit vehicle states and events) from the scenario JSON.
  3. Reasoning-Style-Constrained MCQ Generation: For each description, separate LLM calls synthesize one MCQ per reasoning style (physics, policy, hybrid, scenario, comparative). Each call must produce exactly four answer choices, a single gold index $i^*$, and a rationale string. Strict schema, length, and distinctness checks are enforced, with error-guided retries (up to $R=5$).

All MCQs are persisted as structured JSON, with unique SHA-256-based identifiers to guarantee traceability and replayability (Ferrag et al., 23 Jan 2026).
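
The validation, retry, and identifier steps described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the field names (`choices`, `gold_index`, `rationale`, `style`), the `generate` callback, and the choice to hash a canonical JSON serialisation are all assumptions, and $R=5$ is treated here as a budget of five attempts.

```python
import hashlib
import json

STYLES = {"physics", "policy", "hybrid", "scenario", "comparative"}
MAX_RETRIES = 5  # R = 5 in the text; interpreted here as total attempts (assumption)


def validate_mcq(item):
    """Return a list of error strings; an empty list means the item passes."""
    errors = []
    choices = item.get("choices")
    if not isinstance(choices, list) or len(choices) != 4:
        errors.append("exactly four answer choices are required")
    elif len({str(c).strip().lower() for c in choices}) != 4:
        errors.append("answer choices must be distinct")
    gold = item.get("gold_index")
    if isinstance(gold, bool) or not isinstance(gold, int) or not 0 <= gold <= 3:
        errors.append("gold_index must be an integer in 0..3")
    if not str(item.get("rationale", "")).strip():
        errors.append("rationale must be a non-empty string")
    if item.get("style") not in STYLES:
        errors.append("unknown reasoning style")
    return errors


def mcq_id(item):
    """SHA-256 hex digest of a canonical JSON serialisation (hypothetical scheme)."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def generate_with_retries(generate, description, style, max_retries=MAX_RETRIES):
    """Call `generate` until the item validates or the attempt budget is spent."""
    errors = []
    for _ in range(max_retries):
        # The previous attempt's errors are passed back to guide the retry.
        item = generate(description, style, errors)
        errors = validate_mcq(item)
        if not errors:
            return {**item, "id": mcq_id(item)}
    return None  # rejected after exhausting the budget
```

Here `generate` stands in for the LLM call; any callable taking `(description, style, errors)` and returning a dict can be plugged in, which keeps the validation logic testable without a model.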

3. Reasoning Dimensions and Style Formalizations

Each AgentDrive-MCQ question is associated with one of five formal reasoning styles and mapped to a difficulty band (easy/medium/hard, balanced at ≈33,333 per band):

  • Physics: Requires explicit numeric computation using scenario-provided kinematics (e.g., time-to-collision $TTC(t) = (x_V(t)-x_E(t))/(v_E(t)-v_V(t))$ if $v_E > v_V$ and $x_V > x_E$; $TTC_{min} = \min_t TTC(t)$). Questions demand precise calculation of stopping distances, collision times, or braking margins.
  • Policy: Focuses on traffic law, rule-following, and normative rules (e.g., “2-second” headway, right-of-way at intersections). Key factual relationships are non-numeric.
  • Hybrid: Integrates physics and policy, e.g., combining the physical minimum headway $g_{phys} = v_E^2/(2a_{max})$ with a policy margin $\tau_{policy}\cdot v_E$ to yield $g_{hybrid} = g_{phys} + \tau_{policy}\cdot v_E$.
  • Scenario (Interpretive): Requires hazard identification or prioritization within a described situation (e.g., “Which factor is the greatest immediate risk?”). The correct response depends on contextual awareness and qualitative inference.
  • Comparative: Asks for optimal maneuver selection given multiple candidates (e.g., brake, lane change, accelerate), requiring comparative and optimization-based judgment under uncertainty.
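
To make the physics and hybrid formulas concrete, the sketch below evaluates them for one hand-picked situation. It assumes that the subscript E denotes the ego vehicle and V the vehicle ahead, and that SI units are used; the function names and the numbers are illustrative only and are not taken from the benchmark.

```python
import math


def ttc(x_v, x_e, v_v, v_e):
    """Time-to-collision in seconds: (x_V - x_E) / (v_E - v_V).

    Defined only when the ego is behind (x_V > x_E) and closing (v_E > v_V);
    otherwise math.inf is returned to mean "no collision course".
    """
    if v_e > v_v and x_v > x_e:
        return (x_v - x_e) / (v_e - v_v)
    return math.inf


def hybrid_gap(v_e, a_max, tau_policy):
    """g_hybrid = v_E^2 / (2 a_max) + tau_policy * v_E, in metres."""
    g_phys = v_e ** 2 / (2.0 * a_max)
    return g_phys + tau_policy * v_e


# Ego at 0 m doing 20 m/s; lead vehicle at 40 m doing 12 m/s.
print(ttc(x_v=40.0, x_e=0.0, v_v=12.0, v_e=20.0))        # 40 / 8 = 5.0 s
# Braking limit 8 m/s^2 combined with a 2-second policy headway.
print(hybrid_gap(v_e=20.0, a_max=8.0, tau_policy=2.0))   # 25 + 40 = 65.0 m
```

In this example the policy term (40 m) is larger than the physical stopping distance (25 m), which is the kind of interaction a hybrid question is meant to probe.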

Examples for each style are provided within the benchmark documentation; see Table 1 for concrete question breakdowns.

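| Style | Example Task | Key Formula / Principle |
|---|---|---|
| Physics | Compute $TTC_{min}$ | $TTC(t) = (x_V(t)-x_E(t))/(v_E(t)-v_V(t))$, $TTC_{min} = \min_t TTC(t)$ |
| Policy | Headway compliance | Headway rules (e.g., “2-second” headway) |
| Hybrid | Compute composite safe margin | $g_{hybrid} = g_{phys} + \tau_{policy}\cdot v_E$ |
| Scenario | Hazard identification | Qualitative inference |
| Comparative | Select optimal driver action | Utility and safety tradeoffs |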

4. Surrogate Metric Computation and Rule-Based Labeling

Every MCQ is grounded in a scenario whose simulation rollout is analyzed to generate surrogate safety metrics and discrete event labels. The primary quantitative metric is minimum time-to-collision ($TTC_{min}$, defined when $v_E > v_V$ and $x_V > x_E$). Four additional binary events are labeled per rollout; the event set is used to derive task outcome labels ({unsafe, safe_goal, safe_stop, inefficient}), which provide context for both MCQ rationale and scenario grounding (Ferrag et al., 23 Jan 2026).
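
A rough sketch of how a rollout could be reduced to a minimum time-to-collision and an outcome label is given below. Only the four label names come from the text. The event flags (`collision`, `goal_reached`, `stopped`), the precedence order, and the 1.5 s threshold are hypothetical placeholders chosen for illustration, not the benchmark's actual rules.

```python
import math


def min_ttc(rollout):
    """Minimum TTC over a rollout of (x_v, x_e, v_v, v_e) samples.

    Time steps where the ego is not closing on a vehicle ahead are skipped;
    math.inf is returned if no step is on a collision course.
    """
    best = math.inf
    for x_v, x_e, v_v, v_e in rollout:
        if v_e > v_v and x_v > x_e:
            best = min(best, (x_v - x_e) / (v_e - v_v))
    return best


def outcome_label(collision, goal_reached, stopped, ttc_min, ttc_unsafe=1.5):
    """Map event flags to an outcome label.

    Hypothetical precedence: safety violations first, then goal, then stop.
    `ttc_unsafe` is an assumed threshold in seconds.
    """
    if collision or ttc_min < ttc_unsafe:
        return "unsafe"
    if goal_reached:
        return "safe_goal"
    if stopped:
        return "safe_stop"
    return "inefficient"


rollout = [
    (40.0, 0.0, 12.0, 20.0),   # closing: TTC = 5.0 s
    (30.0, 2.0, 10.0, 20.0),   # closing: TTC = 2.8 s
    (25.0, 10.0, 20.0, 15.0),  # lead is faster: skipped
]
label = outcome_label(False, True, False, min_ttc(rollout))
```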

5. Distributional Statistics and Dataset Coverage

AgentDrive-MCQ comprises exactly 100,000 MCQs, evenly divided across the five reasoning styles (20,000 each). The distribution across scenario axes (type, behavior, environment, etc.) is approximately uniform by entropy maximization, with scenario selection and question generation coupled to maintain label balance and difficulty stratification.

Three difficulty levels (easy, medium, hard) are explicitly enforced at dataset construction by controlling scenario conditions and required calculation or judgment depth (≈33,333 per band). This granularity enables reliable stratified evaluation of model performance as a function of complexity, safety-criticality, and reasoning required (Ferrag et al., 23 Jan 2026).
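
One simple way to check the "approximately uniform" claim on any axis (style, difficulty band, scenario type) is normalized Shannon entropy, which equals 1.0 for a perfectly balanced distribution. The helper below is an illustrative check under that assumption; it is not described in the source as part of the benchmark tooling.

```python
import math
from collections import Counter


def normalized_entropy(labels, num_categories=None):
    """Shannon entropy of the label distribution divided by its maximum.

    `num_categories` is the number of categories that should exist; when
    omitted, only the categories actually observed are counted, so a
    completely missing category would go unnoticed.
    """
    counts = Counter(labels)
    k = num_categories if num_categories is not None else len(counts)
    n = sum(counts.values())
    if k < 2 or n == 0:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


styles = ["physics", "policy", "hybrid", "scenario", "comparative"] * 4
balanced = normalized_entropy(styles, num_categories=5)           # close to 1.0
skewed = normalized_entropy(["easy"] * 8 + ["hard"] * 2, num_categories=3)
```

Passing the expected category count explicitly matters for a balance audit: with it, a band or style that is absent from the sample lowers the score instead of being silently ignored.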

6. Evaluation Protocols and Baseline Model Results

The benchmark’s evaluation protocol draws on a held-out test set with the same number of MCQs for each style, uniformly sampled across axes and difficulties. The evaluation metrics are:

  • Per-style accuracy for each reasoning style, defined as the fraction of that style's test questions for which the model selects the gold answer.
  • Overall Accuracy as the mean across styles.
  • Safety Compliance Rate (SCR): Mean accuracy on policy and scenario styles.
  • Situational Awareness Score (SAS): Mean of comparative, hybrid, and physics styles.
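
The four metrics above reduce to a few averages. The sketch below assumes predictions are available as `(style, predicted_index, gold_index)` tuples, which is a hypothetical record format; the groupings for SCR and SAS follow the definitions in the list.

```python
from statistics import mean


def style_accuracy(records):
    """Per-style accuracy from (style, predicted_index, gold_index) tuples."""
    totals, correct = {}, {}
    for style, pred, gold in records:
        totals[style] = totals.get(style, 0) + 1
        correct[style] = correct.get(style, 0) + int(pred == gold)
    return {s: correct[s] / totals[s] for s in totals}


def summarize(acc):
    """Overall accuracy, SCR, and SAS from a per-style accuracy mapping."""
    return {
        "overall": mean(list(acc.values())),                       # mean across styles
        "SCR": mean([acc["policy"], acc["scenario"]]),             # safety compliance
        "SAS": mean([acc["comparative"], acc["hybrid"], acc["physics"]]),
    }


records = [
    ("physics", 0, 1), ("physics", 2, 2),
    ("policy", 3, 3), ("policy", 1, 1),
    ("hybrid", 0, 0), ("hybrid", 1, 2),
    ("scenario", 2, 2), ("scenario", 0, 0),
    ("comparative", 3, 1), ("comparative", 2, 2),
]
summary = summarize(style_accuracy(records))
```

Because overall accuracy is a mean over styles rather than over questions, it stays comparable even if the per-style test sets were of unequal size.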

Fifty competitive LLMs are evaluated, including both proprietary frontier models (GPT-5 Chat, ChatGPT-4o, Gemini 2.5 Flash, Grok-4) and advanced open models (Qwen3 235B, Mistral Medium, ERNIE 4.5 300B). Frontier models reach overall accuracy up to 82.5% (ChatGPT-4o), with policy and scenario accuracy near ceiling (around 95%). Physics and hybrid reasoning remain the most challenging styles (physics top: 67.5%, hybrid top: 72.5%), with open models closing the gap on structured styles but trailing on composite and physics-grounded logic (Ferrag et al., 23 Jan 2026). Analysis of SCR vs. SAS reveals that frontier models are highly balanced, whereas advanced open models may exhibit high rule-following with lower quantitative acumen.

7. Implementation, Replication, and Research Extensions

The AgentDrive-MCQ generation and evaluation framework is fully automated, with agent pipelines (PDF extraction, scenario synthesis, MCQ generation, multi-criterion evaluation) configurable for other MCQ domains. The pipeline includes:

  • Versioned prompt templates and model hyperparameter logging.
  • Automated schema validation and incomplete-item rejection.
  • Full provenance recording (input, output, prompt, model, run).
  • Quality control via human spot checks (5–10% sampling).
  • Pseudocode and reproduction guidelines for adaptation across domains, including statistical equivalence gates and detailed audit logs (An, 21 Feb 2026).
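
As an illustration of the provenance and spot-check items in the list above, the sketch below pairs a minimal provenance record with a seeded sampler for human review. The field names, the default 5% fraction (the lower end of the 5–10% range), and the seeding scheme are assumptions made for this example rather than details from the cited pipeline.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-item provenance: input, prompt, model, and run."""
    item_id: str
    input_ref: str
    prompt_version: str
    model: str
    run_id: str


def spot_check_sample(item_ids, fraction=0.05, seed=0):
    """Pick a reproducible subset of item ids for human review.

    Ids are sorted first so the sample depends only on the set of ids and
    the seed, not on the order in which items were produced.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ids = sorted(item_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


ids = [f"mcq-{i:04d}" for i in range(200)]
review_batch = spot_check_sample(ids, fraction=0.05, seed=42)
```

Fixing the seed and logging it alongside the run identifier lets a later audit reconstruct exactly which items were reviewed.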

Every MCQ and evaluation result is traceable from simulation or original document source through prompt versions, agent calls, output artifacts, rubric scores, and statistical results. Weaknesses identified in generated MCQs (relative to expert-authored items) are concentrated in skill depth, cognitive engagement, and distractor plausibility, while accuracy and clarity criteria are consistently reliable.

This suggests that while the AgentDrive-MCQ pipeline is robust for broad reasoning assessment, closing the gap in conceptual depth and domain calibration remains an open research avenue.

References

  • "AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems" (Ferrag et al., 23 Jan 2026).
  • "Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation" (An, 21 Feb 2026).
