AgentDrive Benchmark Suite for Autonomous AI
- AgentDrive Benchmark Suite is an open framework featuring 300,000 LLM-generated, factorized driving scenarios paired with rigorous simulations.
- It employs a seven-axis factorized scenario space and a prompt-to-JSON pipeline to ensure diverse and reproducible simulation scenarios.
- The suite integrates surrogate safety metrics and an MCQ evaluation system to assess LLM performance in physics, policy, and risk assessment.
AgentDrive is an open benchmark suite designed to advance the evaluation and training of agentic artificial intelligence in autonomous driving contexts, particularly targeting LLM-based reasoning. It provides a structured, large-scale dataset of 300,000 LLM-generated driving scenarios, each paired with physical simulations, surrogate safety metrics, rule-based outcome labeling, and an associated multiple-choice benchmark focused on diverse reasoning skills. The suite aims to fill a gap in rigorous, safety-critical benchmarking essential for the research and development of AI reasoning in autonomous systems (Ferrag et al., 23 Jan 2026).
1. Factorized Scenario Space
A core architectural feature of AgentDrive is the formalization of driving scenarios as factorized tuples along seven orthogonal axes. Each scenario is a 7-tuple with one component per axis. The axes and example values are:
| Axis | Interpretation/Examples |
|---|---|
| Scenario type | lane change, braking failure, intersection crossing |
| Driver behavior | aggressive, compliant, impaired, convoy-following |
| Environment | clear, fog, rain, night, sandstorm |
| Road layout/topology | roundabout, cloverleaf, mountain pass, urban intersection |
| Ego-vehicle objective | safe navigation, overtake, U-turn, emergency stop |
| Difficulty level | easy, medium, hard (with associated constraints) |
| Traffic density | low, medium, high |
The full scenario space is defined as the Cartesian product of the seven axis sets, so the total number of distinct scenarios equals the product of the axis cardinalities. The cardinalities are tuned so that this product matches the 300,000-scenario target: a nominal choice of cardinalities lands near that figure, and adjustments make the count exact. Entropy-maximized sampling ensures diverse, balanced coverage. Each axis is treated independently, enabling stress-testing along isolated or compound scenario dimensions (e.g., an impaired driver in fog on a cloverleaf at high density) (Ferrag et al., 23 Jan 2026).
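The product structure can be illustrated with a minimal Python sketch. The axis values below are only the examples listed in the table, not the benchmark's full axis sets, so the resulting count is far smaller than 300,000. The names `AXES`, `scenario_space`, `space_size`, and `sample_scenario` are hypothetical, and the independent uniform draw is a simple stand-in for the source's entropy-maximized sampling, whose details are not given here.

```python
import itertools
import math
import random

# Illustrative values copied from the table's examples only; the real
# AgentDrive axis sets are larger, so this is NOT the benchmark's space.
AXES = {
    "scenario_type": ["lane change", "braking failure", "intersection crossing"],
    "driver_behavior": ["aggressive", "compliant", "impaired", "convoy-following"],
    "environment": ["clear", "fog", "rain", "night", "sandstorm"],
    "road_layout": ["roundabout", "cloverleaf", "mountain pass", "urban intersection"],
    "ego_objective": ["safe navigation", "overtake", "U-turn", "emergency stop"],
    "difficulty": ["easy", "medium", "hard"],
    "traffic_density": ["low", "medium", "high"],
}


def scenario_space(axes):
    """Yield every 7-tuple of the Cartesian product as a dict keyed by axis name."""
    names = list(axes)
    for combo in itertools.product(*(axes[name] for name in names)):
        yield dict(zip(names, combo))


def space_size(axes):
    """Size of the product space: the product of the axis cardinalities."""
    return math.prod(len(values) for values in axes.values())


def sample_scenario(axes, rng):
    """Draw each axis independently and uniformly (assumed stand-in sampler)."""
    return {name: rng.choice(values) for name, values in axes.items()}


total = space_size(AXES)  # 3 * 4 * 5 * 4 * 4 * 3 * 3 = 8640
example = sample_scenario(AXES, random.Random(0))
print(total, example)
```

Because each axis is drawn independently, fixing one axis (say, `environment="fog"`) and sampling the rest gives the kind of isolated stress test described above.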
2. LLM-Driven Prompt-to-JSON Scenario Generation
AgentDrive adopts a prompt-to-JSON pipeline leveraging LLMs for semantically rich, simulation-ready scenario specification. The process comprises:
- Prompt Construction: For each scenario tuple and its difficulty mapping, a natural-language encoding of the tuple is constructed and augmented with the numeric constraints that the difficulty level implies. Together these form the composite prompt.
- LLM Generation: An LLM generator takes the composite prompt and produces a candidate JSON scenario.
- Schema Validation and Repair: Each candidate JSON is validated against the scenario schema. If it is invalid, a repair module is applied for up to a fixed number of retries, yielding a validated sample.
Technical and domain constraints ensure simulation correctness, including velocity and acceleration bounds, minimum spawn distances before stop lines, step count consistency with traffic light phases, and schema compliance for road topology, agent definitions, and event fields. Outputs comprise a complete, schema-conforming JSON description for each scenario, suitable for direct ingestion by the downstream simulation pipeline (Ferrag et al., 23 Jan 2026).
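The generate, validate, and repair steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the AgentDrive implementation: the required keys, the speed bound, the default retry limit, and the `generate`/`repair` callables are all hypothetical, and the stubs stand in for real LLM calls.

```python
import json

REQUIRED_KEYS = {"road", "agents", "events"}  # hypothetical schema fields
MAX_SPEED = 40.0  # hypothetical velocity bound, m/s


def validate(text):
    """Return a list of problems; an empty list means the sample is valid."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(data))]
    agents = data.get("agents", [])
    if not isinstance(agents, list):
        return problems + ["agents must be a list"]
    for agent in agents:
        speed = agent.get("speed") if isinstance(agent, dict) else None
        if not isinstance(speed, (int, float)) or not 0.0 <= speed <= MAX_SPEED:
            problems.append(f"agent speed missing or out of bounds: {speed!r}")
    return problems


def generate_validated(prompt, generate, repair, max_retries=3):
    """Generate a candidate, repairing it until it validates or retries run out."""
    candidate = generate(prompt)
    for attempt in range(max_retries + 1):
        problems = validate(candidate)
        if not problems:
            return json.loads(candidate)
        if attempt < max_retries:
            candidate = repair(candidate, problems)
    return None  # the caller decides whether to discard or regenerate


# Stubs standing in for the LLM generator and the repair module.
def stub_generate(prompt):
    return '{"road": {}, "agents": [{"speed": 99}]}'


def stub_repair(text, problems):
    return '{"road": {}, "agents": [{"speed": 20}], "events": []}'


scenario = generate_validated("hard, fog, cloverleaf", stub_generate, stub_repair)
print(scenario)
```

Passing the list of problems to the repair step lets a repair prompt quote the exact violations, which is one plausible way to keep retries targeted.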
3. Simulation Rollouts and Surrogate Safety Metrics
Each validated scenario is executed in the highway-env simulator, producing a trajectory rollout whose states record ego-vehicle and environment kinematics at each step. The principal surrogate safety metric is the scenario-minimum time-to-collision (TTC) with the lead vehicle, that is, the smallest TTC observed over the rollout. Two safety thresholds are enforced. Rule-based checks detect collisions, red-light violations, and correct stopping/crossing behaviors, and a final outcome label is then assigned to each rollout by rule.
The full simulation corpus contains one entry per scenario, holding the scenario axes, JSON file, simulated trajectory, and label. This enables systematic analysis of policy robustness and safety-critical failure modes (Ferrag et al., 23 Jan 2026).
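As an illustration of the scenario-minimum TTC, the sketch below uses the standard constant-speed definition: gap divided by closing speed, and infinite when the ego vehicle is not closing on the lead. The rollout format, the single 1.5 s threshold, and the `safe`/`unsafe` label names are assumptions made for the example; the source's exact formula, thresholds, and label set are not reproduced here.

```python
import math


def time_to_collision(gap_m, ego_speed, lead_speed):
    """TTC in seconds under constant speeds; infinite if the ego is not closing."""
    closing_speed = ego_speed - lead_speed
    if closing_speed <= 0.0:
        return math.inf
    return max(gap_m, 0.0) / closing_speed


def min_ttc(rollout):
    """Scenario-minimum TTC over (gap_m, ego_speed, lead_speed) samples."""
    return min((time_to_collision(*step) for step in rollout), default=math.inf)


def label_outcome(rollout, collided, ran_red_light, ttc_unsafe_s=1.5):
    """Hypothetical rule: hard violations first, then the TTC threshold."""
    if collided or ran_red_light:
        return "unsafe"
    return "unsafe" if min_ttc(rollout) < ttc_unsafe_s else "safe"


# Three time steps: gap (m), ego speed (m/s), lead speed (m/s).
rollout = [(30.0, 20.0, 15.0), (20.0, 20.0, 12.0), (18.0, 10.0, 12.0)]
print(min_ttc(rollout))                      # 2.5, reached at the second step
print(label_outcome(rollout, False, False))  # safe under the assumed threshold
```

Taking the minimum over the rollout means a single close approach dominates the metric, which suits a safety label better than an average would.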
4. AgentDrive-MCQ: LLM Reasoning Benchmark
AgentDrive-MCQ augments the dataset with a structured, multiple-choice evaluation suite targeting model reasoning spanning five dimensions:
- Physics: Quantitative reasoning over vehicle kinematics (e.g., explicit TTC calculation).
- Policy: Normative reasoning over traffic laws and safety protocols.
- Hybrid: Compositional reasoning blending physics-based calculations and policy constraints.
- Scenario-interpretive: Qualitative risk and hazard analysis.
- Comparative: Multi-alternative selection of safest actions.
For each selected scenario, five MCQs are generated, one per style, which yields the 100,000 questions in the release. The generation pipeline first has an LLM write a 10–12 sentence scenario description, then runs constrained prompt-to-MCQ generation per style (physics, policy, hybrid, scenario, comparative) with enforced answer/rationale completeness and type checks. Every MCQ specifies a succinct question (≤25 words), four options, the correct index, and a rationale, all in a strict JSON schema with meta fields for scenario and traceability.
This design permits fine-grained, style-specific assessment of LLM and agentic reasoning in policy, ethical, and physical driving contexts (Ferrag et al., 23 Jan 2026).
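The per-question constraints described above (question of at most 25 words, four options, a correct index, a rationale, and a style) can be checked with a small validator. The field names and the zero-based index convention below are assumptions for illustration; the released schema may name or encode these fields differently.

```python
STYLES = {"physics", "policy", "hybrid", "scenario", "comparative"}


def validate_mcq(record):
    """Return a list of constraint violations; an empty list means the MCQ passes."""
    errors = []
    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        errors.append("question must be a non-empty string")
    elif len(question.split()) > 25:
        errors.append("question exceeds 25 words")
    options = record.get("options")
    if not (isinstance(options, list) and len(options) == 4
            and all(isinstance(o, str) and o.strip() for o in options)):
        errors.append("options must be four non-empty strings")
    index = record.get("correct_index")
    if not (isinstance(index, int) and not isinstance(index, bool) and 0 <= index < 4):
        errors.append("correct_index must be an integer in 0..3")
    rationale = record.get("rationale")
    if not (isinstance(rationale, str) and rationale.strip()):
        errors.append("rationale must be a non-empty string")
    style = record.get("style")
    if not isinstance(style, str) or style not in STYLES:
        errors.append("style must be one of the five MCQ styles")
    return errors


# A made-up physics-style item; 30 m gap at 5 m/s closing speed gives 6 s.
mcq = {
    "style": "physics",
    "question": "Ego travels at 20 m/s, 30 m behind a lead vehicle at 15 m/s. "
                "What is the time-to-collision?",
    "options": ["3 s", "6 s", "1.5 s", "10 s"],
    "correct_index": 1,  # zero-based, by assumption
    "rationale": "Closing speed is 5 m/s, so TTC = 30 / 5 = 6 s.",
}
print(validate_mcq(mcq))  # []
```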
5. Evaluation Protocols and Model Comparison
AgentDrive supports benchmarking across proprietary and open LLMs. Fifty leading models were evaluated on a sampled subset of AgentDrive-MCQ. For each model and question style, accuracy is the fraction of that style's questions answered correctly. Overall accuracy is the unweighted mean across the five styles.
Two composite metrics are defined:
- Safety Compliance Rate (SCR): measures rule-based and qualitative safety alignment.
- Situational Awareness Score (SAS): scores numerical and decision-selection competence.
Key empirical findings:
| Model | Comparative | Hybrid | Physics | Policy | Scenario | Overall |
|---|---|---|---|---|---|---|
| ChatGPT 4o | 90.0 | 72.5 | 55.0 | 100 | 95.0 | 82.5 |
| GPT-5 Chat | 92.5 | 70.0 | 50.0 | 100 | 92.5 | 81.0 |
| Qwen3 235B A22B | 92.5 | 60.0 | 67.5 | 87.5 | 97.5 | 81.0 |
| ERNIE 4.5 300B | 85.0 | 45.0 | 52.5 | 95.0 | 97.5 | 75.0 |
| Mistral Medium 3.1 | 95.0 | 60.0 | 52.5 | 97.5 | 95.0 | 80.0 |
- Proprietary models excel in policy and scenario reasoning; ChatGPT 4o, for example, scores 100 on policy and 95.0 on scenario.
- Open models approach parity in physics and hybrid, with Qwen3 235B scoring 67.5 in physics.
- Hybrid questions remain the most challenging (no model in the table above exceeds 72.5), reflecting the difficulty of joint numeric-symbolic reasoning.
- Comparative reasoning is high (90.0 or above for four of the five models shown), signifying the effectiveness of instruction tuning on textual choice tasks.
SCR and SAS exhibit strong positive correlation, suggesting situational awareness advances are closely linked with safety compliance capabilities in current LLMs (Ferrag et al., 23 Jan 2026).
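The Overall column in the table above can be reproduced from the per-style columns as an unweighted mean. The sketch below copies the reported per-style accuracies and recomputes that mean; the helper names are hypothetical, and SCR and SAS are deliberately left out because their formulas are not given in this text.

```python
STYLES = ("Comparative", "Hybrid", "Physics", "Policy", "Scenario")

# Per-style accuracies (%) copied from the table above, in STYLES order.
RESULTS = {
    "ChatGPT 4o": (90.0, 72.5, 55.0, 100.0, 95.0),
    "GPT-5 Chat": (92.5, 70.0, 50.0, 100.0, 92.5),
    "Qwen3 235B A22B": (92.5, 60.0, 67.5, 87.5, 97.5),
    "ERNIE 4.5 300B": (85.0, 45.0, 52.5, 95.0, 97.5),
    "Mistral Medium 3.1": (95.0, 60.0, 52.5, 97.5, 95.0),
}


def style_accuracy(predicted, gold):
    """Percentage of questions of one style answered correctly."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def overall_accuracy(per_style):
    """Unweighted mean of the per-style accuracies."""
    return sum(per_style) / len(per_style)


overall = {model: overall_accuracy(scores) for model, scores in RESULTS.items()}
for model, value in overall.items():
    print(f"{model}: {value:.1f}")
```

The recomputed means match the Overall column (82.5, 81.0, 81.0, 75.0, 80.0), which confirms that each style carries equal weight regardless of how many questions it contains.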
6. Dataset, Code Availability, and Reproducibility
The entire AgentDrive benchmark suite is openly released at https://github.com/maferrag/AgentDrive under an open-source license. The release includes:
- AgentDrive-Gen & AgentDrive-Sim: 300,000 JSON scenario files (conforming to the released schema), corresponding simulation logs and surrogate safety metrics, and full outcome labels.
- AgentDrive-MCQ: 100,000 MCQs with scenario metadata, question, options, gold answer, rationale.
- Evaluation scripts: Colab-ready Python notebooks for parsing, simulation, surrogate metric computation (including TTC, headway, rule-based labeling), LLM question answering, and benchmark aggregation (Accuracy, SCR, SAS).
- Benchmarks and logs: CSV-format summaries of all model runs, and plotting scripts for detailed performance visualization.
Schema definitions, sample prompts, and workflow documentation are provided to ensure experimental reproducibility and encourage downstream research in generative scenario generation, cognitive and ethical evaluation of agentic models, and simulation-based safety assessment for autonomous systems (Ferrag et al., 23 Jan 2026).