ScenePilot-Bench: Driving VLM Evaluation
- ScenePilot-Bench is a benchmark designed to evaluate vision-language models in first-person autonomous driving scenarios along multiple axes, including scene understanding and motion planning.
- The benchmark integrates a vast, globally diverse dataset with layered annotations to assess risk, spatial perception, and motion planning in real-world driving conditions.
- Its comprehensive evaluation protocol highlights capability gaps and generalization challenges, guiding future research on safe AI co-pilots for autonomous vehicles.
ScenePilot-Bench is a large-scale benchmark designed to advance evaluation of vision-language models (VLMs) within safety-critical, first-person autonomous driving scenarios. Built on the extensive ScenePilot-4K dataset, which encompasses nearly 4,000 hours of globally diverse driving video, ScenePilot-Bench introduces multi-granularity annotations and a comprehensive four-axis evaluation protocol targeting scene understanding, spatial perception, motion planning, and generative alignment. Its integrated metrics, region-robustness protocols, and empirical analyses provide detailed insights into both capability and generalization gaps for modern VLMs operating as driving co-pilots (Wang et al., 27 Jan 2026).
1. Dataset Design and Annotation Structure
ScenePilot-Bench is constructed from ScenePilot-4K, comprising 3,847 hours of first-person, monocular front-view driving video, segmented into non-overlapping 5 s clips (10 frames per clip) and sampled at 2 FPS, resulting in 27.7 million frames. Data spans 63 countries/regions, covering 1,210 cities with both right- (97.6%) and left-hand (2.4%) traffic paradigms.
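The reported frame count follows directly from the sampling parameters; a quick sanity check:

```python
# Sanity-check the reported dataset scale from the stated sampling parameters.
hours = 3847                    # total video duration (hours)
fps = 2                         # sampling rate (frames per second)
frames = hours * 3600 * fps     # total sampled frames: 27,698,400 ~= 27.7M
clips = frames // 10            # non-overlapping 5 s clips at 2 FPS = 10 frames each
```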
Camera intrinsics $K$ and per-frame extrinsics $[R_t \mid \mathbf{t}_t]$ are recovered with VGGT; pose estimation then allows computation of the ego-trajectory and camera-to-world mappings via $\mathbf{x}_w = R_t \mathbf{x}_c + \mathbf{t}_t$.
Annotations feature layered granularity:
- Scene Descriptions & Risk Assessment: Natural-language, multi-attribute scene descriptions with a discrete risk score $r \in \{1, \dots, 10\}$, grouped into low (1–3), medium (4–7), and high (8–10) risk levels.
- Key Participant Detection: Objects (vehicle, truck, bicycle, motorcycle, pedestrian) detected via YOLO11s, per-class detection score thresholds, with each detection storing ID, class, normalized bounding box.
- Ego-Trajectory and Camera Parameters: Per-frame pose, robust metric scale via depth-assisted 3D back-projection and ground estimation (see equations (6)–(9) in (Wang et al., 27 Jan 2026)).
- Foreground Masks: Segmentation derived from SAM with morphological post-processing.
- Spatial Quantities: Per-object ego-centric distances and angles $(d_i, \theta_i)$ and pairwise proximities $d_{ij}$.
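The layered annotation schema above could be represented as a per-clip record; the field names below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    # One key participant in a frame (normalized bounding box in [0, 1]).
    track_id: int
    cls: str          # e.g. "vehicle", "pedestrian"
    bbox: tuple       # (cx, cy, w, h), normalized

@dataclass
class ClipAnnotation:
    # Illustrative per-clip record mirroring the annotation layers above.
    description: str                                  # natural-language scene description
    risk_score: int                                   # discrete 1-10 risk score
    detections: list = field(default_factory=list)    # Detection objects
    ego_poses: list = field(default_factory=list)     # per-frame camera-to-world poses
    spatial: dict = field(default_factory=dict)       # track_id -> ego-centric (d, theta)

    @property
    def risk_level(self) -> str:
        # Grouping used by the benchmark: low 1-3, medium 4-7, high 8-10.
        return "low" if self.risk_score <= 3 else "medium" if self.risk_score <= 7 else "high"
```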
Statistical distributions across weather, time-of-day, road types, intersections, lanes, and risk levels document the scene variety (e.g., urban 60.3%, highway 19.0%; low-risk 49.5%, medium-risk 49.7%, high-risk 0.8%).
2. Four-Axis Evaluation Protocol
ScenePilot-Bench employs a four-axis evaluation protocol, integrating “higher-is-better” and “lower-is-better” metrics normalized on a [0,100] scale, with module- and sub-metric weighting for comprehensive, interpretable model ranking.
2.1. Scene Understanding
- SPICE Score (driving-refined): a semantic caption-quality metric, refined for the driving domain, scoring generated scene descriptions against reference annotations.
- Risk-Class Accuracy (Risk-Class-Acc): the fraction of clips whose predicted risk level (low/medium/high) matches the ground truth.
2.2. Spatial Perception
2.2.1. Object Classification
- Class Accuracy (Class-Acc): the fraction of referenced objects whose predicted class matches the annotated class.
2.2.2. Spatial Reasoning
- Mean Relative Distance Error to Ego (EMRDE): $\mathrm{EMRDE} = \frac{1}{N}\sum_{i=1}^{N} \frac{|\hat{d}_i - d_i|}{d_i}$ over predicted ego-centric distances $\hat{d}_i$.
- Mean Relative Angle Error to Ego (EMRAE): the analogous relative error over predicted ego-centric angles $\hat{\theta}_i$.
- Object-centric metrics (OMRDE, OMRAE): replace the ego-centric quantities $(d_i, \theta_i)$ with pairwise quantities $(d_{ij}, \theta_{ij})$.
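The two ego-centric errors can be sketched as follows; the angle normalization by $\pi$ is an assumption, since the paper's exact definition is not reproduced here:

```python
import math

def emrde(pred_d, true_d):
    """Mean relative distance error to ego: mean of |d_hat - d| / d."""
    return sum(abs(p - t) / t for p, t in zip(pred_d, true_d)) / len(true_d)

def emrae(pred_theta, true_theta, scale=math.pi):
    """Mean relative angle error to ego (sketch).
    Angle differences are wrapped to (-pi, pi]; normalizing by pi (an
    assumption) keeps the error bounded even when the true angle is ~0."""
    def wrap(a):
        return (a + math.pi) % (2 * math.pi) - math.pi
    return sum(abs(wrap(p - t)) / scale
               for p, t in zip(pred_theta, true_theta)) / len(true_theta)
```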
2.3. Motion Planning
2.3.1. Meta-Action Prediction
- Action Classes: Accelerate, Brake, ConstSpeed, LeftTurn, RightTurn, GoStraight, assigned from ego kinematics via longitudinal-acceleration (m/s²) and heading-change thresholds.
- Direction Consistency Accuracy (DCS-Acc): the fraction of clips whose predicted meta-action direction matches the ground-truth class.
- Mean Relative Acceleration Error (MRE-Acc): the mean relative error between predicted and ground-truth acceleration magnitudes.
- Angular Relative Error (ARE): the relative error between predicted and ground-truth heading-change angles.
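A plausible reading of the six classes is two independent labels, longitudinal and lateral; the thresholds below are placeholders, not the paper's values:

```python
def meta_action(accel, yaw_rate, a_thr=0.5, w_thr=0.1):
    """Assign longitudinal and lateral meta-action labels from ego kinematics.

    accel: longitudinal acceleration (m/s^2); yaw_rate: heading change (rad/s).
    a_thr and w_thr are illustrative thresholds, not the benchmark's actual
    values, and the two-head split is an assumption about the six classes."""
    lon = ("Accelerate" if accel > a_thr
           else "Brake" if accel < -a_thr
           else "ConstSpeed")
    lat = ("LeftTurn" if yaw_rate > w_thr
           else "RightTurn" if yaw_rate < -w_thr
           else "GoStraight")
    return lon, lat
```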
2.3.2. Trajectory Planning
- Average Displacement Error (ADE): $\mathrm{ADE} = \frac{1}{T}\sum_{t=1}^{T}\lVert \hat{\mathbf{p}}_t - \mathbf{p}_t \rVert_2$, the mean L2 distance between predicted and ground-truth waypoints over the horizon.
- Final Displacement Error at Horizon T (FDE@T): $\mathrm{FDE@}T = \lVert \hat{\mathbf{p}}_T - \mathbf{p}_T \rVert_2$, the L2 distance at the final waypoint.
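These two standard trajectory metrics are straightforward to compute from waypoint lists:

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over the horizon."""
    assert len(pred) == len(gt) and gt
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final Displacement Error: L2 distance at the last waypoint."""
    return math.dist(pred[-1], gt[-1])
```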
2.4. GPT-Score
- Semantic Alignment Assessment: GPT-4o assigns a score in [0,1], rescaled to [0,100].
2.5. Normalization and Weighting
Non-error ("higher-is-better") metrics are reported directly on the [0,100] scale; error ("lower-is-better") metrics $e$, e.g., EMRDE and ADE, are normalized via a piecewise function of $e$ with dataset-derived thresholds $\tau$. The overall benchmark score is a weighted sum:
- Scene Understanding: 15%
- Spatial Perception: 35%
- Motion Planning: 40%
- GPT-Score: 10%
Intra-module weights further apportion each axis among its sub-metrics (e.g., SPICE contributes 70% of Scene Understanding).
3. Generalization and Robustness Protocols
ScenePilot-Bench establishes cross-region and traffic-system robustness protocols:
- Leave-One-Country-Out (LOCO): Train on all but one country, test on held-out.
- Right-to-Left (R→L): Train on right-hand traffic, test on left-hand.
Performance drop on out-of-domain splits quantifies generalization capability and sensitivity to local traffic conventions.
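Constructing the LOCO splits amounts to iterating over held-out countries; the `(country, clip_id)` sample representation here is illustrative:

```python
def loco_splits(samples):
    """Leave-One-Country-Out: yield (held_out_country, train, test) triples.

    `samples` is an iterable of (country, clip_id) pairs; the pair layout
    is an illustrative stand-in for the benchmark's actual split manifest."""
    samples = list(samples)
    countries = sorted({c for c, _ in samples})
    for held in countries:
        train = [s for s in samples if s[0] != held]
        test = [s for s in samples if s[0] == held]
        yield held, train, test
```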
4. Empirical Benchmarking of Vision-LLMs
ScenePilot-Bench evaluates multiple classes of VLMs:
- Large general-purpose models: GPT-4o, GPT-5, Qwen2.5-VL-72B, Gemini-2.5-flash, Qwen3-VL-235B.
- Driving-specialized models: ReasonDrive-7B (including chain-of-thought and fine-tuned variants).
- ScenePilot models: ScenePilot-2 (2B parameters) and ScenePilot-2.5 (3B parameters), fine-tuned on 200k VQA samples.
Representative results (scores on the normalized [0,100] scale):
| Model | Scene Semantics | Spatial Perception | Motion Planning | Overall Score |
|---|---|---|---|---|
| GPT-4o (general VLM) | ≈ 92 | ≈ 45–50 | ≈ 20–35 | low–mid |
| ReasonDrive-7B | 70–80 | ≈ 58–60 | 76–77 | moderate |
| ScenePilot-2.5 (3B) | 89.2 | 74.5 | 54.4 | 65.37 |
General-purpose VLMs excel in scene understanding (SPICE) but underperform on spatial perception and motion planning, evidencing a semantic–embodiment gap. Specialized models trained on ScenePilot-4K achieve higher scores in spatial and planning modules.
5. Observed Challenges and Research Directions
Empirical analysis reveals:
- Generalization Gaps: Scene semantics transfer robustly; motion planning degrades under LOCO and R→L adaptation.
- Risk Reasoning: Remains sensitive to local driving rules and culture-dependent conventions.
- Limitations: handling rare, high-risk events with few exemplars; ensuring rule compliance under region-specific regulations; and bridging open-loop VLM evaluation to closed-loop, simulated or physical control, which remains an open problem.
Potential research avenues include integration of explicit traffic-rule encodings, expanded annotation of rare/safety-critical events (e.g., near-collisions), and leveraging ScenePilot-Bench in simulation frameworks for closed-loop, end-to-end controller evaluation.
6. Significance and Prospective Impact
ScenePilot-Bench establishes a unified, region-diverse, finely annotated, and systematically benchmarked dataset for holistic evaluation of VLMs in driving. Its four-axis protocol, rigorous normalization, and cross-domain generalization settings make it a salient resource for measuring and improving the embodied reasoning, spatial awareness, and safety understanding of AI co-pilots in autonomous driving. Its empirical analyses and reported performance gaps set clear directions for future research and provide the substrate for methodical advancement in both open-domain and specialized, safety-aware vision-language intelligence (Wang et al., 27 Jan 2026).