ScenePilot-Bench: Driving VLM Evaluation
- ScenePilot-Bench is a benchmark designed to evaluate vision-language models in first-person autonomous driving scenarios along multiple axes, including scene understanding and motion planning.
- The benchmark integrates a vast, globally diverse dataset with layered annotations to assess risk, spatial perception, and motion planning in real-world driving conditions.
- Its comprehensive evaluation protocol highlights capability gaps and generalization challenges, guiding future research on safe AI co-pilots for autonomous vehicles.
ScenePilot-Bench is a large-scale benchmark designed to advance evaluation of vision-language models (VLMs) within safety-critical, first-person autonomous driving scenarios. Built on the extensive ScenePilot-4K dataset, which encompasses nearly 4,000 hours of globally diverse driving video, ScenePilot-Bench introduces multi-granularity annotations and a comprehensive four-axis evaluation protocol targeting scene understanding, spatial perception, motion planning, and generative alignment. Its integrated metrics, region-robustness protocols, and empirical analyses provide detailed insights into both capability and generalization gaps for modern VLMs operating as driving co-pilots (Wang et al., 27 Jan 2026).
1. Dataset Design and Annotation Structure
ScenePilot-Bench is constructed from ScenePilot-4K, comprising 3,847 hours of first-person, monocular front-view driving video, segmented into non-overlapping 5 s clips (10 frames per clip) and sampled at 2 FPS, resulting in 27.7 million frames. Data spans 63 countries/regions, covering 1,210 cities with both right- (97.6%) and left-hand (2.4%) traffic paradigms.
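The reported frame count follows directly from the sampling parameters; a quick sanity check:

```python
# Sanity-check the reported dataset scale from the stated sampling parameters.
hours = 3847                    # total video duration (hours)
fps = 2                         # sampling rate (frames per second)
frames = hours * 3600 * fps     # total sampled frames: 27,698,400 ~= 27.7M
clips = frames // 10            # non-overlapping 5 s clips at 2 FPS = 10 frames each
```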
Camera intrinsics $K$ and per-frame extrinsics $[R_t \mid \mathbf{t}_t]$ are recovered with VGGT; pose estimation then allows computation of the ego-trajectory and camera-to-world mappings via $\mathbf{x}_w = R_t \mathbf{x}_c + \mathbf{t}_t$.
Annotations feature layered granularity:
- Scene Descriptions & Risk Assessment: Natural-language, multi-attribute scene descriptions with a discrete risk score $r \in \{1, \dots, 10\}$, grouped into low (1–3), medium (4–7), and high (8–10) risk levels.
- Key Participant Detection: Objects (vehicle, truck, bicycle, motorcycle, pedestrian) detected via YOLO11s, per-class detection score thresholds, with each detection storing ID, class, normalized bounding box.
- Ego-Trajectory and Camera Parameters: Per-frame pose, robust metric scale via depth-assisted 3D back-projection and ground estimation (see equations (6)–(9) in (Wang et al., 27 Jan 2026)).
- Foreground Masks: Segmentation derived from SAM with morphological post-processing.
- Spatial Quantities: Per-object ego-centric distances and angles $(d_i, \theta_i)$ and pairwise proximities $d_{ij}$.
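The layered annotation schema above could be represented as a per-clip record; the field names below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    # One key participant in a frame (normalized bounding box in [0, 1]).
    track_id: int
    cls: str          # e.g. "vehicle", "pedestrian"
    bbox: tuple       # (cx, cy, w, h), normalized

@dataclass
class ClipAnnotation:
    # Illustrative per-clip record mirroring the annotation layers above.
    description: str                                  # natural-language scene description
    risk_score: int                                   # discrete 1-10 risk score
    detections: list = field(default_factory=list)    # Detection objects
    ego_poses: list = field(default_factory=list)     # per-frame camera-to-world poses
    spatial: dict = field(default_factory=dict)       # track_id -> ego-centric (d, theta)

    @property
    def risk_level(self) -> str:
        # Grouping used by the benchmark: low 1-3, medium 4-7, high 8-10.
        return "low" if self.risk_score <= 3 else "medium" if self.risk_score <= 7 else "high"
```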
Statistical distributions across weather, time-of-day, road types, intersections, lanes, and risk levels document the scene variety (e.g., urban 60.3%, highway 19.0%; low-risk 49.5%, medium-risk 49.7%, high-risk 0.8%).
2. Four-Axis Evaluation Protocol
ScenePilot-Bench employs a four-axis evaluation protocol, integrating “higher-is-better” and “lower-is-better” metrics normalized on a [0,100] scale, with module- and sub-metric weighting for comprehensive, interpretable model ranking.
2.1. Scene Understanding
- SPICE Score (driving-refined): a semantic caption-quality metric, refined for the driving domain, scoring generated scene descriptions against reference annotations.
- Risk-Class Accuracy (Risk-Class-Acc): the fraction of clips whose predicted risk level (low/medium/high) matches the ground truth.
2.2. Spatial Perception
2.2.1. Object Classification
- Class Accuracy (Class-Acc): the fraction of referenced objects whose predicted class matches the annotated class.
2.2.2. Spatial Reasoning
- Mean Relative Distance Error to Ego (EMRDE): $\mathrm{EMRDE} = \frac{1}{N}\sum_{i=1}^{N} \frac{|\hat{d}_i - d_i|}{d_i}$ over predicted ego-centric distances $\hat{d}_i$.
- Mean Relative Angle Error to Ego (EMRAE): the analogous relative error over predicted ego-centric angles $\hat{\theta}_i$.
- Object-centric metrics (OMRDE, OMRAE): replace the ego-centric quantities $(d_i, \theta_i)$ with pairwise quantities $(d_{ij}, \theta_{ij})$.
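The two ego-centric errors can be sketched as follows; the angle normalization by $\pi$ is an assumption, since the paper's exact definition is not reproduced here:

```python
import math

def emrde(pred_d, true_d):
    """Mean relative distance error to ego: mean of |d_hat - d| / d."""
    return sum(abs(p - t) / t for p, t in zip(pred_d, true_d)) / len(true_d)

def emrae(pred_theta, true_theta, scale=math.pi):
    """Mean relative angle error to ego (sketch).
    Angle differences are wrapped to (-pi, pi]; normalizing by pi (an
    assumption) keeps the error bounded even when the true angle is ~0."""
    def wrap(a):
        return (a + math.pi) % (2 * math.pi) - math.pi
    return sum(abs(wrap(p - t)) / scale
               for p, t in zip(pred_theta, true_theta)) / len(true_theta)
```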
2.3. Motion Planning
2.3.1. Meta-Action Prediction
- Action Classes: Accelerate, Brake, ConstSpeed, LeftTurn, RightTurn, GoStraight, assigned from ego kinematics via longitudinal-acceleration (m/s²) and heading-change thresholds.
- Direction Consistency Accuracy (DCS-Acc): the fraction of clips whose predicted meta-action direction matches the ground-truth class.
- Mean Relative Acceleration Error (MRE-Acc): the mean relative error between predicted and ground-truth acceleration magnitudes.
- Angular Relative Error (ARE): the relative error between predicted and ground-truth heading-change angles.
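A plausible reading of the six classes is two independent labels, longitudinal and lateral; the thresholds below are placeholders, not the paper's values:

```python
def meta_action(accel, yaw_rate, a_thr=0.5, w_thr=0.1):
    """Assign longitudinal and lateral meta-action labels from ego kinematics.

    accel: longitudinal acceleration (m/s^2); yaw_rate: heading change (rad/s).
    a_thr and w_thr are illustrative thresholds, not the benchmark's actual
    values, and the two-head split is an assumption about the six classes."""
    lon = ("Accelerate" if accel > a_thr
           else "Brake" if accel < -a_thr
           else "ConstSpeed")
    lat = ("LeftTurn" if yaw_rate > w_thr
           else "RightTurn" if yaw_rate < -w_thr
           else "GoStraight")
    return lon, lat
```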
2.3.2. Trajectory Planning
- Average Displacement Error (ADE): $\mathrm{ADE} = \frac{1}{T}\sum_{t=1}^{T}\lVert \hat{\mathbf{p}}_t - \mathbf{p}_t \rVert_2$, the mean L2 distance between predicted and ground-truth waypoints over the horizon.
- Final Displacement Error at Horizon T (FDE@T): $\mathrm{FDE@}T = \lVert \hat{\mathbf{p}}_T - \mathbf{p}_T \rVert_2$, the L2 distance at the final waypoint.
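These two standard trajectory metrics are straightforward to compute from waypoint lists:

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over the horizon."""
    assert len(pred) == len(gt) and gt
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final Displacement Error: L2 distance at the last waypoint."""
    return math.dist(pred[-1], gt[-1])
```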
2.4. GPT-Score
- Semantic Alignment Assessment: GPT-4o assigns a score in [0,1], rescaled to [0,100].
2.5. Normalization and Weighting
Non-error ("higher-is-better") metrics are reported directly on the [0,100] scale; error ("lower-is-better") metrics $e$, e.g., EMRDE and ADE, are normalized via a piecewise function of $e$ with dataset-derived thresholds $\tau$. The overall benchmark score is a weighted sum:
- Scene Understanding: 15%
- Spatial Perception: 35%
- Motion Planning: 40%
- GPT-Score: 10%
Intra-module weights further apportion each axis among its sub-metrics (e.g., SPICE contributes 70% of Scene Understanding).
3. Generalization and Robustness Protocols
ScenePilot-Bench establishes cross-region and traffic-system robustness protocols:
- Leave-One-Country-Out (LOCO): Train on all but one country, test on held-out.
- Right-to-Left (R→L): Train on right-hand traffic, test on left-hand.
Performance drop on out-of-domain splits quantifies generalization capability and sensitivity to local traffic conventions.
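Constructing the LOCO splits amounts to iterating over held-out countries; the `(country, clip_id)` sample representation here is illustrative:

```python
def loco_splits(samples):
    """Leave-One-Country-Out: yield (held_out_country, train, test) triples.

    `samples` is an iterable of (country, clip_id) pairs; the pair layout
    is an illustrative stand-in for the benchmark's actual split manifest."""
    samples = list(samples)
    countries = sorted({c for c, _ in samples})
    for held in countries:
        train = [s for s in samples if s[0] != held]
        test = [s for s in samples if s[0] == held]
        yield held, train, test
```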
4. Empirical Benchmarking of Vision-LLMs
ScenePilot-Bench evaluates multiple classes of VLMs:
- Large general-purpose models: GPT-4o, GPT-5, Qwen2.5-VL-72B, Gemini-2.5-flash, Qwen3-VL-235B.
- Driving-specialized models: ReasonDrive-7B (including chain-of-thought and fine-tuned variants).
- ScenePilot models: ScenePilot-2 (2B parameters) and ScenePilot-2.5 (3B parameters), fine-tuned on 200k VQA samples.
Representative results (scores on the normalized [0,100] scale):
| Model | Scene Semantics | Spatial Perception | Motion Planning | Overall Score |
|---|---|---|---|---|
| GPT-4o (general VLM) | ≈ 92 | ≈ 45–50 | ≈ 20–35 | low–mid |
| ReasonDrive-7B | 70–80 | ≈ 58–60 | 76–77 | moderate |
| ScenePilot-2.5 (3B) | 89.2 | 74.5 | 54.4 | 65.37 |
General-purpose VLMs excel in scene understanding (SPICE) but underperform on spatial perception and motion planning, evidencing a semantic–embodiment gap. Specialized models trained on ScenePilot-4K achieve higher scores in spatial and planning modules.
5. Observed Challenges and Research Directions
Empirical analysis reveals:
- Generalization Gaps: Scene semantics transfer robustly; motion planning degrades under LOCO and R→L adaptation.
- Risk Reasoning: Remains sensitive to local driving rules and culture-dependent conventions.
- Limitations: handling rare, high-risk events with few exemplars; ensuring rule compliance under region-specific regulations; and bridging open-loop VLM evaluation to closed-loop, simulated or physical control, which remains an open problem.
Potential research avenues include integration of explicit traffic-rule encodings, expanded annotation of rare/safety-critical events (e.g., near-collisions), and leveraging ScenePilot-Bench in simulation frameworks for closed-loop, end-to-end controller evaluation.
6. Significance and Prospective Impact
ScenePilot-Bench establishes a unified, region-diverse, finely annotated, and systematically benchmarked dataset for holistic evaluation of VLMs in driving. Its four-axis protocol, rigorous normalization, and cross-domain generalization settings make it a salient resource for measuring and improving the embodied reasoning, spatial awareness, and safety understanding of AI co-pilots in autonomous driving. Its empirical analyses and reported performance gaps set clear directions for future research and provide the substrate for methodical advancement in both open-domain and specialized, safety-aware vision-language intelligence (Wang et al., 27 Jan 2026).