
ScenePilot-Bench: Driving VLM Evaluation

Updated 3 February 2026
  • ScenePilot-Bench is a benchmark designed to evaluate vision-language models in first-person autonomous driving scenarios using multi-axis metrics like scene understanding and motion planning.
  • The benchmark integrates a vast, globally diverse dataset with layered annotations to assess risk, spatial perception, and motion planning in real-world driving conditions.
  • Its comprehensive evaluation protocol highlights capability gaps and generalization challenges, guiding future research on safe AI co-pilots for autonomous vehicles.

ScenePilot-Bench is a large-scale benchmark designed to advance evaluation of vision-language models (VLMs) within safety-critical, first-person autonomous driving scenarios. Built on the extensive ScenePilot-4K dataset, which encompasses nearly 4,000 hours of globally diverse driving video, ScenePilot-Bench introduces multi-granularity annotations and a comprehensive four-axis evaluation protocol targeting scene understanding, spatial perception, motion planning, and generative alignment. Its integrated metrics, region-robustness protocols, and empirical analyses provide detailed insights into both capability and generalization gaps for modern VLMs operating as driving co-pilots (Wang et al., 27 Jan 2026).

1. Dataset Design and Annotation Structure

ScenePilot-Bench is constructed from ScenePilot-4K, comprising 3,847 hours of first-person, monocular front-view driving video, segmented into non-overlapping 5 s clips (10 frames per clip) and sampled at 2 FPS, resulting in 27.7 million frames. Data spans 63 countries/regions, covering 1,210 cities with both right- (97.6%) and left-hand (2.4%) traffic paradigms.

Camera intrinsics $K \in \mathbb{R}^{3\times 3}$ and per-frame extrinsics $(R_t, t_t)$ are recovered with VGGT; pose estimation allows computation of the ego-trajectory and camera-to-world mappings via

$$C_t = -R_t^\top t_t, \qquad T_{\mathrm{ego}} = \{C_t\}_{t=1}^{10} \in \mathbb{R}^{10\times 3}.$$
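The camera-center computation above can be sketched as a small batched NumPy helper (the array layout is an assumption; the paper only gives the formula):

```python
import numpy as np

def ego_trajectory(R, t):
    """Camera centers C_t = -R_t^T t_t for the 10 frames of a clip.

    R: (10, 3, 3) per-frame rotation matrices.
    t: (10, 3) per-frame translation vectors.
    Returns T_ego with shape (10, 3).
    """
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    # einsum computes R_t^T t_t for every frame t at once
    return -np.einsum("tij,ti->tj", R, t)
```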

Annotations feature layered granularity:

  • Scene Descriptions & Risk Assessment: Natural-language, multi-attribute scene descriptions with a discrete risk score $r \in \{1, \dots, 10\}$, grouped into low (1–3), medium (4–7), and high (8–10) risk levels.
  • Key Participant Detection: Objects (vehicle, truck, bicycle, motorcycle, pedestrian) detected via YOLO11s with per-class detection-score thresholds; each detection stores an ID, class label, and normalized bounding box.
  • Ego-Trajectory and Camera Parameters: Per-frame pose, with robust metric scale via depth-assisted 3D back-projection and ground estimation (see equations (6)–(9) in (Wang et al., 27 Jan 2026)).
  • Foreground Masks: Segmentation derived from SAM with morphological post-processing.
  • Spatial Quantities: Per-object ego-centric distances and angles ($D_{t,i}$, $\phi_{t,i}$) and pairwise proximities $\delta(i, j)$.

Statistical distributions across weather, time-of-day, road types, intersections, lanes, and risk levels enhance scene variety (e.g., urban 60.3%, highway 19.0%, low-risk 49.5%, medium-risk 49.7%, high-risk 0.8%).
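The three-band risk grouping described above can be written as a trivial helper (a sketch; the function name is not from the paper):

```python
def risk_level(r: int) -> str:
    """Map the discrete risk score r in {1..10} to the benchmark's
    low (1-3), medium (4-7), and high (8-10) bands."""
    if not 1 <= r <= 10:
        raise ValueError("risk score must be in 1..10")
    if r <= 3:
        return "low"
    if r <= 7:
        return "medium"
    return "high"
```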

2. Four-Axis Evaluation Protocol

ScenePilot-Bench employs a four-axis evaluation protocol, integrating “higher-is-better” and “lower-is-better” metrics normalized on a [0,100] scale, with module- and sub-metric weighting for comprehensive, interpretable model ranking.

2.1. Scene Understanding

  • SPICE Score (driving-refined):

$$P = \frac{|T(c)\cap T(S)|}{|T(c)|},\quad R = \frac{|T(c)\cap T(S)|}{|T(S)|},\quad \mathrm{SPICE} = \frac{2PR}{P+R}$$

  • Risk-Class Accuracy (Risk-Class-Acc):

$$\frac{1}{N}\sum_{i=1}^N \mathbf{1}(\hat r_i = r_i)$$
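The core of the SPICE score is an F1 over semantic proposition tuples $T(c)$ (candidate) and $T(S)$ (reference). Full SPICE does scene-graph parsing with synonym matching; the set-overlap core, under the simplifying assumption of exact tuple matching, looks like:

```python
def spice_f1(cand_tuples: set, ref_tuples: set) -> float:
    """F1 over proposition tuples: P = |T(c) ∩ T(S)| / |T(c)|,
    R = |T(c) ∩ T(S)| / |T(S)|, SPICE = 2PR / (P + R).

    Returns 0.0 when either side is empty or there is no overlap.
    """
    if not cand_tuples or not ref_tuples:
        return 0.0
    inter = len(cand_tuples & ref_tuples)
    if inter == 0:
        return 0.0
    p = inter / len(cand_tuples)
    r = inter / len(ref_tuples)
    return 2 * p * r / (p + r)
```

For example, a candidate producing one correct tuple out of two reference tuples gets $P = 1$, $R = 0.5$, F1 $\approx 0.667$.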

2.2. Spatial Perception

2.2.1. Object Classification

  • Class Accuracy (Class-Acc):

$$\frac{1}{N}\sum_{i=1}^N \mathbf{1}(\hat c_i = c_i)$$

2.2.2. Spatial Reasoning

  • Mean Relative Distance Error to Ego (EMRDE):

$$\frac{1}{N}\sum_{i=1}^N \frac{|\hat d_i - d_i|}{d_i}$$

  • Mean Relative Angle Error to Ego (EMRAE):

$$\frac{1}{N}\sum_{i=1}^N \frac{|\hat\phi_i - \phi_i|}{\max(|\phi_i|, \epsilon)}$$

  • Object-centric metrics (OMRDE, OMRAE): Replace $(d_i, \phi_i)$ with pairwise $(d_{ij}, \phi_{ij})$.
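Both relative-error families share one computation, differing only in whether the denominator is floored by $\epsilon$ (needed for angles, which can be near zero). A minimal sketch:

```python
import numpy as np

def mean_relative_error(pred, gt, eps=None):
    """Mean relative error in the EMRDE/EMRAE style.

    With eps=None the denominator is |gt| (distance variant, gt > 0);
    with eps set, the denominator is max(|gt|, eps) (angle variant).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    denom = np.abs(gt) if eps is None else np.maximum(np.abs(gt), eps)
    return float(np.mean(np.abs(pred - gt) / denom))
```

The object-centric variants (OMRDE, OMRAE) use the same function on pairwise quantities.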

2.3. Motion Planning

2.3.1. Meta-Action Prediction

  • Action Classes: Accelerate, Brake, ConstSpeed, LeftTurn, RightTurn, GoStraight, assigned by thresholds $a \gtrless \pm 0.15\ \mathrm{m/s^2}$ and $\Delta\theta \gtrless \pm 8^\circ$.
  • Direction Consistency Accuracy (DCS-Acc):

$$\frac{1}{N}\sum_{i=1}^N \mathbf{1}(\hat A_i = A_i^{\mathrm{gt}})$$

  • Mean Relative Acceleration Error (MRE-Acc):

$$\frac{1}{N}\sum_{i=1}^N \frac{|\hat a_i - a_i|}{|a_i|}$$

  • Angular Relative Error (ARE):

$$\frac{1}{N}\sum_{i=1}^N \frac{|\widehat{\Delta\theta}_i - \Delta\theta_i|}{\max(|\Delta\theta_i|, \epsilon)}$$
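A plausible reading of the six meta-action classes, given the two thresholds, is a speed label (Accelerate/Brake/ConstSpeed) and a direction label (LeftTurn/RightTurn/GoStraight) assigned in parallel. This pairing and the sign convention for turns are assumptions, not stated in the paper:

```python
def meta_actions(a: float, dtheta: float,
                 a_thr: float = 0.15, theta_thr: float = 8.0):
    """Return (speed, direction) labels from mean acceleration a (m/s^2)
    and heading change dtheta (degrees), using the stated thresholds.
    Positive dtheta taken as a right turn (an assumption)."""
    speed = ("Accelerate" if a > a_thr
             else "Brake" if a < -a_thr
             else "ConstSpeed")
    direction = ("RightTurn" if dtheta > theta_thr
                 else "LeftTurn" if dtheta < -theta_thr
                 else "GoStraight")
    return speed, direction
```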

2.3.2. Trajectory Planning

  • Average Displacement Error (ADE):

$$\mathrm{ADE} = \frac{1}{NT} \sum_{i=1}^N \sum_{t=1}^T \|\hat p_t^i - p_t^i\|_2$$

  • Final Displacement Error at Horizon T (FDE@T):

$$\mathrm{FDE}@T = \frac{1}{N} \sum_{i=1}^N \|\hat p_T^i - p_T^i\|_2$$
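Both displacement metrics fall out of one per-step distance array. A minimal sketch, assuming trajectories are stored as $(N, T, 2)$ waypoint arrays:

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE and FDE@T for predicted vs. ground-truth trajectories.

    pred, gt: (N, T, 2) arrays of 2D waypoints.
    Returns (ADE, FDE): mean L2 error over all steps, and mean L2
    error at the final step T.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    d = np.linalg.norm(pred - gt, axis=-1)   # (N, T) per-step errors
    return float(d.mean()), float(d[:, -1].mean())
```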

2.4. GPT-Score

  • Semantic Alignment Assessment: GPT-4o assigns a score in [0,1], rescaled to [0,100].

2.5. Normalization and Weighting

Scores for non-error metrics use $S = 100 \times M$; error metrics $E$ (e.g., EMRDE, ADE) are normalized via a piecewise function using dataset-derived thresholds $(x_1, x_2, k)$. The overall benchmark score is a weighted sum:

  • Scene Understanding: 15%
  • Spatial Perception: 35%
  • Motion Planning: 40%
  • GPT-Score: 10%

Intra-module weights apply within each axis (e.g., SPICE contributes 70% of Scene Understanding).
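The aggregation can be sketched as follows. The module weights are from the paper; the exact form of the piecewise error normalization is not specified here, so the two-segment linear decay below (full score down to a knee score $k$ at $x_1$, then to 0 at $x_2$) is an illustrative assumption:

```python
def normalize_error(E, x1, x2, k):
    """Hypothetical piecewise mapping of an error E to a [0, 100] score
    using thresholds (x1, x2, k). ASSUMED form: linear from 100 at E=0
    to k at E=x1, then linear to 0 at E=x2, clamped at 0 beyond."""
    if E <= x1:
        return 100.0 - (100.0 - k) * (E / x1)
    if E <= x2:
        return k * (x2 - E) / (x2 - x1)
    return 0.0

def overall_score(scene, spatial, planning, gpt):
    """Weighted aggregate with the stated module weights
    (15% / 35% / 40% / 10%); each input already in [0, 100]."""
    return 0.15 * scene + 0.35 * spatial + 0.40 * planning + 0.10 * gpt
```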

3. Generalization and Robustness Protocols

ScenePilot-Bench establishes cross-region and traffic-system robustness protocols:

  • Leave-One-Country-Out (LOCO): Train on all but one country, test on held-out.
  • Right-to-Left (R→L): Train on right-hand traffic, test on left-hand.

Performance drop on out-of-domain splits quantifies generalization capability and sensitivity to local traffic conventions.
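The LOCO protocol amounts to grouping clips by country and holding each country out in turn; a minimal sketch (the pair-based input format is an assumption):

```python
from collections import defaultdict

def loco_splits(clips):
    """Leave-One-Country-Out splits.

    clips: iterable of (clip_id, country) pairs.
    Yields (held_out_country, train_ids, test_ids) for each country.
    """
    by_country = defaultdict(list)
    for clip_id, country in clips:
        by_country[country].append(clip_id)
    for held_out, test_ids in by_country.items():
        train_ids = [cid for ctry, ids in by_country.items()
                     if ctry != held_out for cid in ids]
        yield held_out, train_ids, test_ids
```

The R→L split is the same idea with the traffic-handedness label in place of the country.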

4. Empirical Benchmarking of Vision-Language Models

ScenePilot-Bench evaluates multiple classes of VLMs:

  • Large general-purpose models: GPT-4o, GPT-5, Qwen2.5-VL-72B, Gemini-2.5-flash, Qwen3-VL-235B.
  • Driving-specialized models: ReasonDrive-7B (including chain-of-thought and fine-tuned variants).
  • ScenePilot models: ScenePilot-2 (2B parameters) and ScenePilot-2.5 (3B parameters), fine-tuned on 200k VQA samples.

Representative results:

| Model | Scene Semantics | Spatial Perception | Motion Planning | Overall Score |
|---|---|---|---|---|
| GPT-4o (Gen. VLM) | ≈ 92% | ≈ 45–50 | ≈ 20–35 | Low–Mid |
| ReasonDrive-7B | 70–80 | ≈ 58–60 | 76–77 | Moderate |
| ScenePilot-2.5 (3B) | 89.2 | 74.5 | 54.4 | 65.37 |

General-purpose VLMs excel in scene understanding (SPICE) but underperform on spatial perception and motion planning, evidencing a semantic–embodiment gap. Specialized models trained on ScenePilot-4K achieve higher scores in spatial and planning modules.

5. Observed Challenges and Research Directions

Empirical analysis reveals:

  • Generalization Gaps: Scene semantics transfer robustly; motion planning degrades under LOCO and R→L adaptation.
  • Risk Reasoning: Remains sensitive to local driving rules and culture-dependent conventions.
  • Limitations: Handling rare or high-risk events with limited exemplars; rule compliance under region-specific regulations; and bridging open-loop VLM performance with closed-loop, simulated or physical control, which remains an open problem.

Potential research avenues include integration of explicit traffic-rule encodings, expanded annotation of rare/safety-critical events (e.g., near-collisions), and leveraging ScenePilot-Bench in simulation frameworks for closed-loop, end-to-end controller evaluation.

6. Significance and Prospective Impact

ScenePilot-Bench establishes a unified, region-diverse, finely annotated, and systematically benchmarked dataset for holistic evaluation of vision-language models in driving. Its four-axis protocol, rigorous normalization, and cross-domain generalization settings make it a salient resource for measuring and improving the embodied reasoning, spatial awareness, and safety understanding of AI co-pilots in autonomous driving. Its empirical analyses and reported performance gaps set clear directions for future research and provide the substrate for methodical advancement in both open-domain and specialized, safety-aware vision-language intelligence (Wang et al., 27 Jan 2026).
