RoadSceneBench: Urban Scene Benchmark
- RoadSceneBench is a lightweight benchmark providing mid-level road semantics with relational annotations for urban driving, emphasizing structured reasoning about lane connectivity and scene topology.
- The dataset includes 2,341 five-frame video snippets from 20 cities with 163,000 detailed labels capturing lane graphs, drivable areas, and dynamic traffic conditions.
- The HRRP-T framework introduced with RoadSceneBench enhances temporal consistency and spatial coherence, yielding gains of more than 15 percentage points in both precision and recall over non-relational baselines.
RoadSceneBench is a lightweight, information-rich benchmark for evaluating mid-level road semantics and relational scene understanding in urban driving environments. Unlike traditional perception datasets focused on pixel- or box-level recognition, RoadSceneBench emphasizes structured reasoning about road topology, lane connectivity, and dynamic scene structure. It is camera-only, with richly annotated video clips across diverse urban centers, and supports the development and evaluation of structure-aware reasoning models—especially vision-language models (VLMs)—capable of extracting map-relevant, temporally consistent mid-level scene attributes (Liu et al., 27 Nov 2025).
1. Dataset Composition and Annotation Schema
RoadSceneBench comprises 2,341 five-frame video snippets (11,705 frames, 4096 × 2160 px), geographically covering 20 cities including Beijing and Shanghai. Frames are annotated with approximately 163,000 mid-level semantic labels, spanning six structurally coupled attributes:
- Lane Graph Topology: Lane Count (integer ≥ 1); Ego-lane Index (1 ≤ i ≤ Lane Count)
- Drivable-area Connectivity: Binary flags for Junction, Entrance-ramp, Exit-ramp presence
- Object-level Relations: Lane-change Feasibility to left/right (binary; only defined if adjacent lane exists)
- Contextual Semantics: Traffic Condition (Free-flow, Moderate, Congestion); Road Scene Type (Urban, Suburban, Highway)
Annotations explicitly capture relational dependencies and structure. Train/val/test splits follow a 70/15/15 ratio by snippet, ensuring no overlap between sets.
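The annotation schema above can be sketched as a per-frame record. The field names and string encodings below are illustrative, not the dataset's actual serialization format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameLabel:
    """One frame's mid-level annotation (field names are illustrative)."""
    lane_count: int                   # integer >= 1
    ego_lane_index: int               # 1 <= ego_lane_index <= lane_count
    junction: bool                    # drivable-area connectivity flags
    entrance_ramp: bool
    exit_ramp: bool
    lane_change_left: Optional[bool]  # None when no adjacent lane exists
    lane_change_right: Optional[bool]
    traffic_condition: str            # "free-flow" | "moderate" | "congestion"
    road_scene_type: str              # "urban" | "suburban" | "highway"
```

Encoding lane-change feasibility as `Optional[bool]` makes the "only defined if an adjacent lane exists" condition explicit in the type rather than as an out-of-band convention.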
2. Task Formulation and Structurally Coupled Prediction
RoadSceneBench targets mid-level scene inference: the task is to predict, given an image (optionally with temporal neighbors), the six interdependent attributes associated with that frame:
- Lane count $L$ (integer $\geq 1$), the number of visually discernible lanes
- Ego-lane index $i$ (integer, $1 \leq i \leq L$), the current lane of the ego vehicle
- Junction, entrance, exit flags (binary)
- Lane-change feasibility (binary, context-dependent)
- Traffic condition and road scene type (multi-class)
Structural coupling among predictions is enforced: e.g., $1 \leq \text{Ego-lane Index} \leq \text{Lane Count}$, and lane-change feasibility is only defined if an adjacent lane exists. This design encourages models to capture the logical and spatial constraints that underpin real-world driving scenarios.
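Such coupling can be checked mechanically. Below is a minimal sketch of a structural validator over a predicted attribute set; the key names are illustrative, and lane index 1 is assumed (not stated in the source) to denote the leftmost lane:

```python
def structural_violations(pred: dict) -> list:
    """Return the structural constraints a predicted attribute set violates."""
    violations = []
    if pred["lane_count"] < 1:
        violations.append("lane_count must be >= 1")
    if not 1 <= pred["ego_lane_index"] <= pred["lane_count"]:
        violations.append("ego_lane_index outside [1, lane_count]")
    # Lane-change feasibility is defined only when an adjacent lane exists;
    # lane 1 is taken here, by assumption, to be the leftmost lane.
    if pred["ego_lane_index"] == 1 and pred.get("lane_change_left") is not None:
        violations.append("lane_change_left defined with no lane to the left")
    if (pred["ego_lane_index"] == pred["lane_count"]
            and pred.get("lane_change_right") is not None):
        violations.append("lane_change_right defined with no lane to the right")
    return violations
```

A model whose decoded output fails these checks is structurally inconsistent regardless of per-attribute accuracy, which is the failure mode the coupled formulation is designed to penalize.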
3. Benchmark Metrics and Structural Evaluation
Primary evaluation metrics are precision (P), recall (R), and F1, computed per attribute:
- For integer attributes (lane count, ego-lane index): an exact match with the ground truth is required to count as a true positive.
- For multiclass (traffic condition, scene type): metrics are macro-averaged.
- For binary (junction, ramp, lane-change): standard binary classification metrics apply.
Additional structural/graphical metrics:
- Graph connectivity error (EdgeErr): compares predicted lane-graph connectivity to ground-truth.
- Temporal consistency score (TC): quantifies label smoothness and plausibility across the five-frame sequence,

$$\mathrm{TC} = \frac{1}{|\mathcal{A}|\,(T-1)} \sum_{a \in \mathcal{A}} \sum_{t=2}^{T} \mathbb{1}\!\left[\hat{y}^{a}_{t} = \hat{y}^{a}_{t-1}\right],$$

where $\mathcal{A}$ is the set of attributes, $T = 5$ is the sequence length, and $\hat{y}^{a}_{t}$ is the prediction for attribute $a$ at frame $t$.
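Read as an agreement rate over adjacent frames, the TC score can be computed as follows; this is a minimal sketch, and the paper's exact weighting or plausibility terms may differ:

```python
def temporal_consistency(seq: list, attributes: list) -> float:
    """Fraction of adjacent-frame pairs whose attribute labels agree,
    averaged over all attributes; seq is a list of per-frame dicts."""
    T = len(seq)
    if T < 2:
        return 1.0  # a single frame is trivially consistent
    agreements = sum(
        seq[t][a] == seq[t - 1][a]
        for a in attributes
        for t in range(1, T)
    )
    return agreements / (len(attributes) * (T - 1))
```

A score of 1.0 means every attribute is constant across the snippet; flicker in any attribute (e.g., lane count oscillating between frames) lowers the score proportionally.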
4. HRRP-T: Hierarchical Relational Reward Propagation with Temporal Consistency
RoadSceneBench introduces the Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T) paradigm for VLM training. The HRRP-T framework combines spatial, topological, and relational rewards at the frame level with temporal smoothness and plausibility constraints over sequences. The overall reward takes the form

$$R_{\mathrm{HRRP\text{-}T}} = R_{\mathrm{spatial}} + R_{\mathrm{topo}} + R_{\mathrm{rel}} + \lambda\, R_{\mathrm{temp}},$$

with $\lambda$ weighting the temporal term.
Supervised fine-tuning (SFT) on labeled data is complemented with reinforcement learning using a self-critical policy gradient, optimizing

$$\mathcal{L}_{\mathrm{RL}}(\theta) = -\,\mathbb{E}_{\hat{y} \sim p_\theta}\!\left[\big(R(\hat{y}) - R(\bar{y})\big)\log p_\theta(\hat{y})\right],$$

where $R$ is the HRRP-T reward and $\bar{y}$ is the greedy-decoded baseline prediction.
Ablation studies demonstrate that the hierarchical (vs. flat) organization and temporal component each substantially improve both spatial consistency and temporal coherence.
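The training signal described above can be sketched in scalar form. Both the reward combination (`hrrp_t_reward`, with an assumed weight `lam`) and the self-critical loss below are illustrative reconstructions, not the paper's exact formulation:

```python
def hrrp_t_reward(frame_rewards: list, temporal_reward: float,
                  lam: float = 0.5) -> float:
    """Hypothetical HRRP-T reward: mean per-frame (spatial/topological/
    relational) reward plus a weighted temporal-consistency term."""
    return sum(frame_rewards) / len(frame_rewards) + lam * temporal_reward

def scst_loss(logp: float, sample_reward: float,
              greedy_reward: float) -> float:
    """Self-critical policy-gradient loss for one sampled prediction:
    REINFORCE with the greedy-decoded prediction's reward as baseline.
    logp is the (summed) log-probability of the sampled prediction."""
    advantage = sample_reward - greedy_reward  # positive -> reinforce sample
    return -advantage * logp
```

A positive advantage (the sample out-scores the greedy rollout) makes minimizing the loss push the sample's log-probability up, so predictions that better satisfy the spatial, topological, relational, and temporal rewards become more likely.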
5. Model Baselines and Experimental Results
Benchmarked models include leading closed-source VLMs (GPT-4o, Gemini-2.5-Pro, Claude-3.7-Sonnet) and a range of open-source VLMs (ERNIE-4.5-VL-28B, DeepSeek-VL2, LLaVA-Onevision, InternVL3, Qwen2.5-VL, Qwen3-VL). MapVLM, under both SFT and SFT+HRRP-T regimes, achieves the best performance:
| Method | Precision (%) | Recall (%) |
|---|---|---|
| Gemini-2.5-Pro | 60.61 | 52.70 |
| Qwen3-VL-8B | 57.34 | 43.82 |
| MapVLM (SFT) | 72.14 | 67.25 |
| MapVLM (SFT+HRRP-T) | 75.78 | 72.17 |
The addition of HRRP-T yields a gain of more than 15 percentage points in both precision and recall over the best non-relational baseline. Removing the temporal rewards drops ego-lane-index recall by roughly 3 points.
Qualitative analyses reveal that HRRP-T trained models are robust to complex situations (e.g., congested intersections, occlusion), whereas baselines (including zero/few-shot open-source VLMs) often misclassify the ego-lane, lane-change feasibility, and scene type.
6. Differentiation from Prior Datasets and Future Directions
RoadSceneBench is distinguished from existing datasets as follows:
- Compactness and focus: 11,705 frames vs. >100k in large-scale perception datasets, with camera-only input and dense mid-level semantics.
- Relational annotation: Attributes are structurally coupled and annotated per frame, supporting research in relational and structural scene understanding.
- Reasoning-centric evaluation: Explicitly designed for VLMs and reasoning models; complements perception and detailed HD map benchmarks.
Limitations currently include restriction to 20 Chinese cities, static semantics (no object instance grounding or transient events), and no explicit object-level mapping. Planned expansions involve broader geographic coverage, dynamic event annotation, and integration of fine-grained interaction reasoning. A plausible implication is that RoadSceneBench and HRRP-T lower the entry barrier for mid-level reasoning research while supporting continued advances in temporal and topological scene understanding (Liu et al., 27 Nov 2025).