DTPQA: Distance-Annotated Traffic Perception QA

Updated 24 November 2025
  • The paper introduces DTPQA, a benchmark that isolates basic perception in traffic scenarios by utilizing distance-stratified annotations.
  • It employs both synthetic data from CARLA and real images from nuScenes to rigorously assess vision-language models on safety-critical queries.
  • Results demonstrate that VLM performance degrades with increasing object distance, emphasizing challenges in automated driving perception.

Distance-Annotated Traffic Perception Question Answering (DTPQA) is a visual question answering (VQA) benchmark constructed to quantitatively evaluate the perception capabilities of Vision-Language Models (VLMs) in traffic scenarios, with a primary emphasis on how object distance affects perception accuracy. Designed to exclude high-level reasoning and isolate fundamental perception, DTPQA addresses critical needs in automated driving, where reliable detection and identification of objects are indispensable both at close (≤ 20 m) and far (> 30 m) ranges. The benchmark provides a structured, distance-stratified protocol for probing perception-only skills using simple but traffic-critical questions, with rigorous annotation and data-generation procedures (Theodoridis et al., 17 Nov 2025).

1. Motivation and Scope

Automated driving systems require precise, distance-aware perception to guide safety-critical behaviors such as braking and lane changes. Preexisting VQA datasets for traffic scenes typically entangle perception with commonsense and world-knowledge reasoning, obscuring the fine-grained measurement of fundamental visual understanding. Moreover, almost no VQA benchmarks systematically assess perception degradation with increasing object distance. DTPQA explicitly addresses these gaps by:

  • Providing exclusively perception-driven, multiple-choice VQA tasks framed around basic but crucial driving questions (e.g., pedestrian presence, traffic light color).
  • Annotating every queried object’s precise distance from the camera, allowing for direct stratified evaluation across binned intervals (5, 10, 20, 30, 40, 50 m and "Negative"/absent).
  • Offering a dual-modal composition: a synthetic CARLA-based dataset (DTP-Synthetic) with consistent ground-truth control, and a real-image set (DTP-Real) curated from the nuScenes dataset.

2. Dataset Design and Construction

DTPQA consists of 19,149 total samples, split between DTP-Synthetic (9,368 samples) and DTP-Real (9,781 samples). Each sample comprises an RGB image, a posed question, a ground-truth answer, discrete and precise distance annotations, and category-specific metadata.

DTP-Synthetic: Generated in CARLA v0.9.15 (maps Town01–07, Town10HD, Town12, Town15), scenarios are randomized in terms of time (hours 6:00–18:00) and weather (Clear, Rain, Wet, Clouds, Fog). A single 90° FOV camera mounted on a standard “ego-vehicle” at 1.4 m height captures scenes after objects are spawned at exact distances. Each object instance is placed at one of the six target distances (d ∈ {5, 10, 20, 30, 40, 50} m), with minor Gaussian noise (σ = 1.2 m where relevant). Images and associated metadata are subject to manual validation, ensuring high-quality, artifact-free samples.
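The placement step can be illustrated with a short sketch against the CARLA Python API. The helper below is hypothetical (not taken from the released generator): it computes a spawn transform a target distance straight ahead of the ego camera and adds the Gaussian jitter described above.

```python
import random

import carla  # CARLA Python API; assumed available alongside the simulator

NOISE_SIGMA_M = 1.2  # jitter reported for the synthetic placement protocol

def spawn_transform_at_distance(ego_transform: carla.Transform,
                                target_distance_m: float,
                                jitter: bool = True) -> carla.Transform:
    """Place an object straight ahead of the ego vehicle at a target distance.

    Hypothetical helper: the longitudinal distance is optionally perturbed
    with Gaussian noise (sigma = 1.2 m), mirroring the protocol above.
    """
    d = target_distance_m + (random.gauss(0.0, NOISE_SIGMA_M) if jitter else 0.0)
    fwd = ego_transform.get_forward_vector()    # unit vector along the ego heading
    loc = ego_transform.location
    spawn_location = carla.Location(
        x=loc.x + fwd.x * d,
        y=loc.y + fwd.y * d,
        z=loc.z,                                # simplification: stay on the road plane
    )
    # Turn the object to face the camera so that, e.g., pedestrians are visible.
    rotation = carla.Rotation(yaw=ego_transform.rotation.yaw + 180.0)
    return carla.Transform(spawn_location, rotation)
```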

DTP-Real: Frames are sourced from nuScenes (Motional, 2019), using front and surround RGB cameras. Object distances (d_exact) are computed from nuScenes’ 3D bounding box annotations and binned to the nearest of {5, 10, 20, 30, 40, 50} m for inclusion (excluding samples where d_exact > 55 m). Sample selection ensures balanced per-category and per-distance answer distributions, exploiting nuScenes metadata such as visibility and object class.

Annotation Schema: Each dataset sample is stored as a JSON entry containing: image path, question, answer, discrete ("distance") and, if applicable, floating-point ("precise_distance") metrics, with fields for town, weather, and other category-specific details.
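For illustration, a single entry following this schema might look as below (shown as a Python dict; only "distance" and "precise_distance" are key names stated above, the remaining keys are guesses at the described fields).

```python
# Hypothetical DTP-Synthetic annotation entry; field values are illustrative.
sample = {
    "image": "dtp_synth/Town03/000123.jpg",  # image path (illustrative)
    "question": "What colour is the traffic light?",
    "answer": "Red",
    "distance": 30,               # discrete distance bin in metres
    "precise_distance": 29.4,     # exact distance, where applicable
    "town": "Town03",             # synthetic-only metadata
    "weather": "Fog",             # synthetic-only metadata
}
```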

Distance Binning: The binning strategy assigns each sample the bin b ∈ {5, 10, 20, 30, 40, 50} m that minimizes |d_exact − b|. Samples exceeding 55 m are omitted.
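A minimal sketch of this rule (not the released bin_distances.py):

```python
from typing import Optional

DISTANCE_BINS_M = (5, 10, 20, 30, 40, 50)
MAX_DISTANCE_M = 55.0

def bin_distance(d_exact: float) -> Optional[int]:
    """Return the nearest distance bin, or None if the sample is discarded (> 55 m)."""
    if d_exact > MAX_DISTANCE_M:
        return None
    return min(DISTANCE_BINS_M, key=lambda b: abs(d_exact - b))

assert bin_distance(33.7) == 30
assert bin_distance(60.0) is None
```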

3. Category Structure and Question Taxonomy

DTPQA questions are chosen to be trivial in semantic content yet crucial for safety and decision-making:

Category | Question Template | Answer Set
Single-pedestrian presence | Are there any pedestrians crossing the road? | {Yes, No}
Pedestrian direction | Is the pedestrian walking to the left or right? | {Left, Right}
Multiple-pedestrian count | How many people are ahead? | {0, 1, 2, 3}
Indicator (blinker) side | Which turn signal is active on the truck ahead? | {Left, Right, None}
Traffic-light colour (synthetic) | What colour is the traffic light? | {Red, Yellow, Green}
Traffic-sign type (synthetic) | What is the shown road sign? | {SpeedLimit, Stop, …}

DTP-Synthetic includes all six categories, while DTP-Real comprises the four categories possible given nuScenes annotation coverage. Each distance bin and category combination is balanced across all answer options, except for a minor, statistically insignificant 12/11/11 split in Cat.5-Synth at 10 m.

4. Dataset Statistics

Distribution per distance bin and modality is as follows:

Distance (m) | DTP-Synth | DTP-Real | Combined
5 | 1,350 | 379 | 1,729
10 | 1,434 | 1,280 | 2,714
20 | 1,556 | 865 | 2,421
30 | 1,556 | 2,329 | 3,885
40 | 1,556 | 2,310 | 3,866
50 | 1,556 | 2,218 | 3,774
Negative | 360 | 400 | 760
Total | 9,368 | 9,781 | 19,149

This stratification enables precise measurement of VLM performance as a function of distance. Negative cases (object absent) additionally probe robustness to falsely reporting objects that are not present.

5. Evaluation Protocol and Baselines

Benchmark evaluation proceeds via standard accuracy (ACC) and chance-corrected accuracy (CCA):

\mathrm{CCA} = \frac{\mathrm{ACC} - p_{\mathrm{chance}}}{1 - p_{\mathrm{chance}}} \times 100\%

where p_{\mathrm{chance}} = 1 / |\text{AnswerSpace}|. Analyses stratify ACC and CCA by distance bin.
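A short helper illustrating the metric (a sketch, not the authors' evaluation code):

```python
def chance_corrected_accuracy(acc: float, num_answers: int) -> float:
    """CCA = (ACC - p_chance) / (1 - p_chance) * 100, with p_chance = 1/|AnswerSpace|."""
    p_chance = 1.0 / num_answers
    return (acc - p_chance) / (1.0 - p_chance) * 100.0

# Example: 60% raw accuracy on a 3-option question (p_chance = 1/3) gives ~40% CCA.
print(round(chance_corrected_accuracy(0.60, 3), 1))  # 40.0
```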

Ten baseline VLMs were assessed: nine state-of-the-art “small” models (e.g., BLIP-2, InstructBLIP, MiniGPT-4, several LLaVA variants) and one large model. All were evaluated zero-shot, with no task-specific tuning. Prompts followed a canonical multiple-choice template: “Q: <question>\nA: (a) <opt1> (b) <opt2> … Which is correct?”
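A minimal sketch of assembling such a prompt (option lettering follows the quoted template; this is not the exact benchmark script):

```python
def build_prompt(question: str, options: list[str]) -> str:
    """Format a DTPQA-style multiple-choice prompt as quoted above."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    opts = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return f"Q: {question}\nA: {opts} Which is correct?"

print(build_prompt("What colour is the traffic light?", ["Red", "Yellow", "Green"]))
# Q: What colour is the traffic light?
# A: (a) Red (b) Yellow (c) Green Which is correct?
```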

Results reveal that humans reach ≥ 95% CCA across all categories. VLMs show CCA values between 20% and 80% (task-dependent), with spatial categories (e.g., pedestrian direction, blinker side) demonstrating particular degradation at ≥ 30 m. Performance generally decreases linearly with log-distance: for example, ACC at 5 m is approximately 80–90%, falling to 30–40% at 50 m. Performance is modestly higher on synthetic data, with real images being more challenging due to greater scene complexity and variable lighting.

6. Data Generation Pipeline and Extension

All dataset generation scripts are released. The repository comprises:

  • dtp_synth/: CARLA-based synthetic generator; includes configuration (map list, weather presets, distance bins), main pipeline (generate_synth.py), and utility functions for precise object placement.
  • dtp_real/: nuScenes annotation and filtering; scripts for keyframe selection, distance binning, and sample export (extract_real.py, bin_distances.py).
  • Output structure stores JPEGs and a master annotation JSON.

Extending the benchmark is facilitated via:

  • Addition of new question categories or object classes with custom routines in the respective pipelines.
  • Modification of distance bins via configuration changes.
  • Adjustment of weather or time-of-day conditions through configuration files, with all such parameters recorded at the sample level (a hypothetical configuration is sketched after this list).
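To make these configuration-driven extension points concrete, a hypothetical configuration fragment is sketched below; the key names are illustrative and may differ from the released files, but the values mirror the generation settings described in Section 2.

```python
# Hypothetical DTP-Synthetic configuration; keys are illustrative, values follow
# the generation settings described above (maps, weather, daylight window, bins).
SYNTH_CONFIG = {
    "maps": ["Town01", "Town02", "Town03", "Town10HD", "Town12", "Town15"],
    "weather_presets": ["Clear", "Rain", "Wet", "Clouds", "Fog"],
    "daylight_hours": (6, 18),
    "distance_bins_m": [5, 10, 20, 30, 40, 50],
    "placement_noise_sigma_m": 1.2,
}
```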

7. Insights, Limitations, and Future Directions

Current DTPQA tasks emphasize basic perception cues; extension to more complex scenarios—such as occlusion, adverse weather, and multimodal (e.g., RGB + LiDAR) fusion—remains open. DTP-Real’s range is inherently limited by nuScenes (≤ 55 m, 6 cameras), suggesting that inclusion of datasets like Waymo or Argoverse could diversify perspective and range. All published VLM results to date are strictly zero-shot; while fine-tuning on DTPQA could significantly improve results, it may introduce overfitting to dataset-specific biases.

Additional annotations (e.g., weather, town for synthetic; variance in precise distance for real) enable secondary analyses on weather robustness and sensitivity to object position. Anticipated extensions include evaluations with dynamic video-VQA, robust prompt-engineering protocols, and integration of other sensor modalities (Theodoridis et al., 17 Nov 2025).
