R4D-Bench: 4D VQA Benchmark

Updated 22 December 2025
  • R4D-Bench is a region-level 4D video question answering benchmark that employs explicit region masks and tokens to evaluate multimodal models' spatial and temporal reasoning.
  • It integrates automated region extraction with human verification to ensure high-fidelity pixel-level segmentation and precise region-token assignments.
  • The benchmark features both static and dynamic tasks, covering spatial relations, movement analysis, and numerical queries through a multiple-choice question format.

R4D-Bench is a region-level 4D Video Question Answering (VQA) benchmark designed to evaluate 3D spatial and temporal reasoning in multimodal LLMs (MLLMs). By providing precisely annotated region masks and region-indexed multiple-choice questions for dynamic video scenes, R4D-Bench addresses core limitations in prior 3D/4D VQA benchmarks, particularly the lack of explicit spatial grounding and dynamic, object-specific queries. It enables rigorous assessment of models’ region-tracking, spatial localization, and temporal comprehension abilities across a suite of static and dynamic reasoning tasks (Yang et al., 18 Dec 2025).

1. Objectives and Benchmark Scope

R4D-Bench was constructed to facilitate evaluation of a model’s ability to localize user-specified regions and answer complex numerical, relational, and dynamic questions about those regions over time. Unlike prior 3D/4D VQA datasets, which focus on free-form scene queries and lack precise region referents, R4D-Bench supplies both 2D segmentation masks (defining regions of interest on the first frame) and corresponding region tokens in all task prompts. This enables the benchmark to probe spatial (depth, size, spatial relation) as well as temporal (motion, ordering, speed, displacement) reasoning about specific objects as they evolve throughout a video.

2. Dataset Construction and Grounding Pipeline

R4D-Bench is derived by transforming two established non-region 4D VQA datasets: STI-Bench and VLM4D. The construction pipeline integrates automated region extraction, multi-modal localization, and human verification:

  1. Keyword Extraction: Each question is processed by a pretrained MLLM (Qwen-2.5VL), which wraps object/noun mentions in angle brackets.
  2. Object Detection and Segmentation: If per-object ground-truth segmentation masks exist, they are adopted; otherwise, bounding boxes are generated using GroundingDINO and refined to pixel-level masks via SAM2.
  3. Set-of-Marks (SoM) Visualization: Numbered markers are overlaid on the first frame at each object mask, generating a SoM image.
  4. Automated Region-Token Assignment: The SoM image and bracketed question are input to Qwen-2.5VL to produce a JSON mapping between region tokens and object markers.
  5. Human Verification: Label Studio is used by human annotators to review and correct all region-to-mask assignments.

This hybrid pipeline enables high-fidelity region–question grounding at scale (Yang et al., 18 Dec 2025).
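
A minimal sketch of this pipeline is given below. All helper functions (`extract_object_mentions`, `detect_boxes`, `refine_to_masks`, `overlay_som`, `assign_region_tokens`, `rewrite_with_tokens`) are hypothetical placeholders standing in for the Qwen-2.5VL, GroundingDINO, SAM2, and Set-of-Marks steps described above; they are not the authors’ released code or any library’s actual API.

```python
# Hypothetical sketch of the R4D-Bench grounding pipeline (steps 1-5 above).
# Every helper called here is an assumed placeholder wrapper, not a real API.
from dataclasses import dataclass, field

@dataclass
class GroundedQuestion:
    question: str                                      # text with <R1>, <R2>, ... tokens
    region_masks: dict = field(default_factory=dict)   # token -> first-frame 2D mask

def ground_question(question, first_frame, gt_masks=None) -> GroundedQuestion:
    # 1. Keyword extraction: the MLLM wraps object/noun mentions in angle brackets.
    bracketed = extract_object_mentions(question)

    # 2. Detection + segmentation: reuse ground-truth masks when available,
    #    otherwise detect boxes and refine them to pixel-level masks.
    if gt_masks is not None:
        masks = gt_masks
    else:
        boxes = detect_boxes(first_frame, bracketed)     # GroundingDINO-style step
        masks = refine_to_masks(first_frame, boxes)      # SAM2-style step

    # 3. Set-of-Marks visualization: numbered markers overlaid on the first frame.
    som_image = overlay_som(first_frame, masks)

    # 4. Automated region-token assignment: the MLLM returns a JSON mapping
    #    from region tokens to object markers.
    token_to_marker = assign_region_tokens(som_image, bracketed)

    # 5. Human verification of the assignments happens offline (Label Studio).
    rewritten = rewrite_with_tokens(bracketed, token_to_marker)
    region_masks = {tok: masks[m] for tok, m in token_to_marker.items()}
    return GroundedQuestion(question=rewritten, region_masks=region_masks)
```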

3. Dataset Composition and Format

R4D-Bench comprises 780 distinct real-world video clips, with 30–300 frames per clip sampled at 10–30 FPS. All examples provide RGB video frames only (no additional depth or point cloud channels). Region annotations consist of one or more 2D masks per question, with all masks defined on the first frame. Questions are rewritten to reference explicit region tokens, e.g., “How did <R1> move <R2>?”, with corresponding region mask(s). The Q&A format is multiple-choice, with four to five candidate answers per question and a single correct response.
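
As an illustration of this format, a single benchmark item might be serialized as follows. The field names and values are hypothetical, chosen only to mirror the structure described above (region tokens, first-frame masks, and four to five answer options); they are not the released schema.

```python
# Hypothetical example of one R4D-Bench item; illustrative only.
example_item = {
    "video_id": "clip_0001",                 # one of the 780 real-world clips
    "num_frames": 120,                       # clips contain 30-300 frames
    "fps": 24,                               # sampled at 10-30 FPS
    "question": "How did <R1> move <R2>?",   # region-token question text
    "region_masks": {                        # 2D masks defined on the first frame
        "<R1>": "masks/clip_0001_r1.png",
        "<R2>": "masks/clip_0001_r2.png",
    },
    "options": ["pushed it left", "lifted it", "pulled it closer", "did not touch it"],
    "answer_index": 0,                       # single correct choice
    "task": "Translational Movement (T)",
}
```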

4. Region-Level Prompting and Task Taxonomy

Region-level prompts use discrete segmentation masks mapped to tokens (<R1>, <R2>, …) in the question text, enforcing explicit spatial grounding and requiring models to perform both localization and region tracking over the video.

The benchmark is divided into Static and Dynamic subsets:

  • Static tasks (object stationary, only camera motion):
    • 3D Video Grounding (VG): Choose among candidate 3D boxes.
    • Dimension Measurement (DM): Numeric size or inter-region distances.
    • Spatial Relation (SR): Relational prepositions (left/right, front/back, up/down).
  • Dynamic tasks (object motion):
    • Counting (C): Count region event occurrences.
    • Translational Movement (T): Linear displacements.
    • Rotational Movement (R): Angular motions.
    • False Positive Detection (FP): Trick queries about non-occurring events.
    • Speed & Acceleration Estimation (SA): Compute Δdistance/Δtime or Δvelocity/Δtime.
    • Displacement & Path Length (DP): Traversed path length versus straight-line displacement (a small numeric sketch of SA and DP follows this list).

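To make the SA and DP quantities concrete, the sketch below computes speed, acceleration, path length, and straight-line displacement from a hypothetical sequence of per-frame object positions; the positions and timestamps are invented purely for illustration.

```python
import numpy as np

# Hypothetical per-frame 3D positions (metres) and timestamps (seconds) of a
# tracked region; invented values, only to illustrate the SA and DP quantities.
positions = np.array([[0.0, 0.0, 0.0],
                      [0.5, 0.0, 0.0],
                      [1.0, 0.3, 0.0],
                      [1.0, 1.0, 0.0]])
times = np.array([0.0, 0.5, 1.0, 1.5])

step_vectors = np.diff(positions, axis=0)           # displacement between frames
step_lengths = np.linalg.norm(step_vectors, axis=1)
dt = np.diff(times)

speeds = step_lengths / dt                          # SA: Δdistance / Δtime
accelerations = np.diff(speeds) / dt[1:]            # SA: Δvelocity / Δtime

path_length = step_lengths.sum()                    # DP: length of traversed path
displacement = np.linalg.norm(positions[-1] - positions[0])  # DP: straight line

print(f"speeds={speeds}, accelerations={accelerations}")
print(f"path={path_length:.2f} m, displacement={displacement:.2f} m")
```
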
The presentation protocol samples $N$ frames (commonly $N = 16$), shows the first frame with the SoM overlay, and feeds the region-indexed question, requiring selection of one correct answer.
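
A minimal sketch of this presentation protocol, under the assumption of uniform frame sampling, is shown below; `overlay_som` is the same hypothetical SoM renderer used in the pipeline sketch above, not a named library call.

```python
import numpy as np

def build_eval_inputs(frames, question, region_masks, n=16):
    """Sample n frames uniformly, overlay SoM markers on the first sampled frame,
    and package the region-indexed multiple-choice prompt.

    Uniform sampling is an assumption; overlay_som is a hypothetical helper."""
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    sampled = [frames[i] for i in idx]
    sampled[0] = overlay_som(sampled[0], region_masks)  # SoM overlay on frame 0
    # The sampled frames and the <R1>/<R2>-indexed question are fed to the MLLM,
    # which must return exactly one of the candidate answers.
    return sampled, question
```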

5. Evaluation Metrics and Protocol

The primary evaluation metric is multiple-choice accuracy over the set of $M$ region-prompted questions:

$$\mathrm{Accuracy} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}\bigl(\hat{a}_i = a_i^*\bigr)$$

Here, $\hat{a}_i$ denotes the model’s selected answer for the $i$-th question and $a_i^*$ is the ground-truth index.

For temporal localization subtasks (e.g., “at what time does X occur?”), temporal error can be reported as:

$$\mathrm{TE} = \frac{1}{M} \sum_{i=1}^{M} \bigl|\hat{t}_i - t_i^*\bigr|$$

Intersection over Union (IoU) can be measured for region mask segmentation, but R4D-Bench provides all masks and does not require mask prediction.
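
A straightforward implementation of both metrics, assuming predicted answers are stored as option indices and times in seconds, is sketched below; the example values are illustrative, not benchmark results.

```python
import numpy as np

def accuracy(predicted: np.ndarray, gold: np.ndarray) -> float:
    """Multiple-choice accuracy: fraction of questions whose selected
    option index matches the ground-truth index."""
    return float(np.mean(predicted == gold))

def temporal_error(pred_times: np.ndarray, gold_times: np.ndarray) -> float:
    """Mean absolute temporal error (seconds) over temporal localization items."""
    return float(np.mean(np.abs(pred_times - gold_times)))

# Illustrative values only.
print(accuracy(np.array([0, 2, 1, 3]), np.array([0, 2, 2, 3])))    # 0.75
print(temporal_error(np.array([1.2, 4.8]), np.array([1.0, 5.0])))  # 0.2
```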

6. Comparative Analysis with Preceding Benchmarks

R4D-Bench introduces key innovations over previous 3D/4D VQA datasets:

  • Explicit Region Prompts: First to provide direct mask annotations and discrete region tokens, in contrast to free-form referring expressions (e.g., object color or name).
  • Dynamic Video: Includes real object movement, which is not present in image-based or camera-motion-only datasets (e.g., OmniSpatial, SAT).
  • Task Diversity: Unifies static grounding, relational, and measurement queries with dynamic counting, kinematics, and path analysis.
  • Scale: Encompasses 780 scenes and over 1,500 region-level QA pairs, situating it between small-scale image datasets and large video QA suites.
| Dataset    | Regions? | Input Type    | #QA  |
|------------|----------|---------------|------|
| SAT-real   | No       | Images        | 150  |
| VSI-Bench  | No       | Static Video  | 6K   |
| STI-Bench  | No       | Dynamic Video | 2K   |
| VLM4D-real | No       | Dynamic Video | 1K   |
| R4D-Bench  | Yes      | Dynamic Video | 1.5K |

R4D-Bench uniquely supports both “dynamic video” input and explicit “region-prompted” queries (Yang et al., 18 Dec 2025).

7. Data Splits and Usage Guidelines

R4D-Bench is distributed as an evaluation-only benchmark, with no official training or validation splits. All 1,517 region-prompted QA pairs are designated for performance assessment. For model development or fine-tuning, an internal split of 70% train, 15% validation, and 15% test is recommended. Models must be evaluated with region tokens and ground-truth masks provided to ensure that performance gains result from true 4D reasoning rather than object name matching.
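
If such an internal split is needed, it can be drawn reproducibly from the 1,517 QA pairs, for example as follows; the 70/15/15 ratios come from the guideline above, and the random seed is an arbitrary choice.

```python
import numpy as np

def split_indices(n_items: int = 1517, seed: int = 0):
    """Shuffle item indices and cut them into 70% train / 15% val / 15% test,
    per the usage guideline above. The seed is arbitrary."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    n_train = int(0.70 * n_items)
    n_val = int(0.15 * n_items)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices()
print(len(train_idx), len(val_idx), len(test_idx))  # 1061 227 229
```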

A plausible implication is that R4D-Bench may catalyze new region-level approaches, requiring explicit spatial grounding and joint reasoning over time, and serving as a critical yardstick for future multimodal LLMs focused on spatiotemporal understanding.
