PVRBench: Robust Video Reasoning Benchmark

Updated 17 March 2026

PVRBench is a large-scale benchmark that evaluates video-language models’ robustness under realistic visual perturbations, including lighting, camera motion, weather, and occlusion.
It merges indoor and outdoor datasets from UrbanVideo-Bench and VSI-Bench to cover 27 tasks with 9,000 videos and 52,000 QA pairs, enabling diverse evaluations.
The benchmark employs systematic perturbation protocols with controlled severity parameters to quantify drops in accuracy and reasoning quality, highlighting vulnerabilities in current models.

PVRBench is a large-scale video benchmark designed to rigorously assess the real-world robustness and reasoning quality of embodied video large language models (video-LLMs) under realistic visual perturbations. Originating from the study "Are Video Reasoning Models Ready to Go Outside?" (He et al., 11 Mar 2026), PVRBench addresses the empirical gap between model performance in controlled settings and deployment scenarios characterized by lighting changes, camera motion, weather phenomena, and occlusion.

1. Benchmark Construction and Dataset Composition

PVRBench is constructed by merging two embodied video-reasoning benchmarks: UrbanVideo-Bench and VSI-Bench. UrbanVideo-Bench provides 1,547 drone and simulator videos (∼135 hours) across urban outdoor contexts such as residential and waterfront districts, with videos sourced from real DJI Mini 4K flights in Guangdong, EmbodiedCity Unreal Engine simulation of Beijing, and AerialVLN. VSI-Bench contributes 288 egocentric indoor videos from ARKitScenes, ScanNet, and 3RScan, spanning scenes such as living rooms, kitchens, offices, and more.

The combined benchmark yields 9,000 videos and 52,000 question–answer (QA) pairs, systematically covering 27 tasks: 16 navigation and planning tasks (e.g., trajectory captioning, goal detection, action generation) and 11 spatial-reasoning tasks (e.g., size estimation, counting, route planning), encompassing both indoor and outdoor/urban navigation environments. All videos are evaluated zero-shot, with no train/val/test split, so that every model is scored on the same fixed set and results are directly comparable.

Sub-benchmark	# Videos	Scenario	Tasks
UrbanVideo-Bench	1,547	Outdoor, Drone	16 Navigation
VSI-Bench	288	Indoor, Egocentric	11 Reasoning
Merged	9,000	Indoor+Outdoor	27 total

2. Perturbation Taxonomy and Mathematical Modeling

PVRBench introduces realistic, spatially aware, and temporally coherent corruptions simulating environmental disturbances frequently encountered in real-world applications. Four high-level corruption classes, each with multiple subtypes, are instantiated:

Lighting: Dusk, night, overexposure, directional shadow
Camera Motion: Translation (shake), zoom (scale jitter), rotation
Occlusion: Static (e.g., lens smudge), dynamic (e.g., passing objects)
Weather: Fog, rain, snow

Corruption is enacted through a two-phase process:

Temporal shuffling:

$\pi:\{1,\dots,T\}\rightarrow\{1,\dots,T\} \sim \mathrm{Uniform}$

Video frames are permuted to introduce temporal disorder.

Spatial corruption: For corruption style $m$ (lighting, camera, occ, weather), frame $f_t$ is masked via

$f_t' = f_t \odot P_t^{(m)}, \qquad P_t^{(m)} = B_t^{(m)} \odot C_t^{(m)}$

where $B_t^{(m)}$ is a binary structure selecting affected regions (e.g., occlusion or raindrops), and $C_t^{(m)}$ modulates effect strength per pixel.
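
The masking step can be illustrated with a minimal sketch in plain Python. It assumes grayscale frames stored as nested lists with intensities in [0, 1]; the names `hadamard` and `corrupt_frame` are hypothetical and not taken from the source. Read literally, the formula sends every pixel with $B_t^{(m)} = 0$ to zero, and the source does not say how unaffected pixels are preserved, so the sketch simply follows the formula as written.

```python
from typing import List

Grid = List[List[float]]  # H x W grid: a grayscale frame or a per-pixel mask


def hadamard(a: Grid, b: Grid) -> Grid:
    """Elementwise (Hadamard) product of two equally sized grids."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def corrupt_frame(frame: Grid, region: Grid, strength: Grid) -> Grid:
    """Apply f' = f ⊙ P with P = B ⊙ C, following the formula literally.

    region   -- B: binary grid (0 or 1) selecting the affected pixels
    strength -- C: per-pixel effect strength
    """
    mask = hadamard(region, strength)  # P = B ⊙ C
    return hadamard(frame, mask)       # f' = f ⊙ P


if __name__ == "__main__":
    frame = [[0.8, 0.8], [0.4, 0.4]]
    region = [[1, 0], [1, 1]]             # B
    strength = [[0.5, 0.5], [1.0, 0.25]]  # C
    print(corrupt_frame(frame, region, strength))
```

A production pipeline would normally vectorize this with an array library; nested lists are used here only to keep the sketch dependency-free.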

Specific generative models are used for different effects:

Fog: Atmospheric scattering

$I'(x) = I(x) \exp(-\beta d(x)) + A (1-\exp(-\beta d(x)))$

Rain/Snow: Alpha-blended streak masks with orientation and strength modulation.
Shadow/Overexposure: Depth-aware pixel intensity remapping.
Camera Motion: Random affine transformations whose translation, scaling, and rotation are drawn from uniform intervals parameterized by severity $\eta$.
Occlusion: Binary regions with area fraction $\eta$ covering random or semantically meaningful portions of the frame.
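
Of the effects listed above, the fog model is the only one given in closed form, so it is the easiest to sketch. The example below applies the atmospheric scattering equation pixel by pixel. It assumes grayscale intensities in [0, 1], a per-pixel depth map in whatever units $\beta$ is calibrated for, and an airlight value $A = 1.0$; the source does not state $A$ or the depth units, and the function names are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]  # H x W grid of per-pixel values


def fog_pixel(intensity: float, depth: float, beta: float, airlight: float) -> float:
    """I'(x) = I(x) * t + A * (1 - t), with transmission t = exp(-beta * d(x))."""
    transmission = math.exp(-beta * depth)
    return intensity * transmission + airlight * (1.0 - transmission)


def fog_frame(frame: Grid, depth_map: Grid, beta: float, airlight: float = 1.0) -> Grid:
    """Apply the scattering model to every pixel of a grayscale frame.

    airlight=1.0 is an assumed default, not a value from the source.
    """
    return [
        [fog_pixel(i, d, beta, airlight) for i, d in zip(row_i, row_d)]
        for row_i, row_d in zip(frame, depth_map)
    ]


if __name__ == "__main__":
    frame = [[0.2, 0.2], [0.2, 0.2]]
    depth = [[0.0, 50.0], [100.0, 400.0]]  # farther pixels receive more fog
    for row in fog_frame(frame, depth, beta=0.01):
        print([round(v, 3) for v in row])
```

At zero depth the pixel is unchanged, and as depth grows the output approaches the airlight value, which is why distant scene content washes out first.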

Combined space-time corruptions are defined as

$V' = \left\{ f_{\pi(t)} \odot P_t^{(m)} \right\}_{t=1}^T$
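
One detail of this definition is easy to miss: the frame index is permuted but the mask index is not, so output position $t$ pairs frame $f_{\pi(t)}$ with mask $P_t^{(m)}$. A minimal sketch of that pairing follows, using 0-based indices. The helper names and the `apply_mask` callback are hypothetical; any frame and mask representation can be plugged in.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")


def sample_permutation(num_frames: int, rng: random.Random) -> List[int]:
    """Draw pi uniformly at random from all permutations of range(num_frames)."""
    order = list(range(num_frames))
    rng.shuffle(order)
    return order


def corrupt_video(
    frames: Sequence[Frame],
    masks: Sequence[Mask],
    apply_mask: Callable[[Frame, Mask], Frame],
    rng: random.Random,
) -> Tuple[List[Frame], List[int]]:
    """Build V' = [apply_mask(frames[pi[t]], masks[t]) for each t].

    Returns the corrupted sequence together with pi so the shuffle can be audited.
    """
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per output time step")
    pi = sample_permutation(len(frames), rng)
    corrupted = [apply_mask(frames[pi[t]], masks[t]) for t in range(len(frames))]
    return corrupted, pi


if __name__ == "__main__":
    # Scalars stand in for frames and masks to keep the example small.
    frames = [1.0, 2.0, 3.0, 4.0]
    masks = [1.0, 0.5, 0.5, 1.0]
    video, pi = corrupt_video(frames, masks, lambda f, p: f * p, random.Random(0))
    print(pi, video)
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible, which matters when every model must see the same corrupted videos.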

3. Protocols for Perturbation Severity and Application

Severity for each corruption is governed by a parameter $\eta \in \{0.5, 0.7, 0.9\}$; the middle value ($\eta=0.7$) is used for all main benchmarking. For dynamic training protocols, $\eta$ is sampled i.i.d. from $\mathrm{Uniform}[0.5, 0.9]$ at each iteration to mitigate overfitting. Parameter ranges are specified explicitly: for instance, fog density $\beta\sim U(0.003,0.01)$, rain strength $\alpha_r\sim U(0.2,0.6)$, occlusion area $\eta_{\mathrm{occ}}\sim U(0.1,0.4)$, and camera-motion translation $\Delta x, \Delta y\sim U(-\eta W, \eta W)$.
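
The sampling scheme in this paragraph can be written down directly. The sketch below is an illustration under stated assumptions: the dictionary keys and function names are hypothetical, $W$ is taken to be the frame width in pixels, and both $\Delta x$ and $\Delta y$ use $W$ because that is how the range is quoted above.

```python
import random
from typing import Dict

SEVERITY_LEVELS = (0.5, 0.7, 0.9)  # discrete eta values quoted in the text
BENCHMARK_SEVERITY = 0.7           # middle value used for main benchmarking


def sample_training_severity(rng: random.Random) -> float:
    """Draw eta ~ Uniform[0.5, 0.9], redrawn independently per training iteration."""
    return rng.uniform(0.5, 0.9)


def sample_corruption_params(eta: float, frame_width: int, rng: random.Random) -> Dict[str, float]:
    """Draw one set of perturbation parameters from the quoted uniform ranges."""
    max_shift = eta * frame_width
    return {
        "fog_beta": rng.uniform(0.003, 0.01),    # fog density beta
        "rain_alpha": rng.uniform(0.2, 0.6),     # rain strength alpha_r
        "occlusion_area": rng.uniform(0.1, 0.4), # occluded area fraction eta_occ
        "dx": rng.uniform(-max_shift, max_shift),  # horizontal translation
        "dy": rng.uniform(-max_shift, max_shift),  # vertical translation (uses W, as quoted)
    }


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_training_severity(rng))
    print(sample_corruption_params(BENCHMARK_SEVERITY, frame_width=640, rng=rng))
```

Seeding the generator once per benchmark run is one way to obtain the fixed perturbation configurations that the evaluation protocol calls for.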

Each video is rendered into four perturbed variants (one per corruption class) in addition to its clean baseline, ensuring every model is evaluated identically across all conditions.

4. Evaluation Methodology and Metrics

PVRBench assesses both answer robustness and the semantic quality of generated reasoning:

Answer Accuracy:

$\mathrm{Acc} = \frac{\#\{\mathrm{correct\,answers}\}}{\#\{\mathrm{questions}\}}$

Robustness Drop:

$\Delta_{\mathrm{Acc}} = \mathrm{Acc}_{\mathrm{clean}} - \mathrm{Acc}_{\mathrm{perturbed}}$

Reasoning Quality: An LLM judge with curated prompts scores each reasoning trace on five metrics of semantic and logical fidelity (each $\in [0, 5]$):

- Fragility (Fra.): Mean accuracy drop over corruptions
- Consistency (Con.): Semantic similarity between perturbed and clean reasoning (<think>...</think> traces)
- Belief (Bel.): Confidence and logical coherence
- Recovery (Rec.): Adaptive acknowledgment of the perturbation
- Attention (Att.): Focus on relevant visuo-temporal evidence

The overall reasoning average, which excludes Fragility, is

$\mathrm{Avg} = \frac{1}{4}\left[\mathrm{Con} + \mathrm{Bel} + \mathrm{Rec} + \mathrm{Att}\right]$

The evaluation protocol mandates a fixed dataset and fixed perturbation configurations for strict comparability: there is no train/val/test split, and all models are evaluated zero-shot.

5. Empirical Results and Robustness Insights

PVRBench reveals substantial declines in both accuracy and reasoning quality under realistic perturbations across proprietary and open-source video-LLMs:

- Proprietary LLMs: GPT-4o, Gemini-3-Pro, and Claude-3.5-Sonnet suffer drops of 11–17% in accuracy and 10–14% in reasoning score.
- Research Video-Reasoners: Video-R1, VideoChat-R, LLaVA-Video-R, and Embodied-R degrade by up to 22% in accuracy and 20% in reasoning.
- Open-Source Video-LLMs: LLaVA-Video, VideoLLaMA2, VideoChat2, MiniCPM-V, InternVL2.5, Qwen2.5-VL, and Qwen3-VL show maximal drops of 35% in accuracy and 28% in reasoning.

| Model | Clean Acc. | Avg. Perturbed Acc. (↓ vs. clean) | Reasoning Avg. (↓ vs. clean) |
|------------------|------|-------------|-------------|
| GPT-4o | 0.59 | 0.51 (↓14%) | 3.39 (↓11%) |
| Embodied-R (7B) | 0.54 | 0.42 (↓22%) | 2.78 (↓19%) |
| + ROVA (7B) | 0.55 | 0.50 (↓9%) | 3.12 (↓13%) |
| Qwen2.5-VL (7B) | 0.51 | 0.33 (↓35%) | 2.55 (↓25%) |
| + ROVA (7B) | 0.53 | 0.47 (↓11%) | 2.99 (↓15%) |

The accuracy percentages are relative drops from the clean score rather than percentage-point differences: GPT-4o falls from 0.59 to 0.51, which is 8 percentage points, or about 14% of its clean accuracy.

The ROVA framework, evaluated as an intervention, reduces the relative accuracy drop from 22% to 9% on Embodied-R and from 35% to 11% on Qwen2.5-VL, and is reported to yield at least 24% relative accuracy improvement and at least 9% relative gain in reasoning. This suggests robustness-aware training can partially mitigate, but not eliminate, the vulnerability of state-of-the-art video-LLMs to realistic spatio-temporal distortions.

6. Benchmarking Workflow and Analytical Procedures

The standardized pipeline for PVRBench evaluation involves:

1. Generating four corrupted variants for each video, targeting Lighting, Occlusion, Camera Shake, and Weather.
2. Running inference with each candidate model on both clean and perturbed videos.
3. Extracting final answers and reasoning traces (marked with <think>...</think>).
4. Computing accuracy and reasoning metric profiles per model–corruption–task tuple.
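
The final step relies on the metrics defined in Section 4. A minimal sketch of those computations follows; the function names are hypothetical, answers are compared by exact string match (an assumption, since the source does not describe answer matching), and `relative_drop` is included because the percentages reported in Section 5 match a drop expressed as a fraction of clean accuracy.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Acc = (# correct answers) / (# questions), using exact string match."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def robustness_drop(acc_clean: float, acc_perturbed: float) -> float:
    """Delta_Acc = Acc_clean - Acc_perturbed (absolute difference)."""
    return acc_clean - acc_perturbed


def relative_drop(acc_clean: float, acc_perturbed: float) -> float:
    """Drop expressed as a fraction of the clean accuracy."""
    return (acc_clean - acc_perturbed) / acc_clean


def reasoning_average(con: float, bel: float, rec: float, att: float) -> float:
    """Mean of the Consistency, Belief, Recovery, and Attention scores."""
    return (con + bel + rec + att) / 4.0


if __name__ == "__main__":
    # GPT-4o figures as reported in Section 5: 0.59 clean, 0.51 perturbed.
    print(round(robustness_drop(0.59, 0.51), 4))
    print(round(relative_drop(0.59, 0.51), 4))
    print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))
```

For the GPT-4o figures the absolute drop is 0.08 and the relative drop is about 0.136, consistent with the 14% shown in the results table. Repeating the calculation for each model, corruption class, and task gives the per-tuple profile described in step 4.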

This workflow quantifies the degradation of both answer correctness and the fidelity of reasoning chains, giving a per-model, per-corruption, per-task view of robustness. The LLM-judge prompts used for reasoning assessment are built on the constructs detailed in Appendix A.3 of the source paper.

7. Significance and Future Implications

PVRBench fills a critical gap by providing a consistent, diverse, and systematically corrupted suite of video-reasoning tasks, exposing models to the spectrum of disturbances expected in real-world deployment. The explicit separation of corruption modalities and the fixed zero-shot evaluation protocol allow for precise cross-model comparison. A plausible implication is that future robust video-LLM development must prioritize not just answer accuracy but also the reliability of reasoning traces under complex sensory perturbation. PVRBench establishes a reproducible and extensible standard for evaluating embodied video reasoning in realistic settings, guiding future work in both model architecture and robustness-oriented training (He et al., 11 Mar 2026).


References

He et al., "Are Video Reasoning Models Ready to Go Outside?" (11 Mar 2026)
