PhysRVGBench: Physics-Aware Visual Benchmark
- PhysRVGBench is a comprehensive evaluation framework that measures physical realism in both video generation and 3D multi-view reconstruction tasks.
- It employs detailed metrics: mask IoU and trajectory offset for videos, and photometric and geometric measures for 3D reconstruction under authentic degradation conditions.
- The benchmark leverages real-world degradations and reinforcement learning-based physics enforcement to highlight and improve model robustness and physical fidelity.
PhysRVGBench refers to two distinct but thematically related benchmarks emerging in visual computing: (1) a dedicated evaluation suite for physics-aware video generative models with precise rigid-body motion assessment, and (2) a physically degraded real-capture 3D benchmark for multi-view visual restoration and reconstruction. Both target the quantification of physical realism and robustness under fundamental real-world constraints, explicitly addressing the shortcomings of existing benchmarks in enforcing or evaluating physical laws and handling real physical degradations in visual generation and reconstruction tasks (Zhang et al., 16 Jan 2026, Liu et al., 29 Dec 2025).
1. Motivation and Rationale
Modern generative and reconstructive visual systems, especially those based on transformers and diffusion models, demonstrate compelling visual fidelity yet typically fail to capture or preserve global physical constraints, such as Newtonian rigid-body dynamics, and often lack robustness to photometric and geometric degradations. PhysRVGBench is introduced to comprehensively benchmark these aspects for two major problem classes:
- Video Generation: Addressing the absence of explicit physics enforcement in text-to-video (T2V) and video-to-video (V2V) pipelines, particularly for rigid-body phenomena, by providing metrics and data supporting the quantitative assessment of physically plausible motion (Zhang et al., 16 Jan 2026).
- 3D Multi-view Reconstruction: Quantifying the impact of real-world degradations (illumination, scatter, occlusion, blur) on the robustness of reconstruction and novel view synthesis (NVS) pipelines, contrasting performance with synthetic corruptions and revealing failure modes that undermine geometric or photometric consistency (Liu et al., 29 Dec 2025).
This dual use of the PhysRVGBench name highlights a convergence in benchmarking priorities: the need for rigorously designed, physics-grounded datasets and protocols tailored to the specifics of both generative video modeling and physically faithful multi-view 3D understanding.
2. Structure of PhysRVGBench for Video Generation
PhysRVGBench provides an evaluation framework specifically constructed for assessing physics-aware video generative models (Zhang et al., 16 Jan 2026). It comprises 700 RGB video clips (480 × 832 px, 30 fps, 49 frames) with:
- Annotated first-frame coordinates for object(s): one coordinate for single-body (free fall, pendulum, rolling), two for collisions.
- Automatically extracted fine-grained motion masks ($M_t$) obtained with SAM2, prompted by the ground-truth coordinate seeds.
- Derived center-of-mass 2D trajectories ($c_t$), computed via a center-of-mass operator applied to the motion masks.
- Explicit identification of rigid-body motion regimes: collision (two bodies), pendulum, free fall, and rolling, with diverse backgrounds, materials, and scenes (synthetic/real, indoor/outdoor).
- Approximately 650 training and 50 held-out evaluation clips.
Models are conditioned on a text prompt indicating rigid-body motion and a five-frame context (the first five ground-truth frames), and are tasked with autoregressively generating the remaining 44 frames.
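A minimal sketch of this conditioning protocol, assuming a hypothetical `model.generate(prompt, frames)` interface (the interface and array shapes are illustrative, not the benchmark's actual API):

```python
import numpy as np

CONTEXT_FRAMES = 5   # frames given as conditioning context
TOTAL_FRAMES = 49    # full clip length in PhysRVGBench

def rollout(model, clip: np.ndarray, prompt: str) -> np.ndarray:
    """Condition on the prompt and the first five frames, then autoregressively
    generate the remaining 44 frames one at a time."""
    frames = list(clip[:CONTEXT_FRAMES])                 # (5, H, W, 3) context
    while len(frames) < TOTAL_FRAMES:
        nxt = model.generate(prompt, np.stack(frames))   # hypothetical call
        frames.append(nxt)
    return np.stack(frames)                              # (49, H, W, 3)
```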
3. Structure of PhysRVGBench for 3D Multi-view Reconstruction
Inspired by RealX3D (Liu et al., 29 Dec 2025), PhysRVGBench in the 3D context defines a real-capture benchmark for multi-view visual restoration and reconstruction under authentic physical degradations. Key features include:
- Exposure to four degradation families (illumination, scattering, occlusion, blur) with multiple, directly controlled severities per type.
- Pixel-aligned low-quality (LQ) and ground-truth (GT) view pairs for each camera pose, acquired along matched trajectories with a programmable rail-dolly system.
- Per-view 14-bit RAW images for LQ, high-exposure 16-bit sRGB for GT, and dense, metrically calibrated geometry from high-precision laser scanning (Leica BLK360 G2).
- Comprehensive camera calibration (Sony A7 IV, fixed intrinsics), COLMAP SfM pose estimation (performed solely on clean GT), and registration of reconstructed point clouds to the physical mesh with global transforms.
- Authentic degradation modeling rather than synthetic: e.g., real smoke for scattering, physical occluders or glass plates (for dynamic or reflective occlusion), actual misfocus/motion for blur with quantifiable parameters.
- Complete data allows per-pixel ground-truth depth by rasterizing registered mesh geometry into each camera frame.
- Each degradation conforms to explicit mathematical models for direct transmission, additive radiance, blur, and noise.
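As a plausible illustration of such models (the symbols below are assumptions rather than the benchmark's published notation), the scattering family follows the standard direct-transmission-plus-airlight formulation, while blur acts as a convolution with additive sensor noise:

$$I(\mathbf{x}) = J(\mathbf{x})\,t(\mathbf{x}) + A\bigl(1 - t(\mathbf{x})\bigr), \qquad I_{\mathrm{blur}}(\mathbf{x}) = (J * k)(\mathbf{x}) + n(\mathbf{x}),$$

where $J$ is the latent clean radiance, $t$ the per-pixel direct transmission, $A$ the additive ambient (airlight) radiance, $k$ the defocus or motion kernel, and $n$ the sensor noise; low illumination can likewise be modeled as an exposure scaling of $J$ followed by noise.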
4. Evaluation Protocols and Quantitative Metrics
The suite is designed for both qualitative and quantitative assessment of physical fidelity. For video generation (Zhang et al., 16 Jan 2026):
- Intersection-over-Union (IoU): Mean per-frame mask overlap,
  $$\mathrm{IoU} = \frac{1}{T}\sum_{t=1}^{T} \frac{|\hat{M}_t \cap M_t|}{|\hat{M}_t \cup M_t|},$$
  where $\hat{M}_t \cap M_t$ and $\hat{M}_t \cup M_t$ denote the spatial overlap and union between generated and ground-truth motion masks.
- Trajectory Offset (TO): Average Euclidean deviation between predicted and ground-truth object centers,
  $$\mathrm{TO} = \frac{1}{T}\sum_{t=1}^{T} \bigl\lVert \hat{c}_t - c_t \bigr\rVert_2$$
  (a minimal computation of both metrics is sketched after this list).
- Collision-Weighted Offset: Weighted TO accounting for collision-critical frames, with frame weights emphasizing collision and adjacent frames.
- RL Fine-tuning Reward: The final reward for reinforcement learning combines the mask-overlap (IoU) and trajectory-offset terms.
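A minimal sketch of the IoU and TO computations, assuming per-frame boolean masks of shape (T, H, W) and 2D center arrays of shape (T, 2) (array layouts and function names are illustrative):

```python
import numpy as np

def mean_iou(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
    """Mean per-frame IoU between generated and ground-truth motion masks
    (boolean arrays of shape (T, H, W))."""
    inter = np.logical_and(pred_masks, gt_masks).sum(axis=(1, 2))
    union = np.logical_or(pred_masks, gt_masks).sum(axis=(1, 2))
    return float(np.mean(inter / np.maximum(union, 1)))

def trajectory_offset(pred_centers: np.ndarray, gt_centers: np.ndarray) -> float:
    """Mean Euclidean distance (in pixels) between predicted and ground-truth
    object centers (float arrays of shape (T, 2))."""
    return float(np.mean(np.linalg.norm(pred_centers - gt_centers, axis=1)))
```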
Additional metrics include VBench (visual consistency, motion smoothness, temporal flicker, etc.) and VideoPhy-2 (semantic alignment, physical consistency).
For 3D reconstruction (Liu et al., 29 Dec 2025):
- Photometric metrics: PSNR, SSIM, and LPIPS, reported for both observed and unseen views.
- Geometric metrics: Mean absolute depth error, pose error (AUC@5°, 10°, 20°), and surface correspondence (accuracy, completeness, F1 at a 5 cm threshold); an illustrative F1 computation is sketched after this list.
- Point Cloud Quality: Per-view and aggregate metrics for global structure and local detail preservation under degradations.
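An illustrative computation of accuracy, completeness, and F1 at the 5 cm threshold, assuming both point clouds are (N, 3) arrays in metric units (the function name and nearest-neighbour implementation are assumptions, not the benchmark's released code):

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_f1(pred_pts: np.ndarray, gt_pts: np.ndarray, tau: float = 0.05):
    """Accuracy, completeness, and F1 between reconstructed and ground-truth
    point clouds at a distance threshold tau (meters; 0.05 = 5 cm)."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)    # accuracy direction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)    # completeness direction
    accuracy = float(np.mean(d_pred_to_gt < tau))
    completeness = float(np.mean(d_gt_to_pred < tau))
    f1 = 2 * accuracy * completeness / max(accuracy + completeness, 1e-8)
    return accuracy, completeness, f1
```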
5. Data Acquisition and Processing Pipeline
Video Generation:
- Motion masks are extracted per frame with SAM2, seeded by manual first-frame inputs.
- Trajectories are computed as the mean pixel within each motion mask.
- Collision events are identified as peaks in per-frame acceleration, derived by finite differencing the center trajectories ($v_t = c_{t+1} - c_t$, $a_t = v_{t+1} - v_t$), with peaks detected using SciPy's find_peaks (as sketched below).
- Environment setup and metric scripts are provided for full reproducibility, including preprocessing and batch evaluation.
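A minimal sketch of the trajectory and collision-detection steps above, assuming per-frame boolean masks of shape (T, H, W); only SciPy's find_peaks is named by the benchmark, the remaining code is illustrative:

```python
import numpy as np
from scipy.signal import find_peaks

def mask_centers(masks: np.ndarray) -> np.ndarray:
    """Center-of-mass (mean pixel coordinate) of each per-frame motion mask.
    Assumes every frame has a non-empty mask."""
    centers = []
    for m in masks:
        ys, xs = np.nonzero(m)
        centers.append((xs.mean(), ys.mean()))
    return np.asarray(centers)  # shape (T, 2), (x, y) per frame

def collision_frames(centers: np.ndarray, prominence: float = 1.0) -> np.ndarray:
    """Frames at which the acceleration magnitude peaks (candidate collisions)."""
    velocity = np.diff(centers, axis=0)        # v_t = c_{t+1} - c_t
    acceleration = np.diff(velocity, axis=0)   # a_t = v_{t+1} - v_t
    accel_mag = np.linalg.norm(acceleration, axis=1)
    peaks, _ = find_peaks(accel_mag, prominence=prominence)
    return peaks + 1  # acceleration index t is centered on frame t + 1
```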
3D Reconstruction:
- Programmable rail-dolly acquisition ensures pixel-aligned LQ/GT pairs.
- Laser scanning and HDR capture enable dense, metric ground-truth geometry.
- Unified calibration and pose alignment procedures guarantee strict registration of all modalities.
- Real degradations are introduced via controlled lighting, fog machines, occluders, and mechanical lens manipulations.
- RAW sensor measurements support precise radiometric modeling, with calibration fits from reference charts.
6. Benchmarking Results and Comparative Analysis
For video generation (Zhang et al., 16 Jan 2026), baseline results on the evaluation set (IoU↑/TO↓):
| Model | IoU | TO |
|---|---|---|
| Wan2.2–5B (I2V) | 0.15 | 162.78 |
| Kling2.5 (I2V) | 0.23 | 103.22 |
| Magi-1 (V2V) | 0.27 | 113.42 |
| PhysRVG | 0.64 | 15.03 |
- V2V models outperform I2V models by leveraging the provided context frames more effectively.
- PhysRVG, with physics-aware RL and the Mimicry-Discovery Cycle (MDcycle), achieves a 2× boost in IoU and >7× reduction in TO relative to the best prior V2V competitor, substantiating the effectiveness of RL-based physical law enforcement.
For 3D reconstruction (Liu et al., 29 Dec 2025), physical degradations severely impact both optimization-based pipelines (NeRF, Gaussian Splatting) and feed-forward foundation models, with real scattering, blur, and low light causing pronounced performance declines (e.g., the best PSNRs under strong smoke or blur fall below 11 dB and 21 dB, respectively, and F1 scores drop by up to a factor of two). This suggests current pipelines lack robustness when confronted with authentic physical phenomena.
7. Implications, Limitations, and Availability
PhysRVGBench for both video generation and 3D reconstruction sets new standards for evaluation under real physical constraints. For video generation, it uniquely enables the measurement of Newtonian law internalization rather than data-driven distributional matching. For 3D reconstruction, pixel-aligned, real-capture LQ/GT pairs across major degradation scenarios allow for the precise diagnosis of model vulnerabilities.
A plausible implication is that the use of physics-driven training feedback and physically grounded benchmarks will become essential for progress in both video synthesis and 3D geometric understanding.
The evaluation scripts, subset of evaluation data, and pretrained configurations for video models are to be made publicly accessible at https://lucaria-academy.github.io/PhysRVG/ (Zhang et al., 16 Jan 2026). The protocols for real-capture and metric depth recovery are detailed in (Liu et al., 29 Dec 2025).
References:
- PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models (Zhang et al., 16 Jan 2026)
- RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction (Liu et al., 29 Dec 2025)