
PhysVidBench: Evaluating Physical Commonsense in T2V

Updated 22 July 2025
  • PhysVidBench is a benchmark designed to assess text-to-video models' capacity for physical commonsense reasoning in everyday tasks.
  • It uses a structured, multi-stage pipeline with base and upsampled prompt generation to capture tool use, material properties, and procedural sequences.
  • The evaluation reveals persistent challenges in spatial and temporal reasoning, prompting future work in physically grounded pre-training and object-centric modeling.

PhysVidBench is a domain-specific benchmark designed to rigorously assess the physical commonsense reasoning capabilities of modern text-to-video (T2V) generation models. The benchmark moves beyond standard video realism or surface-level object manipulation by focusing on tool use, material properties, procedural sequences, and causal reasoning in everyday tasks—areas where physical plausibility is essential and where large T2V models have historically struggled.

1. Benchmark Structure and Design

PhysVidBench comprises 383 unique prompts, each distilled from the PIQA dataset—a resource known for capturing real-world, goal–solution physical interactions. The selection process was tailored to exclude rote or trivial examples, targeting instead those that require genuine physical reasoning about tool affordances and procedural outcomes. Gemini 2.5 Pro was used for lightweight pre-filtering, ensuring that retained prompts emphasize secondary tool usage or non-obvious affordances (e.g., using a credit card to scrape ice, using a paperclip to eject a SIM tray).
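
To make this concrete, the following sketch frames the pre-filtering step as a yes/no classification over PIQA goal–solution pairs; the `llm_yes_no` helper stands in for the Gemini 2.5 Pro call, and the criterion wording is illustrative only, not the authors' exact rubric.

```python
# A sketch of the pre-filtering step, assuming a generic `llm_yes_no` placeholder
# in place of the actual Gemini 2.5 Pro call; the criterion wording is illustrative,
# not the authors' exact rubric.
from typing import List, Tuple


def llm_yes_no(question: str) -> bool:
    """Placeholder for a lightweight Gemini 2.5 Pro yes/no classification."""
    raise NotImplementedError


def filter_piqa_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep PIQA goal-solution pairs that demand non-obvious physical reasoning."""
    kept = []
    for goal, solution in pairs:
        criterion = (
            "Does solving this task involve secondary tool use or a non-obvious "
            "affordance, rather than a rote or trivial action? "
            f"Goal: {goal} Solution: {solution}"
        )
        if llm_yes_no(criterion):
            kept.append((goal, solution))
    return kept
```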

Prompts proceed through a two-stage refinement:

  • Base Prompt Generation: Goal–solution pairs are transformed into self-contained, visually demonstrable video prompts, grounded in observable physical scenarios.
  • Upsampled Prompt Generation: Gemini 2.5 Pro further enriches these base prompts, explicitly marking material details, causal mechanisms, manipulations, and affordances, but always maintaining closed-world integrity.

Each base prompt thus has an “upsampled” variant, providing a total of 766 prompts (383 pairs) and allowing the evaluation to probe physical understanding at two levels of descriptive fidelity.
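
The pairing of base and upsampled prompts can be sketched as follows, with a generic `call_llm` wrapper standing in for Gemini 2.5 Pro; the `PromptPair` structure and instruction strings are illustrative, not the benchmark's actual templates.

```python
# A sketch of the two-stage prompt refinement, assuming a generic `call_llm`
# wrapper around Gemini 2.5 Pro. `PromptPair`, `call_llm`, and the instruction
# strings are illustrative names, not the benchmark's actual templates.
from dataclasses import dataclass


def call_llm(instruction: str, text: str) -> str:
    """Placeholder for a Gemini 2.5 Pro call; swap in a real client here."""
    raise NotImplementedError


@dataclass
class PromptPair:
    goal: str               # PIQA goal, e.g. "remove ice from a windshield"
    solution: str           # PIQA solution, e.g. "scrape it off with a credit card"
    base_prompt: str        # self-contained, visually demonstrable video prompt
    upsampled_prompt: str   # enriched with materials, causal mechanisms, affordances


def build_prompt_pair(goal: str, solution: str) -> PromptPair:
    # Stage 1: base prompt grounded in an observable physical scenario.
    base = call_llm(
        "Rewrite this goal-solution pair as a short, self-contained video prompt "
        "describing an observable physical scenario.",
        f"Goal: {goal}\nSolution: {solution}",
    )
    # Stage 2: upsampled prompt with explicit materials, causal mechanisms,
    # manipulations, and affordances, without introducing new objects or events.
    upsampled = call_llm(
        "Enrich this video prompt with explicit material details, causal mechanisms, "
        "manipulations, and affordances. Do not introduce new objects or events.",
        base,
    )
    return PromptPair(goal, solution, base, upsampled)
```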

2. Evaluation Pipeline and Methodology

PhysVidBench introduces a modular, multi-stage evaluation pipeline:

  1. Physics-Grounded Question Generation: For every upsampled prompt, targeted yes/no questions are generated (on average, 11 per prompt, 4123 in total), covering fundamental physics, affordance detection, spatial reasoning, temporal dynamics, procedural action, material transformation, and force/motion. The questions are designed to isolate key physical principles and avoid ambiguous or speculative reasoning.
  2. Dense Video Captioning: Each generated video is captioned automatically using the AuroraCap model. This produces one general description and seven dimension-specific captions, with each focused on a separate reasoning aspect (e.g., spatial layout, force propagation, material change).
  3. LLM-Based Judging: The eight captions for each video are passed to an LLM judge (Gemini-2.5-Flash-Preview-04-17), which receives the physics questions and must answer yes/no based exclusively on caption content, intentionally eschewing direct video analysis. A question is marked correct if at least one caption supports an affirmative answer.

The final score for each video is the proportion of its physics questions answered correctly; model-level accuracy then aggregates these binary judgments across all prompts and reasoning dimensions.
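
A minimal sketch of this scoring logic follows, with a `judge_from_caption` placeholder standing in for the Gemini-2.5-Flash judge and per-caption querying assumed as an implementation detail; the exact prompting and answer parsing are assumptions.

```python
# A sketch of the caption-based scoring described above, assuming a
# `judge_from_caption` placeholder for the Gemini-2.5-Flash judge and
# per-caption querying; the exact prompting and answer parsing are assumptions.
from typing import List


def judge_from_caption(question: str, caption: str) -> bool:
    """Ask the LLM judge whether `caption` supports a 'yes' answer to `question`."""
    raise NotImplementedError


def score_video(questions: List[str], captions: List[str]) -> float:
    """Fraction of physics questions supported by at least one of the eight captions."""
    correct = 0
    for question in questions:
        # One general caption plus seven dimension-specific captions; a question
        # counts as correct if any of them supports an affirmative answer.
        if any(judge_from_caption(question, caption) for caption in captions):
            correct += 1
    return correct / len(questions) if questions else 0.0


def benchmark_accuracy(per_video_scores: List[float]) -> float:
    """Model-level accuracy: mean of per-video scores across all prompts."""
    return sum(per_video_scores) / len(per_video_scores)
```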

3. Model Coverage and Performance Characteristics

PhysVidBench systematically benchmarks a wide spectrum of T2V models, including open-source and proprietary options (VideoCrafter2, CogVideoX 2B/5B, Wan2.1 1.3B/14B, MAGI-1, Hunyuan Video, Cosmos 7B/14B, Sora, Veo-2, LTX-Video). The evaluation reveals several consistent findings:

  • Model scaling and prompt upsampling both improve physical reasoning performance, but the effect is moderate, and absolute scores remain low, even for state-of-the-art models.
  • Spatial reasoning (e.g., evaluating whether the manipulated object correctly interacts with secondary items) and temporal dynamics (e.g., correct causal sequence of actions) are persistently challenging, often lagging behind more “surface” tasks like affordance display or object identification.
  • Surprising cases emerge where smaller models (such as CogVideoX-2B) occasionally outperform their larger counterparts, indicating that parameter scaling alone does not guarantee improvement in physical commonsense.

4. Challenges in Physical Commonsense Evaluation

Direct evaluation of generated videos using vision-language models (VLMs) is plagued by hallucination and over-/under-detection errors, producing unreliable physical judgments. PhysVidBench’s indirect, caption-based evaluation pipeline addresses this by:

  • Leveraging dense, multi-perspective textual evidence instead of relying solely on the VLM’s interpretation of pixel data.
  • Disentangling surface realism from genuine reasoning by requiring that questions be answered strictly on the basis of captioned content, minimizing confirmation bias and artifacts associated with direct video–question pipelines.
  • Reducing false positives by aggregating over several captions and dimensions, increasing statistical robustness and interpretability.

This methodology highlights not only whether models can render visually plausible events, but whether these visualizations respect the deeper physics required by the prompt.

5. Implications and Future Directions

PhysVidBench reveals that while current models can frequently handle surface-level affordances or visually recognize some tools, their reasoning about nontrivial spatial arrangements, causal progression, and material properties remains underdeveloped. The authors identify several paths forward:

  • Temporal abstraction and object-centric modeling: Future systems may benefit from architectures that better encode object relations and temporal dependencies, enabling more authentic procedural and causal reasoning.
  • Physically grounded pre-training: Embedding physics-based objectives or experiences (e.g., through simulation environments) during model training could inject necessary inductive bias, moving models beyond mere pixel imitation.
  • Expanding commonsense domains: While PhysVidBench focuses on tool use and material interaction, extending into other areas (e.g., fluid dynamics, collision, force propagation in unfamiliar settings) is a promising avenue for comprehensive evaluation.

6. Significance and Distinction in the Benchmark Landscape

PhysVidBench stands out among modern physics-aware video benchmarks for its structured, interpretable approach focused on everyday procedural physical reasoning, especially tool use and affordances. Its three-stage pipeline—question generation, dimension-wise captioning, and indirect LLM-based judgment—enables reproducible, challenge-specific assessment, substantially mitigating issues of VLM hallucination and bias that confound other approaches.

The benchmark thus serves as an advanced diagnostic tool for the field, quantifying both progress and persistent limitations in physical commonsense reasoning across leading video generation models, and providing a clear platform for iterative improvement and future research.