VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation (2503.06800v1)

Published 9 Mar 2025 in cs.CV

Abstract: Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions remains unclear (e.g., playing tennis, backflip). Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code is available at https://videophy2.github.io/.

Summary

Insights into VideoPhy-2: Evaluating Physical Commonsense in Video Generation

The paper "VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation" introduces VideoPhy-2, a benchmark designed to assess the physical commonsense capabilities of video generation models. By focusing on action-centric datasets and rigorous human evaluations, this work fills significant gaps in the current evaluation frameworks for video generative models, especially in the context of physical realism and semantic adherence.

VideoPhy-2 addresses several limitations of existing benchmarks by incorporating a comprehensive dataset of 3,940 diverse prompts covering 200 distinct actions. The dataset is notably larger and more varied than its predecessor, VideoPhy, and emphasizes both object interactions and complex physical activities such as sports. This extensive coverage supports a robust evaluation of video generative models, challenging them to capture and reproduce real-world physical interactions and dynamics accurately.

Core Contributions

Several key contributions set VideoPhy-2 apart:

  1. Action-Centric Dataset: VideoPhy-2 is constructed with a focus on diverse real-world actions that test various physical laws. The dataset comprises prompts across physical activities, aiming to evaluate the ability of generative models to depict intuitive physics accurately.
  2. Physical Commonsense Focus: The benchmark goes beyond mere semantic alignment with text prompts by evaluating the physical plausibility of generated videos. Human annotators assess both semantic adherence and physical commonsense using a 5-point Likert scale, providing a nuanced review of model outputs.
  3. Physical Rules Annotations: For a more granular analysis, VideoPhy-2 annotates the specific physical rules and laws (e.g., gravity, momentum conservation) that each generated video should obey. This helps pinpoint where models falter in replicating physical realism (a sketch of such an annotation record follows this list).
  4. Automatic Evaluation Tool: A notable addition is VideoPhy-2-Autoeval, an automatic evaluator built on VideoCon-Physics that scores semantic adherence, physical commonsense, and rule compliance. It shows improved alignment with human judgments over baseline automatic evaluators, enabling fast, scalable assessment.
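
To ground items 2 and 3, here is a minimal Python sketch of what a single human-annotated record might look like: Likert ratings for semantic adherence and physical commonsense plus per-rule compliance labels. The field names and schema are illustrative assumptions, not the paper's released data format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one generated video's human annotation;
# the released VideoPhy-2 data may use different field names.
@dataclass
class VideoAnnotation:
    prompt: str                       # text prompt given to the video generator
    action: str                       # one of the ~200 curated actions
    semantic_adherence: int           # 1-5 Likert rating from a human annotator
    physical_commonsense: int         # 1-5 Likert rating from a human annotator
    rule_labels: dict = field(default_factory=dict)  # physical rule -> followed?

example = VideoAnnotation(
    prompt="A person performs a backflip on a trampoline.",
    action="backflip",
    semantic_adherence=5,
    physical_commonsense=2,
    rule_labels={
        "gravity": True,
        "conservation of momentum": False,  # e.g., the body's spin changes mid-air
    },
)
```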

Numerical Findings and Challenges

The numerical results highlight substantial gaps in the current capabilities of video generation models. The best-performing model, Wan2.1-14B, achieves only 32.6% joint performance (high semantic adherence and physical commonsense simultaneously) on the full dataset and 21.9% on the curated hard subset. This leaves significant room for improvement, especially in scenarios involving complex physical interactions. The paper identifies conservation laws, such as conservation of mass and momentum, as particularly challenging; they are frequently violated in model outputs.
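
To make these numbers concrete, the following is a minimal sketch of how joint performance and per-rule violation rates could be tallied from the human ratings described above. The 4-out-of-5 binarization threshold and the dict-based rule labels are assumptions for illustration, not the paper's released evaluation code.

```python
from collections import Counter

def joint_performance(ratings, threshold=4):
    """Fraction of videos whose semantic-adherence and physical-commonsense
    Likert ratings (1-5) both reach `threshold`. The threshold of 4 is an
    assumed binarization, not necessarily the paper's exact protocol."""
    if not ratings:
        return 0.0
    passed = sum(1 for sa, pc in ratings if sa >= threshold and pc >= threshold)
    return passed / len(ratings)

def rule_violation_rates(per_video_rule_labels):
    """Per-rule violation frequency, given one dict per video that maps a
    physical rule to whether the generated video followed it."""
    violated, seen = Counter(), Counter()
    for labels in per_video_rule_labels:
        for rule, followed in labels.items():
            seen[rule] += 1
            violated[rule] += (not followed)
    return {rule: violated[rule] / seen[rule] for rule in seen}

# Toy usage with made-up ratings and rule labels.
print(joint_performance([(5, 4), (5, 2), (3, 5)]))  # -> 0.333...
print(rule_violation_rates([
    {"gravity": True, "conservation of momentum": False},
    {"gravity": True, "conservation of momentum": False},
    {"gravity": False},
]))  # momentum conservation violated more often than gravity
```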

Models trained on diverse multimodal datasets, such as Wan2.1-14B, perform comparatively better, underscoring the need for rich training data that spans a wide range of actions and interactions. Meanwhile, closed models like Ray2 lag behind, perhaps due to a narrower focus in their training regimes.

Theoretical and Practical Implications

Theoretically, VideoPhy-2 advances the discourse on evaluating the physical reasoning capabilities of generative models, pushing the boundaries of current benchmarks by emphasizing action-centric scenarios. Practically, it serves as a guide for developing next-generation video models that can act as general-purpose world simulators—systems that accurately recapitulate physical realities in a variety of scenarios.

Future Directions

VideoPhy-2 opens new avenues for research in improving the physical commonsense abilities of generative models. Future work could focus on enhancing model architectures to better capture complex physical interactions and explore larger, more diverse datasets to train models effectively. Additionally, improving the integration of physical laws in model design and evaluation could lead to significant advancements in generating more realistic and physically coherent videos.

By shedding light on the gaps and opportunities in video generative models, VideoPhy-2 provides a rich testing ground for future innovations in AI, aiming for systems that more closely mimic the nuances of the physical world.