FinePhyEval Dataset
- FinePhyEval is a benchmark dataset designed to assess the physical plausibility of text-to-video outputs using standardized, physics-based prompts.
- The dataset employs a hierarchical PQSG pipeline to evaluate object fidelity, action accuracy, and physics plausibility with detailed human and automated annotations.
- Covering phenomena such as gravity, fluid dynamics, optics, and thermodynamics, FinePhyEval facilitates granular model comparisons and sub-task diagnostics.
FinePhyEval is a benchmark dataset designed to facilitate fine-grained, reproducible evaluation of physical plausibility in text-to-video generation, especially for scenarios that challenge video models’ understanding of basic physical laws. Introduced to validate and support the Physics Question Scene Graph (PQSG) hierarchical evaluation pipeline, FinePhyEval provides a curated set of physics-based prompts, generated videos from multiple state-of-the-art models, and extensive human and machine annotations. Source prompts are derived from the Physics-IQ benchmark and span phenomena such as gravity, fluid dynamics, optics, and thermodynamics. Each prompt is paired with corresponding video outputs from closed-source (Sora 2, Veo 3) and open-source (Wan 2.1) models, enabling direct comparison of physical realism, object fidelity, and action faithfulness under controlled evaluation settings (Pothiraj et al., 24 Jun 2026).
1. Design and Scope
FinePhyEval was constructed to address the scarcity of granular, localizable evaluation resources for generated video, particularly for determining whether outputs obey physical laws and faithfully render the objects and interactions specified in the prompt. Its main goals are:
- To benchmark diverse text-to-video models on physics-heavy tasks using standardized, physics-challenging prompts.
- To enable both coarse Likert-scale scoring and fine-grained, graph-based analysis of video outputs.
- To rigorously measure alignment between automated PQSG metrics and human judgment for object, action, and physics dimensions.
- To provide a basis for sub-task evaluation of PQSG stages, specifically Question Generation (QG) and Question Answering (QA).
Key dataset statistics:
| Attribute | Value |
|---|---|
| Physics-based prompts | 65 (from Physics-IQ, Motamed et al., 2025) |
| Generated videos | 195 (65 prompts × 3 models) |
| Video models | Sora 2, Veo 3, Wan 2.1 |
| Avg. video length | 4.39 s |
| Resolutions / fps | Sora 2: 720×1080 @ 30; Veo 3: 1280×720 @ 24; Wan 2.1: 1280×720 @ 16 |
2. Construction Methodology
Prompt Selection
All physics prompts are sourced verbatim from the Physics-IQ benchmark and are explicitly constructed to elicit interactions that probe models’ understanding of gravity, collisions, fluid dynamics, optics, thermodynamics, and magnetism.
Video Generation
For each prompt, three videos are generated using distinct text-to-video models: Sora 2 (OpenAI), Veo 3 (Google), and Wan 2.1 (open-source). Cosmos-Predict2.5-14B videos were also produced for exploratory analyses but are not part of the core annotated set.
Annotation Protocol
Annotations are produced by eight non-author human judges per video, using strict protocols:
- Likert-scale judgments (1–5) for four categories: object fidelity, action accuracy, physics plausibility, and overall alignment.
- No partial credit for hallucinated or missing objects/actions.
- Physics violations assessed independently of text-prompt alignment.
- For 20 prompts, expert annotators hand-write ground-truth PQSG question sets.
- For a 30-prompt/video subset, human binary QA answers are collected, totaling 444 QA judgments.
3. Dataset Content and Structure
Physical Phenomenon Coverage
| Phenomenon | # Prompts |
|---|---|
| Solid Mechanics | 38 |
| Fluid Dynamics | 15 |
| Optics | 8 |
| Thermodynamics | 3 |
| Magnetism | 2 |
Most prompts probe solid mechanics and fluid dynamics; optics, thermodynamics, and magnetism are covered by fewer prompts.
PQSG Hierarchical Structure
FinePhyEval is tightly integrated with the PQSG system for localizing physical failures. Each video is annotated with a scene-graph of questions partitioned into:
- Object Existence (O₁, O₂, …): presence and attributes of scene objects.
- Action Verification (A₁, A₂, …): correct realization of prompt-specified verbs or events.
- Physics Plausibility (P₁, P₂, …): adherence to physical laws that the prompt implies but does not state explicitly.
Directed edges encode dependencies, e.g., Object → Action → Physics, and together form a directed acyclic graph: a descendant question is valid only if all of its ancestors are satisfied.
Example PQSG Entry (abridged)
```json
{
  "nodes": {
    "object_existence": {
      "O1": "Is there a glass beverage dispenser?",
      "O2": "Is a drinking glass present?",
      "O3": "Is orange juice visible at start?"
    },
    "action_verification": {
      "A1": "Does the dispenser release water?",
      "A2": "Does the water enter the glass?"
    },
    "physics": {
      "P1": "Does the water stream form a continuous column?",
      "P2": "Do splashes occur at the impact?"
    }
  },
  "edges": [
    {"from": ["O1"], "to": "A1"},
    {"from": ["O2", "A1"], "to": "A2"},
    {"from": ["A1"], "to": "P1"},
    {"from": ["A2"], "to": "P2"}
  ]
}
```
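An entry of this shape can be checked mechanically before it is used for scoring. The following is a minimal sketch, assuming the JSON layout shown in the abridged example; the helper names `parent_map` and `question_order` are hypothetical and not part of any released FinePhyEval tooling. It builds a parent map from the edge list and uses Python's standard `graphlib` module to confirm that the edges are acyclic.

```python
import json
from graphlib import TopologicalSorter, CycleError

# The abridged example entry, copied as-is.
ENTRY = json.loads("""
{
  "nodes": {
    "object_existence": {
      "O1": "Is there a glass beverage dispenser?",
      "O2": "Is a drinking glass present?",
      "O3": "Is orange juice visible at start?"
    },
    "action_verification": {
      "A1": "Does the dispenser release water?",
      "A2": "Does the water enter the glass?"
    },
    "physics": {
      "P1": "Does the water stream form a continuous column?",
      "P2": "Do splashes occur at the impact?"
    }
  },
  "edges": [
    {"from": ["O1"], "to": "A1"},
    {"from": ["O2", "A1"], "to": "A2"},
    {"from": ["A1"], "to": "P1"},
    {"from": ["A2"], "to": "P2"}
  ]
}
""")


def parent_map(entry):
    """Map every node ID to the set of its direct parents."""
    parents = {nid: set() for group in entry["nodes"].values() for nid in group}
    for edge in entry["edges"]:
        for src in edge["from"]:
            if src not in parents or edge["to"] not in parents:
                raise ValueError(f"edge references an unknown node: {edge}")
            parents[edge["to"]].add(src)
    return parents


def question_order(entry):
    """Return node IDs with parents before children; raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(parent_map(entry)).static_order())
    except CycleError as exc:
        raise ValueError("PQSG edges contain a cycle") from exc


order = question_order(ENTRY)
print(order)  # parents always precede their children; ties may appear in any order
```

A parents-first ordering is convenient because any later step that gates a question on its ancestors can process the list front to back.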
Human Annotation Schema
Each video receives Likert ratings for:
- Object fidelity
- Action accuracy
- Physics plausibility
- Overall prompt alignment
1600 scores are collected across the 195 videos and 4 categories. Inter-annotator agreement is high (ICC = 0.84 on a 50-video subset, Krippendorff’s α = 0.59).
| video_id | object_score | action_score | physics_score | overall_score |
|---|---|---|---|---|
| sora2_001 | 5 | 4 | 4 | 4 |
| veo3_001 | 5 | 4 | 3 | 4 |
| wan2.1_001 | 4 | 2 | 1 | 2 |
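Per-video rows like these have to be derived from several judges' ratings. The sketch below shows one plausible way to do that, assuming a simple arithmetic mean per video and category; the source does not state how judge ratings are combined, and the judge IDs and rating values here are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

CATEGORIES = ("object", "action", "physics", "overall")


def aggregate(ratings):
    """Average per-judge Likert ratings into one score per video and category.

    `ratings` is an iterable of (video_id, judge_id, {category: score}) tuples.
    Averaging is an assumption of this sketch, not a documented protocol.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for video_id, _judge_id, scores in ratings:
        for cat in CATEGORIES:
            score = scores[cat]
            if not 1 <= score <= 5:
                raise ValueError(f"Likert score out of range: {score}")
            buckets[video_id][cat].append(score)
    return {
        vid: {cat: mean(vals) for cat, vals in cats.items()}
        for vid, cats in buckets.items()
    }


# Invented ratings from two hypothetical judges for one video.
ratings = [
    ("wan2.1_001", "judge_a", {"object": 4, "action": 2, "physics": 1, "overall": 2}),
    ("wan2.1_001", "judge_b", {"object": 5, "action": 2, "physics": 2, "overall": 2}),
]
print(aggregate(ratings))
```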
4. Evaluation Metrics
Fine-grained PQSG Scoring
- Overall PQSG score:
- Category PQSG score: e.g., Physics = (number of "yes" on physics nodes) / (physics nodes with all parents "yes")
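The category score can be computed directly from a set of binary answers. The sketch below is one reading of the definition above: a node counts toward the denominator only if every parent is satisfied, and a parent is treated as satisfied only if it was answered "yes" and its own parents are satisfied (a transitive interpretation that is an assumption here). The parent map mirrors the abridged example entry, and the answers are invented.

```python
# Parent map taken from the edges of the abridged example entry.
PARENTS = {
    "O1": [], "O2": [], "O3": [],
    "A1": ["O1"], "A2": ["O2", "A1"],
    "P1": ["A1"], "P2": ["A2"],
}
# Node-ID prefix -> category name (naming convention assumed for this sketch).
CATEGORY = {"O": "object", "A": "action", "P": "physics"}


def satisfied(node, parents, answers, memo):
    """True if `node` was answered 'yes' and all of its ancestors are satisfied."""
    if node not in memo:
        memo[node] = bool(answers.get(node)) and all(
            satisfied(p, parents, answers, memo) for p in parents[node]
        )
    return memo[node]


def category_scores(parents, answers):
    """Return {category: yes_count / valid_count}.

    A node is valid when all of its parents are satisfied. Categories with no
    valid node are omitted rather than scored as zero.
    """
    memo, totals = {}, {}
    for node in parents:
        if all(satisfied(p, parents, answers, memo) for p in parents[node]):
            yes_and_valid = totals.setdefault(CATEGORY[node[0]], [0, 0])
            yes_and_valid[0] += bool(answers.get(node))
            yes_and_valid[1] += 1
    return {cat: yes / valid for cat, (yes, valid) in totals.items()}


# Invented answers: A2 fails, so P2 is excluded from the physics denominator.
answers = {"O1": True, "O2": True, "O3": False,
           "A1": True, "A2": False, "P1": True, "P2": True}
print(category_scores(PARENTS, answers))
```

With these answers, P2 does not count for or against the physics score because its parent A2 was answered "no", which is the localization behaviour the hierarchy is meant to provide.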
Correlation to Human Judgment
- PQSG (human QA answers): Pearson’s
- PQSG (GPT-5.5 QA): Pearson’s
- Baseline video evaluation metrics, e.g., VideoScore, VideoPhy-2, PhyGenEval, achieve
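For readers who want to reproduce this kind of comparison on their own data, Pearson's correlation between automated scores and human ratings can be computed as follows. This is a generic sketch of the standard formula; the paired values are invented placeholders and do not correspond to any reported FinePhyEval result.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Placeholder data: automated physics scores in [0, 1] and mean human Likert ratings.
pqsg_physics = [1.0, 0.5, 0.0, 0.75]
human_physics = [4.5, 3.0, 1.5, 4.0]
print(round(pearson_r(pqsg_physics, human_physics), 3))
```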
Subtask Metrics
- QG precision and recall: Gemini-2.5-Pro, 95.2%; GPT-5.5, precision 92.0% and recall 99.6%
- QA accuracy (GPT-5.5): Object 88.4%, Action 63.4%, Physics 64.6%
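The subtask metrics can be sketched as follows, under a strong simplifying assumption: generated and reference questions have already been matched and reduced to shared identifiers, so that set intersection stands in for whatever matching procedure the authors used (the source does not describe it). The function names and sample inputs are hypothetical.

```python
def precision_recall(generated, reference):
    """QG precision and recall over already-matched question identifiers."""
    generated, reference = set(generated), set(reference)
    matched = generated & reference
    precision = len(matched) / len(generated) if generated else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


def qa_accuracy(predicted, human):
    """Share of commonly answered questions on which the binary answers agree."""
    shared = predicted.keys() & human.keys()
    if not shared:
        raise ValueError("no questions in common")
    return sum(predicted[q] == human[q] for q in shared) / len(shared)


# Invented inputs: one spurious generated question, one QA disagreement.
print(precision_recall(["O1", "A1", "P1", "P9"], ["O1", "A1", "P1"]))
print(qa_accuracy({"O1": True, "A1": True, "P1": True},
                  {"O1": True, "A1": False, "P1": True}))
```

Computing QA accuracy separately per category (object, action, physics) only requires filtering the question IDs before calling `qa_accuracy`.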
5. Integration with PQSG Pipeline
FinePhyEval is the primary evaluation ground for the PQSG pipeline. Its utilization includes:
- Providing paired prompt–generated videos for PQSG’s QG and QA stages.
- Supplying human-annotated PQSG question graphs and binary answers as evaluation upper bounds or ground truth.
- Validating the logical dependency structure encoded in PQSG, where, for example, an object’s non-existence (e.g., “no” for O₁) propagates to invalidate further action or physics queries.
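The propagation rule in the last item can be illustrated with a small traversal. This is a sketch under assumptions, not the pipeline's actual implementation: questions are visited parents-first, and a question is skipped (recorded as `None`) whenever any parent was answered "no" or was itself skipped. The `ask` callable stands in for a QA model or a human annotator.

```python
from graphlib import TopologicalSorter

# Parent map taken from the edges of the abridged example entry.
PARENTS = {
    "O1": [], "O2": [], "O3": [],
    "A1": ["O1"], "A2": ["O2", "A1"],
    "P1": ["A1"], "P2": ["A2"],
}


def answer_with_gating(parents, ask):
    """Ask questions parents-first and skip those with an unsatisfied parent.

    `ask` is a caller-supplied callable mapping a question ID to a boolean.
    Returns {question_id: True | False | None}, where None means "skipped".
    """
    answers = {}
    for node in TopologicalSorter(parents).static_order():
        if all(answers[p] is True for p in parents[node]):
            answers[node] = bool(ask(node))
        else:
            answers[node] = None  # invalidated by an ancestor
    return answers


asked = []


def fake_answerer(question_id):
    """Stand-in for a QA model: says 'no' to O1 and 'yes' to everything else."""
    asked.append(question_id)
    return question_id != "O1"


result = answer_with_gating(PARENTS, fake_answerer)
print(result)
print(sorted(asked))  # only the questions that were actually posed
```

In this run the "no" on O1 invalidates A1 and, through it, A2, P1, and P2, so only the three object questions are posed.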
PQSG question generation proceeds by prompting a VLM with task instructions, a fully annotated exemplar, and the new prompt; the output is a PQSG hierarchy in JSON. Edge dependencies are encoded as a directed acyclic graph that follows the Object → Action → Physics ordering.
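The three-part prompt and the JSON output can be sketched without committing to any particular VLM API. Everything below is illustrative: the prompt wording, the function names, and the validation rules are assumptions, and the required group names are simply those that appear in the abridged example entry. The actual model call is left to the caller.

```python
import json

# Group names as they appear in the abridged example entry.
REQUIRED_GROUPS = ("object_existence", "action_verification", "physics")


def build_qg_prompt(instructions, exemplar_prompt, exemplar_graph, new_prompt):
    """Assemble task instructions, one annotated exemplar, and the new prompt."""
    return "\n\n".join([
        instructions,
        "Example prompt:\n" + exemplar_prompt,
        "Example PQSG:\n" + json.dumps(exemplar_graph, indent=2),
        "New prompt:\n" + new_prompt,
        "Return only the PQSG for the new prompt as JSON.",
    ])


def parse_qg_output(text):
    """Parse the model's reply and check that it has the expected top-level shape."""
    graph = json.loads(text)
    if not isinstance(graph, dict) or not isinstance(graph.get("nodes"), dict):
        raise ValueError("expected a JSON object with a 'nodes' object")
    missing = [g for g in REQUIRED_GROUPS if g not in graph["nodes"]]
    if missing or "edges" not in graph:
        raise ValueError(f"incomplete PQSG: missing groups {missing} or 'edges'")
    return graph


exemplar = {
    "nodes": {
        "object_existence": {"O1": "Is a drinking glass present?"},
        "action_verification": {},
        "physics": {},
    },
    "edges": [],
}
prompt_text = build_qg_prompt(
    "Write a PQSG for the prompt.",          # placeholder instructions
    "A glass stands on a table.",            # placeholder exemplar prompt
    exemplar,
    "A ball rolls off a table.",             # placeholder new prompt
)
print(prompt_text)
```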
6. Usage Protocols and Limitations
Researchers evaluating new text-to-video models with FinePhyEval should:
- Generate videos for each of the 65 prompts under matched rendering conditions.
- Apply PQSG QG and QA once per prompt-video pair (optionally averaging QG over multiple runs).
- Report the overall PQSG score together with per-category (object, action, physics) scores.
- Collect a human QA subset to calibrate automated QA performance.
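The reporting step in this protocol amounts to pooling per-video scores by model. A minimal sketch, assuming each prompt-video pair has already been reduced to a dictionary of category scores; the model names and values are placeholders, and averaging across videos is an assumption rather than a documented rule.

```python
from collections import defaultdict
from statistics import fmean


def summarize(results):
    """Average per-video category scores into one row per model.

    `results` is an iterable of (model_name, {category: score}) pairs, one per
    prompt-video pair. A category missing for a video is left out of that mean.
    """
    pooled = defaultdict(lambda: defaultdict(list))
    for model, scores in results:
        for category, value in scores.items():
            pooled[model][category].append(value)
    return {
        model: {category: fmean(values) for category, values in cats.items()}
        for model, cats in pooled.items()
    }


# Placeholder scores for two hypothetical models.
results = [
    ("model_a", {"object": 1.0, "physics": 0.5}),
    ("model_a", {"object": 0.5}),
    ("model_b", {"object": 0.0, "physics": 1.0}),
]
print(summarize(results))
```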
PQSG evaluates only the interactions and phenomena entailed by the prompt; "free-form" or spontaneous physics beyond the prompt is not evaluated. Automated VLM-based QA achieves high accuracy on object questions but is less reliable on action and physics questions, and it exhibits known answer biases (e.g., "yes-bias"). For maximal reliability, PQSG evaluation should be supplemented with human QA judgments. The pipeline also depends on a VLM for both QG and QA: although it is agnostic to which VLM is used, open-source VLMs can lag proprietary systems.
7. Significance and Applications
FinePhyEval addresses the gap in empirical, interpretable benchmarks for physical law adherence in generative video systems. By correlating automated multi-level PQSG scores with expert judgments, it supports both direct model comparison and sub-task-level diagnostics. The dataset enables the reproducible, scalable evaluation of prompt fidelity and physical correctness, contributing significant infrastructure to the assessment and development of physics-aware generative video models (Pothiraj et al., 24 Jun 2026).