FinePhyEval Dataset

Updated 27 June 2026

FinePhyEval is a benchmark dataset designed to assess the physical plausibility of text-to-video outputs using standardized, physics-based prompts.
The dataset employs a hierarchical PQSG pipeline to evaluate object fidelity, action accuracy, and physics plausibility with detailed human and automated annotations.
Covering phenomena such as gravity, fluid dynamics, optics, and thermodynamics, FinePhyEval facilitates granular model comparisons and sub-task diagnostics.

FinePhyEval is a benchmark dataset designed to facilitate fine-grained, reproducible evaluation of physical plausibility in text-to-video generation, especially for scenarios that challenge video models’ understanding of basic physical laws. Introduced to validate and support the Physics Question Scene Graph (PQSG) hierarchical evaluation pipeline, FinePhyEval provides a curated set of physics-based prompts, generated videos from multiple state-of-the-art models, and extensive human and machine annotations. Source prompts are derived from the Physics-IQ benchmark and span phenomena such as gravity, fluid dynamics, optics, and thermodynamics. Each prompt is paired with corresponding video outputs from closed-source (Sora 2, Veo 3) and open-source (Wan 2.1) models, enabling direct comparison of physical realism, object fidelity, and action faithfulness under controlled evaluation settings (Pothiraj et al., 24 Jun 2026).

1. Design and Scope

FinePhyEval was constructed to address the scarcity of granular, localizable evaluation resources for generated video content, particularly for determining whether outputs obey physical laws and faithfully render the objects and interactions specified in the prompt. Its main goals are:

To benchmark diverse text-to-video models on physics-heavy tasks using standardized, physics-challenging prompts.
To enable both coarse Likert-scale scoring and fine-grained, graph-based analysis of video outputs.
To rigorously measure alignment between automated PQSG metrics and human judgment for object, action, and physics dimensions.
To provide a basis for sub-task evaluation of PQSG stages, specifically Question Generation (QG) and Question Answering (QA).

Key dataset statistics:

Attribute	Value
Physics-based prompts	65 (from Physics-IQ, Motamed et al., 2025)
Generated videos	195 (65 prompts × 3 models)
Video models	Sora 2, Veo 3, Wan 2.1
Avg. video length	4.39 s
Resolutions / fps	Sora 2: 720×1080 @ 30; Veo 3: 1280×720 @ 24; Wan 2.1: 1280×720 @ 16

2. Construction Methodology

Prompt Selection

All physics prompts are sourced verbatim from the Physics-IQ benchmark and are explicitly constructed to elicit interactions that probe models’ understanding of gravity, collisions, fluid dynamics, optics, thermodynamics, and magnetism.

Video Generation

For each prompt, three videos are generated using distinct text-to-video models: Sora 2 (OpenAI), Veo 3 (Google), and Wan 2.1 (open-source). Cosmos-Predict2.5-14B videos were also produced for exploratory analyses but are not part of the core annotated set.

Annotation Protocol

Annotations are produced by eight non-author human judges per video, using strict protocols:

Likert-scale judgments (1–5) for four categories: object fidelity, action accuracy, physics plausibility, and overall alignment.
No partial credit for hallucinated or missing objects/actions.
Physics violations assessed independently of text-prompt alignment.
For 20 prompts, expert annotators hand-write ground-truth PQSG question sets.
For a 30-prompt/video subset, human binary QA answers are collected, totaling 444 QA judgments.

3. Dataset Content and Structure

Physical Phenomenon Coverage

Phenomenon	# Prompts
Solid Mechanics	38
Fluid Dynamics	15
Optics	8
Thermodynamics	3
Magnetism	2

Most prompts probe solid mechanics and fluid dynamics; optics, thermodynamics, and magnetism are covered by fewer prompts.

PQSG Hierarchical Structure

FinePhyEval is tightly integrated with the PQSG system for localizing physical failures. Each video is annotated with a scene-graph of questions partitioned into:

Object Existence (O₁, O₂, …): presence and attributes of scene objects.
Action Verification (A₁, A₂, …): correct realization of prompt-specified verbs or events.
Physics Plausibility (P₁, P₂, …): adherence to physical laws that the prompt implies but does not state.

Directed edges encode dependencies (e.g., Object → Action → Physics) and form a directed acyclic graph, so a descendant question is valid only if its ancestors are satisfied.

Example PQSG Entry (abridged)

{
  "nodes": {
    "object_existence": {
      "O1":"Is there a glass beverage dispenser?",
      "O2":"Is a drinking glass present?",
      "O3":"Is orange juice visible at start?"
    },
    "action_verification": {
      "A1":"Does the dispenser release water?",
      "A2":"Does the water enter the glass?"
    },
    "physics": {
      "P1":"Does the water stream form a continuous column?",
      "P2":"Do splashes occur at the impact?"
    }
  },
  "edges": [
    {"from":["O1"],"to":"A1"},
    {"from":["O2","A1"],"to":"A2"},
    {"from":["A1"],"to":"P1"},
    {"from":["A2"],"to":"P2"}
  ]
}
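To illustrate how the dependency edges gate questions, here is a minimal Python sketch that treats a question as valid only when every parent is itself valid and was answered "yes". The function name `valid_questions`, the answer dictionary, and the specific yes/no answers are illustrative assumptions, not part of any released FinePhyEval tooling; the node IDs and edges are copied from the abridged entry above.

```python
from typing import Dict, List, Set

# Node IDs and edges copied from the abridged PQSG entry.
NODES = ["O1", "O2", "O3", "A1", "A2", "P1", "P2"]
EDGES = [
    {"from": ["O1"], "to": "A1"},
    {"from": ["O2", "A1"], "to": "A2"},
    {"from": ["A1"], "to": "P1"},
    {"from": ["A2"], "to": "P2"},
]


def valid_questions(nodes: List[str], edges: List[dict],
                    answers: Dict[str, bool]) -> Set[str]:
    """Return the questions whose ancestors were all answered 'yes'."""
    parents: Dict[str, List[str]] = {n: [] for n in nodes}
    for edge in edges:
        parents[edge["to"]].extend(edge["from"])

    memo: Dict[str, bool] = {}

    def is_valid(node: str) -> bool:
        # A node is valid if each parent is valid and answered True.
        if node not in memo:
            memo[node] = all(
                is_valid(p) and answers.get(p, False) for p in parents[node]
            )
        return memo[node]

    return {n for n in nodes if is_valid(n)}


# Hypothetical answers: the drinking glass (O2) is missing from the video.
ANSWERS = {"O1": True, "O2": False, "O3": True,
           "A1": True, "A2": True, "P1": True, "P2": False}

print(sorted(valid_questions(NODES, EDGES, ANSWERS)))
# A2 is invalid because O2 is "no"; P2 is invalid because A2 is invalid.
```

In this sketch a "no" on O2 removes A2 and, transitively, P2 from the set of valid questions, while P1 remains valid because it depends only on A1.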

Human Annotation Schema

Each video receives Likert ratings for:

Object fidelity
Action accuracy
Physics plausibility
Overall prompt alignment

A total of 1600 scores are collected across the 195 videos and 4 categories. Inter-annotator agreement is high by ICC (0.84 on a 50-video subset), while Krippendorff’s α is 0.59.

video_id	object_score	action_score	physics_score	overall_score
sora2_001	5	4	4	4
veo3_001	5	4	3	4
wan2.1_001	4	2	1	2

4. Evaluation Metrics

Fine-grained PQSG Scoring

Overall PQSG score: $(\sum_i \text{Yes}_i) / \text{Total valid questions}$
Category PQSG score, e.g., Physics: $(\text{number of “yes” answers on physics nodes}) / (\text{number of physics nodes whose parents are all “yes”})$
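The two scores can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a question counts as valid when all of its ancestors were answered "yes", "yes" answers are counted only among valid questions, and the category is read from the first letter of the question ID (O, A, or P). The function name `pqsg_scores` and the example answers are hypothetical.

```python
from typing import Dict, List, Optional


def pqsg_scores(parents: Dict[str, List[str]],
                answers: Dict[str, bool]) -> Dict[str, Optional[float]]:
    """Compute overall and per-category PQSG scores over valid questions."""
    memo: Dict[str, bool] = {}

    def valid(node: str) -> bool:
        # Valid means every parent is valid and was answered "yes".
        if node not in memo:
            memo[node] = all(
                valid(p) and answers.get(p, False)
                for p in parents.get(node, [])
            )
        return memo[node]

    # Assumption: IDs start with O, A, or P, as in the PQSG example.
    category_of = {"O": "object", "A": "action", "P": "physics"}
    counts = {k: [0, 0] for k in ("overall", "object", "action", "physics")}
    for node, answer in answers.items():
        if not valid(node):
            continue  # invalid questions are excluded from both counts
        for key in ("overall", category_of[node[0]]):
            counts[key][0] += int(answer)  # "yes" answers
            counts[key][1] += 1            # valid questions
    # None marks a category with no valid questions.
    return {k: (yes / total if total else None)
            for k, (yes, total) in counts.items()}


# Hypothetical graph and answers for one video.
parents = {"A1": ["O1"], "A2": ["O2", "A1"], "P1": ["A1"], "P2": ["A2"]}
answers = {"O1": True, "O2": True, "O3": True,
           "A1": True, "A2": False, "P1": True, "P2": True}
scores = pqsg_scores(parents, answers)
print(scores)
```

Here P2 is excluded because its parent A2 was answered "no", so six questions are valid and five of them are "yes".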

Correlation to Human Judgment

PQSG (human QA answers): Pearson’s $r\approx 0.80$
PQSG (GPT-5.5 QA): Pearson’s $r\approx 0.48$
Baseline video evaluation metrics, e.g., VideoScore, VideoPhy-2, PhyGenEval, achieve $r\in[0.27, 0.35]$
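For readers who want to reproduce this kind of comparison on their own scores, a minimal Python sketch of Pearson's $r$ follows. The function name `pearson_r` and the two short lists are illustrative assumptions; the numbers are placeholders, not values from FinePhyEval.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-video values: automated PQSG scores in [0, 1] and
# mean human Likert ratings in [1, 5]. These are made up for illustration.
pqsg = [0.2, 0.5, 0.8, 1.0]
human = [1.5, 3.0, 4.0, 4.5]
print(round(pearson_r(pqsg, human), 3))
```

Because Pearson's $r$ is invariant to linear rescaling, the 0 to 1 PQSG scale and the 1 to 5 Likert scale can be compared directly without normalization.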

Subtask Metrics

QG precision and recall: Gemini-2.5-Pro, 95.2%; GPT-5.5, precision 92.0% and recall 99.6%
QA accuracy (GPT-5.5): Object 88.4%, Action 63.4%, Physics 64.6%
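One way such QG precision and recall figures could be computed is sketched below. It assumes that generated questions have already been matched to the hand-written ground-truth questions (for example by a human judge), that precision is the fraction of generated questions with a match, and that recall is the fraction of ground-truth questions covered by at least one match. The source does not spell out the matching procedure, so the function `qg_precision_recall` and all IDs here are hypothetical.

```python
from typing import Dict, Set, Tuple


def qg_precision_recall(generated: Set[str], reference: Set[str],
                        matches: Dict[str, str]) -> Tuple[float, float]:
    """Precision and recall of generated questions against a reference set.

    `matches` maps a generated question ID to the reference question ID it
    was judged equivalent to; unmatched generated questions are absent.
    """
    if not generated or not reference:
        raise ValueError("both question sets must be non-empty")
    matched_generated = {g for g in matches if g in generated}
    covered_reference = {matches[g] for g in matched_generated
                         if matches[g] in reference}
    precision = len(matched_generated) / len(generated)
    recall = len(covered_reference) / len(reference)
    return precision, recall


# Hypothetical IDs: g3 duplicates r2, and g4 has no reference counterpart.
generated = {"g1", "g2", "g3", "g4"}
reference = {"r1", "r2", "r3"}
matches = {"g1": "r1", "g2": "r2", "g3": "r2"}
precision, recall = qg_precision_recall(generated, reference, matches)
print(precision, recall)
```

Counting distinct reference questions for recall keeps duplicate generated questions from inflating coverage.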

5. Integration with PQSG Pipeline

FinePhyEval is the primary evaluation ground for the PQSG pipeline. Its uses include:

Providing paired prompt–generated videos for PQSG’s QG and QA stages.
Supplying human-annotated PQSG question graphs and binary answers as evaluation upper bounds or ground truth.
Validating the logical dependency structure encoded in PQSG, where, for example, an object’s non-existence (e.g., “no” for O₁) propagates to invalidate further action or physics queries.

PQSG question generation proceeds by prompting a VLM with task instructions, a fully annotated exemplar, and the new prompt; the output is a PQSG hierarchy in JSON. The system encodes edge dependencies as a directed acyclic graph, e.g., $O_1 \rightarrow A_1 \rightarrow P_1$.
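Since the question graph is produced by a VLM, it can be useful to check the returned JSON before running QA. The following Python sketch assumes the JSON layout of the abridged PQSG entry (a `nodes` object grouped by category and an `edges` list with `from` and `to` fields) and verifies that every edge refers to a known question and that the graph has no cycle, using Kahn's topological sort. The function name `check_pqsg` and the sample questions are illustrative assumptions, not part of the PQSG pipeline itself.

```python
import json
from collections import deque
from typing import Dict, List


def check_pqsg(raw: str) -> List[str]:
    """Parse a PQSG JSON string and return its question IDs in an order
    where parents precede children. Raise ValueError if it is not a DAG."""
    graph = json.loads(raw)
    ids = [qid for group in graph["nodes"].values() for qid in group]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate question ID")

    indegree: Dict[str, int] = {q: 0 for q in ids}
    children: Dict[str, List[str]] = {q: [] for q in ids}
    for edge in graph["edges"]:
        if edge["to"] not in indegree:
            raise ValueError("edge targets an unknown question")
        for parent in edge["from"]:
            if parent not in indegree:
                raise ValueError("edge starts at an unknown question")
            children[parent].append(edge["to"])
            indegree[edge["to"]] += 1

    # Kahn's algorithm: repeatedly emit nodes with no unprocessed parents.
    queue = deque(q for q in ids if indegree[q] == 0)
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(ids):
        raise ValueError("dependency edges contain a cycle")
    return order


# Hypothetical VLM output with the chain O1 -> A1 -> P1.
sample = json.dumps({
    "nodes": {
        "object_existence": {"O1": "Is there a ball?"},
        "action_verification": {"A1": "Does the ball fall?"},
        "physics": {"P1": "Does the ball accelerate downward?"},
    },
    "edges": [{"from": ["O1"], "to": "A1"}, {"from": ["A1"], "to": "P1"}],
})
print(check_pqsg(sample))
```

The returned order is also a convenient order in which to ask the QA questions, since each question appears after all of its parents.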

6. Usage Protocols and Limitations

Researchers evaluating new text-to-video models with FinePhyEval are advised to:

Generate videos for each of the 65 prompts under matched rendering conditions.
Apply PQSG QG and QA once per prompt-video pair (optionally averaging QG over multiple runs).
Report the overall PQSG score together with per-category (object, action, physics) scores.
Collect a human QA subset to calibrate automated QA performance.

Limitations: PQSG focuses on interactions and phenomena entailed by the prompt, so “free-form” or spontaneous physics beyond the prompt is not evaluated. Automated VLM-based QA achieves high object accuracy but is less reliable in the action and physics categories, and it exhibits known answer biases (e.g., “yes-bias”). For maximal reliability, PQSG evaluation should be supplemented with human QA judgments. The pipeline is agnostic to the choice of VLM but depends on VLM quality, and open-source VLMs can lag behind proprietary systems.
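A simple way to calibrate automated QA against a human QA subset, and to look for a "yes-bias", is sketched below in Python. It assumes paired binary answers for the same questions and reports agreement, the "yes" rate of each source, and how often the model says "yes" where humans said "no". The function name `calibrate_qa` and the answer lists are hypothetical.

```python
from typing import Dict, List, Optional


def calibrate_qa(human: List[bool],
                 model: List[bool]) -> Dict[str, Optional[float]]:
    """Compare automated binary QA answers with human answers."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty answer lists of equal length")
    n = len(human)
    # Model answers on the questions that humans answered "no".
    model_on_human_no = [m for h, m in zip(human, model) if not h]
    return {
        "agreement": sum(h == m for h, m in zip(human, model)) / n,
        "human_yes_rate": sum(human) / n,
        "model_yes_rate": sum(model) / n,
        # Fraction of human "no" questions that the model answered "yes";
        # None if humans never answered "no".
        "false_yes_rate": (sum(model_on_human_no) / len(model_on_human_no)
                           if model_on_human_no else None),
    }


# Hypothetical paired answers for six questions.
human = [True, True, False, False, True, False]
model = [True, True, True, False, True, True]
report = calibrate_qa(human, model)
print(report)
```

A model "yes" rate well above the human "yes" rate, together with a high false-yes rate, would be one sign of the yes-bias mentioned above.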

7. Significance and Applications

FinePhyEval addresses the gap in empirical, interpretable benchmarks for physical law adherence in generative video systems. By correlating automated multi-level PQSG scores with expert judgments, it supports both direct model comparison and sub-task-level diagnostics. The dataset enables the reproducible, scalable evaluation of prompt fidelity and physical correctness, contributing significant infrastructure to the assessment and development of physics-aware generative video models (Pothiraj et al., 24 Jun 2026).


References (1)

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation (2026)
