FinePhyEval: Fine-Grained Physical Reasoning Benchmark
- FinePhyEval is a suite of benchmarks that rigorously assesses AI's physical reasoning using detailed, decomposable evaluations of object existence, actions, and physical laws.
- It employs multimodal assessments from text, video, and interactive environments through PQSG, PhyX, and DeepPHY to diagnose errors in physics compliance.
- The framework provides granular diagnostic signals that reveal AI shortcomings in visual perception, action planning, and adherence to fundamental physical principles.
FinePhyEval refers to a class of fine-grained, human- and model-centric physical reasoning benchmarks for evaluating artificial intelligence systems, particularly large language models (LLMs) and vision-language models (VLMs), on their ability to understand, reason, and act according to the laws of physics. The FinePhyEval paradigm underpins both static (text/video/image) and dynamic (interactive, agentic) evaluation, providing granular diagnostic signals that coarse-grained traditional metrics cannot supply. Contemporary instantiations include the PQSG-based video plausibility benchmark (Pothiraj et al., 24 Jun 2026), the large-scale multimodal physics Q&A corpus PhyX (Shen et al., 21 May 2025), and the simulated environment suite DeepPHY (Xu et al., 7 Aug 2025).
1. Motivation and Conceptual Framework
FinePhyEval emerged due to systematic failures of text-to-video, vision-language, and multimodal models on basic physical reasoning, even as these models excel at surface-level realism or general science tasks. Conventional metrics, such as FID or CLIPScore for video generation and simple overall accuracy for QA, are limited: they aggregate disparate errors into a single undifferentiated score and fail to localize the source of a model’s failure. Notably, superficially plausible outputs (e.g., a dissolving paper) may still be rated highly despite violating physics (Pothiraj et al., 24 Jun 2026). The primary aim of FinePhyEval is to deliver explicit, decomposable evaluation signals that separately target object existence, action validity, and adherence to physical laws across a range of scenarios (static visual scenes, dynamic video, and interactive control).
A central methodological unifier is the hierarchical analysis of physical plausibility. FinePhyEval frameworks leverage scene decomposition (objects, actions, physics), dependency-aware question graphs, or task-specific reward structures to annotate multidimensional model behaviors.
2. Dataset Construction and Properties
FinePhyEval datasets are explicitly designed for fine-grained and multimodal probing. The PQSG-based FinePhyEval dataset (Pothiraj et al., 24 Jun 2026) consists of 65 prompts derived from the Physics-IQ corpus, spanning solid mechanics, fluid dynamics, optics, thermodynamics, and magnetism. Each prompt is rendered into video form by three state-of-the-art models—OpenAI Sora 2, Google Veo 3, Wan 2.1—yielding a 195-video collection. Every prompt-video pair is annotated for object existence, action correctness, physics plausibility, and overall alignment using a 1–5 Likert scale. Human reliability is quantified via inter-annotator agreement metrics: ICC = 0.84 (“excellent”) and Krippendorff’s α = 0.59 (“moderate”).
PhyX (Shen et al., 21 May 2025), a large-scale multimodal benchmark (also referenced as FinePhyEval), includes 3,000 image-grounded physics questions distributed evenly across mechanics, electromagnetism, thermodynamics, wave–acoustics, optics, and modern physics. Each question is assigned to one of six reasoning types: physical model grounding, spatial relation, multi-formula, implicit condition, numerical, and predictive. Instances cover both multiple-choice and open-ended configurations, and every problem is quality-controlled through manual Ph.D.-level review, removal of redundant text, and deduplication.
DeepPHY (Xu et al., 7 Aug 2025) operationalizes FinePhyEval in interactive, physics-rich simulated environments (e.g., PHYRE, Kinetix, Angry Birds). These environments probe perception, causal reasoning, iterative failure-driven adaptation, and precise action planning across six domains encompassing 2D block puzzles, billiards, sequential block removal, and more.
| Benchmark | Modality | Physical Domains | Granular Annotations |
|---|---|---|---|
| PQSG-FinePhyEval (Pothiraj et al., 24 Jun 2026) | Video/text | Mechanics, fluids, optics, thermo, magnetism | Object/action/physics Likert, DAG QA |
| PhyX (Shen et al., 21 May 2025) | Image/Q&A | Six university-level physics branches | 6 reasoning types, multimodal MC/Open-ended |
| DeepPHY (Xu et al., 7 Aug 2025) | Interactive/agent | Six simulated physical game environments | Reward, task-specific signals, POMDP data |
3. Evaluation Pipelines and Scoring Metrics
3.1 Hierarchical Question-Based Evaluation
The Physics Question Scene Graph (PQSG) pipeline (Pothiraj et al., 24 Jun 2026) underlies the video-based FinePhyEval. It operates in three stages:
- Question Generation (QG): A vision-LLM constructs a directed acyclic graph (DAG) of atomic verification questions reflecting object existence (O), action correctness (A), and physics plausibility (P). Dependencies enforce that actions can only be validated if objects exist, and physical checks proceed only if relevant actions occur.
- Question Answering (QA): Each node in the question graph is answered (yes/no) by a VLM or human based on the rendered video, with parent “no” responses auto-propagating “no” to all descendants.
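- Aggregation: Binary per-node answers $a_q \in \{0, 1\}$ are collapsed into per-category scores:
$$S_c = \frac{1}{|Q_c|} \sum_{q \in Q_c} a_q, \qquad c \in \{O, A, P\},$$
where $Q_c$ is the set of questions in category $c$, with $S_{\text{overall}} = \frac{1}{3}\left(S_O + S_A + S_P\right)$ as the unweighted average across categories.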
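The three stages above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the question graph, the question identifiers, and the function names `propagate` and `aggregate` are all hypothetical, and the answers are supplied by hand instead of coming from a VLM.

```python
from collections import defaultdict

# Hypothetical question graph: id -> (category, parent ids).
# Categories follow the text: O = object, A = action, P = physics.
QUESTIONS = {
    "q1": ("O", []),            # e.g. "Is there a ball?"
    "q2": ("O", []),            # e.g. "Is there a ramp?"
    "q3": ("A", ["q1", "q2"]),  # e.g. "Does the ball roll down the ramp?"
    "q4": ("P", ["q3"]),        # e.g. "Does the ball speed up as it descends?"
}


def propagate(questions, raw_answers):
    """Force a node to 'no' (False) whenever any of its parents resolved to 'no'."""
    resolved = {}

    def resolve(qid):
        if qid not in resolved:
            _, parents = questions[qid]
            # Resolve every parent first; acyclicity guarantees termination.
            parents_ok = all([resolve(p) for p in parents])
            resolved[qid] = parents_ok and raw_answers[qid]
        return resolved[qid]

    for qid in questions:
        resolve(qid)
    return resolved


def aggregate(questions, resolved):
    """Average binary answers per category, then average the categories."""
    per_category = defaultdict(list)
    for qid, (category, _) in questions.items():
        per_category[category].append(1.0 if resolved[qid] else 0.0)
    scores = {c: sum(v) / len(v) for c, v in per_category.items()}
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores


# Hand-written answers: the ramp is missing, so q3 and q4 are forced to "no"
# even though the raw answers to them were "yes".
raw = {"q1": True, "q2": False, "q3": True, "q4": True}
resolved = propagate(QUESTIONS, raw)
scores = aggregate(QUESTIONS, resolved)
print(scores)  # O = 0.5, A = 0.0, P = 0.0, overall = 1/6
```

The example shows why the dependency structure matters: a single missing object zeroes the action and physics scores, so the failure is attributed to object existence and not to physics.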
Correlation with human judgment is assessed via Pearson's $r$, Spearman's $\rho$, and Kendall's $\tau$.
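For readers who want to reproduce this kind of comparison, the sketch below computes the three coefficients in plain Python. The score lists are made-up illustrative values, not benchmark data, and the tie handling (average ranks for Spearman, the tau-b variant for Kendall) is an assumption, since the source does not say which variants were used.

```python
import math


def pearson(x, y):
    """Pearson's r: covariance normalised by the two standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to neither factor
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )


# Illustrative values only: an automatic score per video and a human Likert rating.
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1, 3, 2, 5]
print(pearson(metric_scores, human_ratings))
print(spearman(metric_scores, human_ratings))
print(kendall_tau_b(metric_scores, human_ratings))
```

Because the two illustrative lists are ordered identically, the rank-based coefficients reach 1 while Pearson's $r$ stays below 1, which is one reason to report all three.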
3.2 Fine-Grained Physics QA (PhyX)
PhyX (Shen et al., 21 May 2025) models are evaluated under a standardized chain-of-thought prompt, followed by rule-based extraction of final answers and LLM (DeepSeek-V3) adjudication for open-ended problems. Metrics include:
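- Overall accuracy: $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$ over all $N$ questions.
- Macro-domain average: $\mathrm{Acc}_{\text{macro}} = \frac{1}{D} \sum_{d=1}^{D} \mathrm{Acc}_d$, with $\mathrm{Acc}_d$ the accuracy computed per domain.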
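The two metrics differ only in where the averaging happens, which a short sketch makes concrete. The record layout (`domain`, `correct`) and the function names are hypothetical, and the records are invented for illustration; correctness is assumed to have already been decided by the extraction and adjudication steps described above.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of all questions answered correctly (every question weighted equally)."""
    return sum(r["correct"] for r in records) / len(records)


def macro_domain_accuracy(records):
    """Mean of per-domain accuracies (every domain weighted equally)."""
    by_domain = defaultdict(list)
    for r in records:
        by_domain[r["domain"]].append(r["correct"])
    per_domain = {d: sum(v) / len(v) for d, v in by_domain.items()}
    return sum(per_domain.values()) / len(per_domain), per_domain


# Invented records with deliberately unequal domain sizes.
records = [
    {"domain": "mechanics", "correct": True},
    {"domain": "mechanics", "correct": True},
    {"domain": "mechanics", "correct": True},
    {"domain": "mechanics", "correct": False},
    {"domain": "optics", "correct": False},
]

micro = overall_accuracy(records)
macro, per_domain = macro_domain_accuracy(records)
print(micro, macro, per_domain)
```

When every domain contributes the same number of questions, the two values coincide; they diverge only when a split or subset is unbalanced, as in the invented records here.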
3.3 Agentic Task Scoring (DeepPHY)
In DeepPHY (Xu et al., 7 Aug 2025), the agent's proficiency is assessed through success rate, pass@K (first success by Kth trial), average attempts, and domain-specific metrics (e.g., mean star count, distance error). For instance:
$$\text{Success Rate} = \frac{1}{N} \sum_{i=1}^{N} r_i,$$
where $r_i \in \{0, 1\}$ is the binary reward per trial.
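The agentic metrics named above can be computed from a per-task log of binary rewards. The sketch below is an assumption-laden illustration: the log format (one list of 0/1 rewards per task, in attempt order) and the function names are hypothetical, pass@K is taken here to mean "solved within the first K attempts", and average attempts is computed over solved tasks only, which may differ from the benchmark's own convention.

```python
def success_rate(rewards):
    """Mean binary reward over a flat list of trials."""
    return sum(rewards) / len(rewards)


def pass_at_k(rewards_per_task, k):
    """Fraction of tasks with at least one success among the first k attempts."""
    solved = sum(1 for attempts in rewards_per_task if any(attempts[:k]))
    return solved / len(rewards_per_task)


def average_attempts(rewards_per_task):
    """Mean 1-based index of the first success, over solved tasks only (assumption)."""
    first_successes = [
        attempts.index(1) + 1 for attempts in rewards_per_task if 1 in attempts
    ]
    return sum(first_successes) / len(first_successes)


# Invented log: four tasks, three attempts each (1 = success, 0 = failure).
log = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
]

print(pass_at_k(log, 1))      # 0.25
print(pass_at_k(log, 2))      # 0.5
print(pass_at_k(log, 3))      # 0.75
print(average_attempts(log))  # 2.0
```

Reporting pass@K for several values of K separates agents that succeed immediately from those that only succeed after learning from failed attempts.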
4. Model Performance, Reliability, and Subtask Analysis
Empirical results consistently demonstrate that even the strongest foundation models underperform humans in fine-grained physical reasoning:
- PQSG-FinePhyEval (Pothiraj et al., 24 Jun 2026): On 195 videos, PQSG scores correlated most strongly with human judgments when GPT-5.5 or Gemini-2.5-Pro served as the QA model, outperforming baselines such as VideoScore or Direct VQA. Model ranking of video generators reveals that closed-source Sora 2 and Veo 3 significantly surpass Wan 2.1 on action and physics plausibility, with the largest gap in physics scores.
- PhyX (Shen et al., 21 May 2025): On the Text-DeRedundancy Open-Ended split, best-performing human experts reach up to 78.9% accuracy, while GPT-4o achieves 32.5%, Claude-3.7-Sonnet 42.2%, and GPT-o4-mini 45.8%. The Multiple-Choice configuration increases raw model scores (best model 87%), but this inflates metrics by masking reasoning failures.
- DeepPHY (Xu et al., 7 Aug 2025): VLMs show pronounced struggles with continuous control, long-horizon planning, and rapid adaptation, with fine-grained per-task metrics highlighting deficits in collision physics, timing, and nuanced causal inference.
Subtask analysis in PQSG-FinePhyEval distinguishes between Question Generation (QG) and QA. Gemini-2.5-Pro and GPT-5.5 reach ≥92% precision/recall for QG, but QA remains a limiting factor (object QA ~88%, action/physics ~60–65%) compared to the human upper bound of 100%. This suggests VLMs excel in explicit decompositional reasoning but struggle with visual perception and physics abstraction under dynamic constraints.
5. Comparison with Prior and Alternative Benchmarks
Previous benchmarks such as UGPhysics, PHYBench, and OlympiadBench are primarily text-based and lack joint visual–symbolic reasoning (Shen et al., 21 May 2025). Baselines relying on pre-trained image/video similarity metrics (e.g., CLIPScore, FID) do not distinguish physically plausible but visually nonconforming outputs or vice versa (Pothiraj et al., 24 Jun 2026). Classical agentic benchmarks, such as Atari or GUI environments, over-simplify physics or focus on abstract rule-following, not real-world mechanics (Xu et al., 7 Aug 2025).
FinePhyEval benchmarks overcome these deficiencies through:
- Explicit annotation of physical law violations;
- Hierarchical evaluation that distinguishes object, action, and physics errors;
- Multi-modality (text, image, video, direct environment interaction);
- Subtask-level reporting (e.g., QG, QA, open/closed format, CoT rationale).
A plausible implication is that FinePhyEval can guide development not only of more physically robust generative models, but also of architectures that unify symbolic, perceptual, and causal reasoning.
6. Reproducibility and Best Practices
Datasets and evaluation scripts for FinePhyEval have adopted industry-standard pipelines. PhyX and DeepPHY release code and data under permissive licenses with one-click integration into VLMEvalKit, standardizing the chain-of-thought inference, answer extraction, and domain-wise reporting (Shen et al., 21 May 2025, Xu et al., 7 Aug 2025).
Recommended practices include:
- Overlaying object annotations to decouple low-level vision from reasoning;
- Enforcing discrete, structured action/observation formats;
- Keeping histories of environment-agent interactions in agentic tasks;
- Applying low-temperature inference for reproducibility;
- Domain-specific visualization to expose failure patterns.
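Two of these practices, structured action formats and interaction histories, can be illustrated with a small sketch. Everything in it is hypothetical: the action names, the JSON layout, and the `InteractionHistory` class are invented for illustration and are not part of the DeepPHY or VLMEvalKit code.

```python
import json
from collections import deque

# Hypothetical discrete action vocabulary for a block-removal style task.
ALLOWED_ACTIONS = {"place_ball", "remove_block", "wait"}


def parse_action(model_output):
    """Return a validated action dict, or None if the output is malformed."""
    try:
        action = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    name = action.get("name")
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        return None
    return action


class InteractionHistory:
    """Keeps the most recent attempts so they can be replayed in the next prompt."""

    def __init__(self, max_turns=5):
        self.turns = deque(maxlen=max_turns)

    def record(self, action, outcome):
        self.turns.append({"action": action, "outcome": outcome})

    def as_prompt(self):
        return "\n".join(
            f"Attempt {i}: {json.dumps(t['action'])} -> {t['outcome']}"
            for i, t in enumerate(self.turns, 1)
        )


history = InteractionHistory(max_turns=2)
for output, outcome in [
    ('{"name": "remove_block", "id": 3}', "tower collapsed"),
    ('{"name": "remove_block", "id": 7}', "tower stable"),
    ('{"name": "wait"}', "no change"),
]:
    action = parse_action(output)
    if action is not None:
        history.record(action, outcome)

print(history.as_prompt())
```

Rejecting malformed output at the parsing step keeps formatting errors from being counted as physics errors, and the bounded history keeps the prompt length predictable across long episodes.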
7. Impact and Future Directions
FinePhyEval benchmarks—across textual, visual, and agentic modalities—constitute a critical foundation for diagnosing and improving model grounding in the real world. They make explicit the multifaceted failures of current models in physical understanding. High model error rates in fine-grained diagnostics, especially regarding causality and implicit conditions, highlight the need for advances in integrated scene understanding, improved grounding between perception and symbolic knowledge, and new architectures for coherent multi-step reasoning.
Continuous extension—in the form of new physical environments, larger prompt diversity, and finer-grained annotation schemas—remains essential for closing the gap between model and human reasoning in the physical sciences (Pothiraj et al., 24 Jun 2026, Shen et al., 21 May 2025, Xu et al., 7 Aug 2025).