Physics-IQ Benchmark Overview
- Physics-IQ Benchmark is a domain-specific framework that evaluates AI on video prediction, experiment continuation, and process modeling in physics.
- It covers varied regimes including intuitive physics, multimodal reasoning, and formal research workflows to capture nuanced aspects of physics competence.
- The benchmark refines prompt quality, artifact control, and scoring methods to ensure reliable evaluation of AI systems' physics reasoning.
“Physics-IQ Benchmark” has two related meanings in current AI-for-science literature. In the narrow sense, it refers to the Physics-IQ benchmark for video generative models and its audited successor, “Physics-IQ Verified”, which evaluate whether a model can predict the continuation of real physical experiments rather than merely generate visually plausible motion (Rädsch et al., 17 Jun 2026). In a broader informal sense, the phrase is used for benchmarks that try to measure physics reasoning or physical understanding in AI systems—open-ended problem solving, multimodal diagram interpretation, intuitive-physics judgment, intervention in physical environments, or research-oriented workflow execution—rather than simple fact recall or multiple-choice recognition (Siddique et al., 31 Jul 2025, Wang et al., 21 Jun 2025, Zhu et al., 30 Sep 2025).
1. Meaning and scope of the term
In the broad benchmark taxonomy, “Physics-IQ” does not denote a standardized psychometric IQ instrument. The literature instead uses the phrase as shorthand for domain-specific physics competence: symbolic and numerical derivation, principle selection, conceptual explanation, multimodal grounding, or research-style reasoning under unfamiliar constraints. PhysicsEval is described as a strong candidate for what someone informally means by a “Physics-IQ Benchmark,” while CritPt is described as a benchmark of “entry-level research intelligence in physics” rather than a general intelligence test (Siddique et al., 31 Jul 2025, Zhu et al., 30 Sep 2025).
This distinction matters because benchmark families in the area target different notions of competence. Some works emphasize textbook and exam problem solving; some emphasize diagram-dependent multimodal reasoning; some target intuitive object physics in possible-versus-impossible videos; some test agentic or research-oriented workflows; and some move toward process-level or formal verification. PhysUniBench explicitly frames itself as close to “Physics-IQ” for multimodal AI, while the “Towards a Large Physics Benchmark” proposal argues that a physics benchmark should evaluate not only correctness but also difficulty and surprise, because existing science benchmarks are too shallow and too accuracy-centric (Wang et al., 21 Jun 2025, Barman et al., 29 Jul 2025).
A useful synthesis is that the phrase names a benchmark objective rather than a single fixed dataset: evaluating whether AI systems can reason within physics as a structured scientific domain. In practice, that objective has produced several distinct benchmark paradigms rather than one universal standard.
2. Major benchmark regimes
The literature suggests four recurrent benchmark regimes, each emphasizing a different layer of physics competence (Feng et al., 26 Mar 2025, Wang et al., 21 Jun 2025, Bakhtin et al., 2019, Barman et al., 29 Jul 2025, Zhao et al., 3 Oct 2025, Li et al., 30 Oct 2025).
| Regime | Representative benchmarks | Primary target |
|---|---|---|
| Open-ended academic problem solving | PhysicsEval, PHYSICS, TPBench, ABench-Physics, SymPyBench | Derivation, symbolic/numerical reasoning, robustness |
| Multimodal academic reasoning | PhysUniBench, Multi-Physics, SeePhys, HiPhO | Diagram interpretation, physics exams, visual grounding |
| Intuitive and interactive physics | IntPhys, IntPhys 2, PHYRE, Physics-IQ | Plausibility, object permanence, intervention, world modeling |
| Research, process, and formal reasoning | Towards a Large Physics Benchmark, PRL-Bench, CritPt, PRISM-Physics, Lean4PHYS | Research workflows, process scoring, formal verification |
These regimes differ not only in task format but also in what counts as a valid score. Open-ended exam benchmarks often use symbolic checkers, numerical tolerances, or rubric-based LLM grading. Multimodal academic benchmarks typically mix answer accuracy with figure sensitivity or step-level reasoning analysis. Intuitive-physics benchmarks evaluate discrimination between possible and impossible events, surprise under prediction, or sample-efficient intervention. Research-oriented and formal benchmarks add expert scoring, coding workflows, DAG-based process evaluation, or theorem-proving success.
A plausible implication is that no single benchmark currently covers the entire space that “Physics-IQ” informally evokes. Instead, the field has converged on a portfolio view: different benchmarks test different strata of physical intelligence.
3. The original Physics-IQ benchmark and “Physics-IQ Verified”
The exact benchmark name Physics-IQ appears in work on video generative models. In that setting, the benchmark evaluates whether a model can predict the continuation of a real physical process from partial context by comparing generated continuations against real recordings of controlled experiments. The benchmark consists of 66 distinct physical experiments spanning solid dynamics, fluid dynamics, thermodynamics, optics, and magnetism. Each experiment is recorded from three viewing angles and repeated twice, giving 396 videos in total. Each video lasts 8 seconds, split into a 3-second conditioning segment and a 5-second continuation (Rädsch et al., 17 Jun 2026).
The original benchmark compares generated and reference continuations using four metrics: Spatial IoU, Spatiotemporal IoU, Weighted Spatial IoU, and MSE. Its philosophy is reference-based rather than distributional: the question is not whether a video “looks real” in aggregate, but whether it predicts the correct continuation of a specific experiment. That makes it a benchmark for conditional physical prediction, not generic video realism.
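As a rough illustration of reference-based scoring, the sketch below computes an intersection-over-union between two binary motion masks and a per-pixel MSE between two frames. The mask representation (nested lists of 0/1), the function names, and the empty-union convention are assumptions made for illustration; the benchmark's own metric definitions may differ in detail.

```python
def spatial_iou(mask_a, mask_b):
    """IoU between two equally sized binary masks given as nested lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


def mse(frame_a, frame_b):
    """Mean squared error between two equally sized grayscale frames."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(frame_a, frame_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


# Toy masks: 2 cells are active in both, 4 cells are active in at least one.
generated = [[0, 1, 1], [0, 0, 1]]
reference = [[0, 1, 0], [0, 1, 1]]
print(spatial_iou(generated, reference))
```

Both functions compare one generated continuation against one specific reference recording, which is the sense in which the evaluation is reference-based and not distributional.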
“Physics-IQ Verified” is an audit-and-repair of that benchmark. It identifies three measurement problems in the original version: prompt quality problems, ground-truth video artifacts and spurious metric activations, and aggregation/scoring flaws. The audit reports that 69 evaluation videos had unclear prompts, 59 had artifacts, and 20 had both. It states that the revised benchmark improves over 34.8% of prompts, refines 57.6% of all samples, and influences 29.8% of videos (Rädsch et al., 17 Jun 2026).
The revised benchmark introduces three corresponding fixes: prompt improvement, artifact cleaning via end_effect_frames and freeze_areas, and a sample-level scoring system that weights each sample and metric equally. In a comparison study over six image-to-video models (Wan 2.2, HunyuanV-1.5, Cosmos3-Nano, Sora 2, P-Video, and Grok Imagine Video), the revised protocol produces moderate but meaningful ranking changes, as measured by Kendall’s $\tau$ between the original and verified rankings. That result is methodologically significant: benchmark curation details can materially alter conclusions about which model is “more physical.”
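The two ideas in this paragraph, equal weighting per sample and metric, and rank agreement between two protocols, can be sketched as follows. This is a minimal illustration, assuming every metric has already been normalized to the same 0 to 1, higher-is-better scale; the model names and ranks are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations


def sample_level_score(per_sample_metrics):
    """Average the metrics within each sample, then average samples equally."""
    per_sample = [sum(m.values()) / len(m) for m in per_sample_metrics]
    return sum(per_sample) / len(per_sample)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items (no ties assumed)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks for four placeholder models.
original = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
verified = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
print(kendall_tau(original, verified))
```

A value of 1 would mean the two protocols rank every pair of models the same way; each swapped pair lowers the coefficient.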
Within the exact-name lineage, then, “Physics-IQ” refers to world-model evaluation for video generation, and “Physics-IQ Verified” argues that prompt clarity, artifact control, and per-sample normalization are essential if such a benchmark is to provide a reliable physical-understanding signal.
4. Open-ended physics problem-solving benchmarks
A large branch of the literature uses “Physics-IQ” informally for open-ended problem-solving benchmarks. PhysicsEval is the most explicit example. It contains 19,609 problems, applies a 90:10 train-test split into 17,647 train and 1,962 test examples, marks the format as open-ended, and includes both mathematical and descriptive problems. It spans 19 categories, covers CEE + COL + COMP knowledge levels, and scores outputs using a Physics Proficiency Score (PPS) built from six rubric dimensions: Mathematical Accuracy, Logical Consistency, Formulas and Principles, Completeness, Assumptions Made, and Clarity and Coherence. Averaged over six models, multi-agent review improves PPS from 66.20 to 68.23 on hard problems, while self-refinement sometimes degrades performance. The paper also notes serious caveats: reference solutions were expanded by Gemini 2.5 Pro, only a small sample was reviewed, grading is LLM-based, and contamination risk from public textbooks and websites is not deeply addressed (Siddique et al., 31 Jul 2025).
PHYSICS targets university-level open-ended problem solving from physics PhD qualifying exams. It contains 1,297 expert-annotated problems, of which 298 are multimodal, across six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Its evaluation uses regex-based boxed-answer extraction, SymPy equivalence checking, and GPT-4o fallback when symbolic verification fails. Even the best model, o3-mini, reaches only 59.9% accuracy, which the paper presents as evidence that advanced foundation models remain substantially limited on high-level scientific reasoning (Feng et al., 26 Mar 2025).
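The extraction step of such a pipeline can be sketched with the standard library alone. The function below pulls the last `\boxed{...}` answer out of a model response. It uses brace matching instead of a single regular expression so that nested LaTeX braces survive; that is a substitution made here for robustness and not necessarily how the paper implements it. The symbolic-equivalence and GPT-4o fallback stages are not shown, and the function name is hypothetical.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no extractable answer


response = r"Energy conservation gives \boxed{\frac{1}{2} m v^2}."
print(extract_boxed(response))
```

Returning `None` for a missing or malformed box gives a downstream checker a clear signal to fall back to another grading stage.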
ABench-Physics emphasizes advanced quantitative reasoning and robustness under controlled variation. It contains 500 carefully verified problems, split into Phy_A_fixed_400 and Phy_B_dynamic_100. Every problem requires a numerical final answer, and correctness uses the explicit rule
$$\frac{\lvert a_{\text{pred}} - a_{\text{ref}} \rvert}{\lvert a_{\text{ref}} \rvert} \le 0.01,$$
so the relative error must not exceed 1%. The dynamic subset perturbs numerical constants while preserving the physical model, and a template counts as correct only if all regenerated variants are solved. The best static score on Phy_A is 43.0%, and the paper reports an average static-to-dynamic decline of 22.5%, using that drop as evidence that many models remain brittle under controlled parameter changes (Zhang et al., 7 Jul 2025).
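The 1% tolerance and the all-variants rule for dynamic templates can be sketched in a few lines. The function names and the handling of a zero reference value are assumptions for illustration; the numbers are made up.

```python
def within_tolerance(predicted, reference, rel_tol=0.01):
    """True when the relative error is at most rel_tol (1% by default)."""
    if reference == 0:
        # Assumed handling: the source does not describe a zero reference.
        return predicted == 0
    return abs(predicted - reference) / abs(reference) <= rel_tol


def template_solved(variant_results):
    """A dynamic template counts only if every regenerated variant is correct."""
    return all(within_tolerance(p, r) for p, r in variant_results)


print(within_tolerance(9.85, 9.81))                     # relative error about 0.4%
print(template_solved([(9.85, 9.81), (20.5, 19.62)]))   # second variant is off by about 4.5%
```

Requiring every variant to pass is what makes the dynamic subset stricter than the static one: a single perturbed constant that breaks the solution fails the whole template.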
TPBench narrows the focus to theoretical physics, especially high-energy theory and cosmology. Its first iteration contains 57 problems spanning five difficulty levels from Easy Undergrad to Research. Problems do not come from public problem collections, and the benchmark requires the final answer as a Python callable for auto-verification. The strongest models perform well on lower levels but leave research-level questions mostly unsolved; level-5 average scores remain about 15% for the strongest systems. The paper therefore presents TPBench as a benchmark of expert theoretical-physics reasoning rather than broad physics literacy (Chung et al., 19 Feb 2025).
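To make the "answer as a Python callable" idea concrete, the sketch below checks a submitted function against a reference function on a few inputs. The task (a pendulum period), the function names, the test points, and the tolerance are all hypothetical; TPBench's actual verification harness is not reproduced here.

```python
import math


def verify_callable(candidate, reference, test_points, rel_tol=1e-9):
    """Compare a submitted function with a reference function on given inputs."""
    for args in test_points:
        try:
            got = candidate(*args)
        except Exception:
            return False  # a crashing submission counts as incorrect
        if not math.isclose(got, reference(*args), rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True


# Hypothetical task: period of a simple pendulum of length L under gravity g.
def reference_answer(L, g):
    return 2 * math.pi * math.sqrt(L / g)


def submitted_answer(L, g):
    return 2 * math.pi * (L / g) ** 0.5


points = [(1.0, 9.81), (2.5, 1.62)]
print(verify_callable(submitted_answer, reference_answer, points))
```

Checking a function instead of a string sidesteps the problem of deciding whether two algebraically different expressions are equivalent, at the cost of only testing a finite set of inputs.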
SymPyBench expands the code-driven model. It contains 15,045 university-level physics problems with a 90/10 train/test split, three question formats (MC-Symbolic, MC-Numerical, and free-form), and executable Python ground truth for any parameter setting. Beyond standard accuracy, it defines Consistency Score, Confusion Rate, and Complete Failure Rate over controlled problem variants. This makes it especially diagnostic for stability under paraphrase and numerical perturbation, an issue that many static “Physics-IQ” candidates do not expose directly (Imani et al., 5 Dec 2025).
5. Multimodal, Olympiad, and diagram-dependent benchmarks
Another major branch targets visual grounding in academic physics. PhysUniBench is a dedicated undergraduate-level multimodal benchmark with 3,304 questions, 3,304 images, 8 major sub-disciplines, 2,057 open-ended items, and 1,247 multiple-choice items. It is bilingual in English and Chinese and uses a model-in-the-loop curation pipeline with 16 independent roll-outs from Qwen2.5-VL-72B to filter easy items and calibrate a nearly balanced five-level difficulty distribution. Performance remains low: GPT-o4-mini reaches 36.7% on MC and 26.5% on OE, while open-ended Quantum Mechanics is near collapse across models (Wang et al., 21 Jun 2025).
Multi-Physics focuses on Chinese high-school multimodal reasoning. It contains 1,412 image-associated multiple-choice questions and 1,438 images across 11 high-school physics subjects and 5 difficulty levels. Its distinctive contribution is a dual evaluation framework combining answer accuracy with Average Step Accuracy (ASA) and Average Step Count (ASC) for chain-of-thought integrity. Under image-conditioned evaluation, Gemini-2.5-Pro reaches 78.4 ACC and 85/5.0 ASA/ASC, while performance without images drops markedly. The benchmark’s core claim is that correct answers can conceal flawed reasoning, and that image-conditioned physics competence is not the same as text-only competence (Luo et al., 19 Sep 2025).
SeePhys is broader in academic range, spanning 2,000 questions, 2,245 images, 7 fundamental domains, 21 categories of highly heterogeneous diagrams, and knowledge levels from middle school to PhD qualifying exams. Its defining statistic is that 75% of the benchmark is vision-essential: the diagram contains indispensable problem-solving information. Evaluation under Text+Vision, Text+Caption, Text Only, and Vision Only settings shows that even the strongest multimodal models remain limited; Gemini-2.5-Pro reaches 54.9% overall and 49.0% on the vision-essential subset. The benchmark’s main diagnosis is that current systems still struggle to couple diagram interpretation with formal physics reasoning and continue to rely on textual cues as shortcuts (Xiang et al., 25 May 2025).
HiPhO moves the multimodal regime into real Olympiad examinations. It compiles the 13 most recent Olympiad exams from 2024–2025, totaling 360 problems and 519 subquestions, with official medal thresholds and, where available, official marking schemes. Its five physics fields are Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics; its four modality types distinguish text-only, illustration figures, variable figures, and data figures. The benchmark grades each question by taking the higher of an answer-level score and a step-level score,
$$\text{score} = \max\left(s_{\text{answer}},\, s_{\text{step}}\right),$$
so a correct final answer can earn full credit, but wrong final answers may still receive partial credit for valid reasoning. In aggregate, closed-source reasoning MLLMs obtain 6 to 12 gold medals across the 13 exams, while open-source MLLMs mostly remain at or below bronze. HiPhO therefore supplies one of the clearest human-aligned interpretations of “Physics-IQ”: not abstract accuracy, but Olympiad-level performance under official scorelines (Yu et al., 9 Sep 2025).
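A grading rule with this behavior can be sketched as below, assuming each question receives both an answer-level score and a step-level score and that the higher of the two is kept. The point values and medal thresholds are hypothetical placeholders, not official scorelines.

```python
def question_score(answer_points, step_points):
    """Keep the better of the answer-level and step-level scores (assumed rule)."""
    return max(answer_points, step_points)


def medal(total, thresholds):
    """Map an exam total to a medal using per-exam thresholds."""
    for name in ("gold", "silver", "bronze"):
        if total >= thresholds[name]:
            return name
    return "none"


# Hypothetical (answer_points, step_points) per question, out of 10 each.
graded = [(10, 7), (0, 6), (0, 0)]
total = sum(question_score(a, s) for a, s in graded)

# Hypothetical thresholds; real exams publish their own.
thresholds = {"gold": 24, "silver": 15, "bronze": 9}
print(total, medal(total, thresholds))
```

In this toy case the second question has a wrong final answer but still contributes 6 points through its reasoning steps, which moves the total from bronze range into silver.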
6. Intuitive physics, possible–impossible discrimination, and intervention
A separate tradition defines “Physics-IQ” in terms of core perceptual physics rather than academic derivation. The original IntPhys benchmark asks whether a model can assign lower plausibility to videos of impossible events than to carefully matched possible events. It is built in Unreal Engine, contains 15,000 training videos of possible events, and organizes evaluation around three concept blocks: object permanence (O1), shape constancy (O2), and spatio-temporal continuity (O3). Its matched-quadruplet construction is designed so that possible and impossible clips are pixel-matched, forcing the system to rely on temporal coherence rather than static visual artifacts (Riochet et al., 2018).
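The core comparison can be sketched as a pairwise test: for each matched pair, the model is counted as correct when it assigns the possible clip a higher plausibility than the impossible one. This is a simplified illustration with made-up scores; the benchmark's own scoring over matched quadruplets may aggregate differently.

```python
def pairwise_accuracy(pairs):
    """pairs: list of (plausibility_possible, plausibility_impossible)."""
    correct = sum(1 for possible, impossible in pairs if possible > impossible)
    return correct / len(pairs)


# Hypothetical plausibility scores for four matched pairs.
scores = [(0.9, 0.2), (0.6, 0.7), (0.8, 0.1), (0.5, 0.4)]
print(pairwise_accuracy(scores))
```

Because only the ordering within a pair matters, the measure does not depend on how the model calibrates its absolute plausibility values, and a model guessing at random would score about 0.5.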
IntPhys 2 expands this framework to four principles—Permanence, Immutability, Spatio-Temporal Continuity, and Solidity—and increases realism through more complex synthetic environments, moving cameras, richer occlusion, and harder scene design. It contains 1,416 videos: 60 in Debug, 1,012 in Main, and 344 in a held-out set. Humans reach 96.44% overall and 92.44% on held-out, whereas the best predictive model, V-JEPA 2, reaches 57.51%, and the best MLLM, Gemini 2.5 Flash, reaches 55.63%. The benchmark’s central finding is that many systems that appear strong on earlier intuitive-physics tests remain close to chance once occlusion, realism, and memory demands increase (Bordes et al., 11 Jun 2025).
PHYRE turns intuitive physics into an intervention problem. Instead of judging plausibility, an agent must solve 2D physical puzzles by placing one or two balls so that a target relation is satisfied. Each tier contains 25 task templates, each with 100 tasks, for 2,500 tasks per tier. The two main action tiers are PHYRE-B and PHYRE-2B, and evaluation emphasizes sample efficiency through AUCCESS,
$$\text{AUCCESS} = \frac{\sum_{k=1}^{100} w_k\, s_k}{\sum_{k=1}^{100} w_k}, \qquad w_k = \log(k+1) - \log k,$$
where $s_k$ is the fraction of tasks solved within $k$ attempts. Within-template generalization is far easier than cross-template generalization, and two-ball tasks are much harder than one-ball tasks. PHYRE therefore measures a different but complementary component of “Physics-IQ”: goal-directed intervention under simple mechanics with strong pressure for transfer and sample efficiency (Bakhtin et al., 2019).
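A small sketch shows how this metric rewards early solutions. It assumes the weighting from the PHYRE paper, w_k = log(k+1) - log(k) for k from 1 to 100, with the weighted sum of solve fractions normalized by the sum of weights; the two agents compared are hypothetical.

```python
import math


def auccess(solved_within, max_attempts=100):
    """solved_within[k-1] is the fraction of tasks solved within k attempts."""
    weights = [math.log(k + 1) - math.log(k) for k in range(1, max_attempts + 1)]
    total = sum(w * s for w, s in zip(weights, solved_within))
    return total / sum(weights)


# Hypothetical agents: one solves every task on the first attempt,
# the other solves every task only on attempt 51.
early = [1.0] * 100
late = [0.0] * 50 + [1.0] * 50
print(auccess(early), auccess(late))
```

Both agents eventually solve everything, but the logarithmic weights put most of the mass on the first few attempts, so the late solver scores far lower.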
Taken together, IntPhys, IntPhys 2, and PHYRE define a subfield in which “physics intelligence” means object persistence, continuity, contact constraints, and causal intervention, rather than textbook equation solving.
7. Research-oriented, process-level, and formal extensions
Recent work pushes the notion of a Physics-IQ benchmark beyond exam questions and intuitive physics into research workflows, process fidelity, and formal verification. “Towards a Large Physics Benchmark” is the clearest blueprint. It proposes a living, community-built benchmark with three task types: multiple-choice conceptual questions, analytical derivations, and open-ended coding challenges. Each item is expert-scored for correctness, difficulty, and surprise, and these item scores feed the benchmark-level outputs.
The framework is explicitly philosophical as well as technical: it tries to score both scientific understanding and domain-constrained creativity, though the paper remains a proposal rather than a completed large-scale benchmark standard (Barman et al., 29 Jul 2025).
PRL-Bench moves closer to an implemented research benchmark. It is built from 100 curated Physical Review Letters papers from August 2025 to March 2026, covers Astrophysics, Condensed Matter, High-Energy Physics, Quantum Information, and Statistical Physics, allows a code interpreter but disables search, and evaluates six frontier LLMs over five runs per task. The strongest model, Gemini-3.1-Pro, reaches only 44.27 on the 0–100 scale, and the dominant error type is formulaic or conceptual error, typically about 45–55% of failures. PRL-Bench therefore reframes Physics-IQ as frontier-research competence under long-horizon procedural complexity, and finds a large gap between current LLMs and authentic theoretical or computational physics workflows (Miao et al., 16 Apr 2026).
CritPt intensifies this research orientation through unpublished, search-resistant tasks. Its first release contains 71 composite research challenges and 190 checkpoint tasks, authored by 50+ active physics researchers across 30 institutions worldwide. Full challenges are machine-verifiable, but difficult enough that the best base model, GPT-5 (high), reaches only 4.0% average accuracy; code tools raise that to 9.4%, and code plus web to 11.7%. The benchmark’s stricter “consistently solved” criterion—correct in at least four of five runs—drops performance further. CritPt therefore positions Physics-IQ not as broad scientific intelligence, but as the ability to execute exact, research-grade, failure-sensitive reasoning on previously unseen tasks (Zhu et al., 30 Sep 2025).
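The difference between average accuracy and the stricter "consistently solved" criterion can be sketched as follows. The challenge names and run outcomes are hypothetical; only the four-of-five threshold comes from the text.

```python
def consistently_solved(run_outcomes, min_correct=4):
    """True when at least min_correct of the runs are correct."""
    return sum(bool(r) for r in run_outcomes) >= min_correct


def summarize(results):
    """results: challenge name -> list of per-run booleans."""
    n_runs = sum(len(r) for r in results.values())
    average_accuracy = sum(sum(r) for r in results.values()) / n_runs
    consistent_rate = sum(consistently_solved(r) for r in results.values()) / len(results)
    return average_accuracy, consistent_rate


# Hypothetical outcomes for three placeholder challenges, five runs each.
results = {
    "challenge_a": [True, True, True, True, False],
    "challenge_b": [True, False, True, False, False],
    "challenge_c": [False, False, False, False, False],
}
print(summarize(results))
```

Here 6 of 15 runs succeed, but only one of the three challenges meets the four-of-five bar, so the consistent rate is lower than the average accuracy.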
PRISM-Physics targets a different weakness: final-answer-only scoring. It represents reference solutions as directed acyclic graphs (DAGs) of formulas, defines an ancestor-closure process score over those graphs, and couples this with a fully rule-based symbolic formula equivalence checker. In the paper’s human-alignment study on 70 problem-solution pairs, PRISM-DAG reaches a higher Kendall’s $\tau$ than LLM-as-Judge (0.294) and the linear process-scoring baseline (0.213). Empirically, final-answer accuracy is low while step-level achievement is much higher, reinforcing the paper’s claim that answer-only metrics systematically understate partial but meaningful physics reasoning (Zhao et al., 3 Oct 2025).
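One way to read an ancestor-closure score is sketched below. It assumes that when a formula in the reference DAG is matched in a model's solution, all of its prerequisite formulas are credited too, and that the score is the credited fraction of all nodes. This is an illustrative interpretation with a hypothetical DAG and node names, not the paper's exact definition.

```python
def ancestor_closure(parents, matched):
    """Return the matched nodes together with all of their ancestors.

    parents maps each node to the list of formulas it directly depends on.
    """
    credited = set()
    stack = list(matched)
    while stack:
        node = stack.pop()
        if node in credited:
            continue
        credited.add(node)
        stack.extend(parents.get(node, []))
    return credited


def process_score(parents, matched):
    """Fraction of reference formulas credited after ancestor closure (assumed form)."""
    return len(ancestor_closure(parents, matched)) / len(parents)


# Hypothetical reference solution with five formulas.
dag = {
    "energy_conservation": [],
    "launch_speed": ["energy_conservation"],
    "kinematics": [],
    "range": ["launch_speed", "kinematics"],
    "final_answer": ["range"],
}
print(process_score(dag, {"launch_speed"}))
```

A solution that reaches the launch speed but goes no further is credited for that formula and the energy-conservation step it depends on, which is how step-level credit can stay well above final-answer accuracy.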
Lean4PHYS moves into fully formal reasoning. It introduces LeanPhysBench, a Lean4 benchmark of 200 hand-crafted and peer-reviewed statements derived from university textbooks and competition problems, together with PhysLib, a community-driven Lean4 library for unit systems and core theorems. The abstract reports that DeepSeek-Prover-V2-7B achieves only 16% and Claude-Sonnet-4 achieves 35%, while the paper’s main table reports Gemini-2.5-Pro at 39.50% with PhysLib. The same paper reports that PhysLib yields an average improvement of 11.75%. Whatever number is taken as canonical, the result is the same: formal, unit-aware, college-level physics reasoning in Lean4 remains difficult, and the benchmark captures a very strict notion of domain-grounded symbolic competence (Li et al., 30 Oct 2025).
Across these research, process, and formal benchmarks, a common conclusion emerges. “Physics-IQ” is best understood not as a single psychometric scalar, but as a family of domain-specific evaluations of physics reasoning, physical world modeling, and research competence. Some benchmarks emphasize breadth and public scale; some emphasize multimodal grounding; some emphasize intuitive object physics; some emphasize research workflows, process scoring, or theorem proving. The field has not yet converged on one universal standard, but it has converged on a shared diagnosis: benchmarking physics by final-answer accuracy alone is no longer sufficient.