PHYBench: AI Physical Reasoning Benchmarks

Updated 19 July 2025
  • PHYBench is a comprehensive suite of benchmarks that rigorously test AI models’ physical reasoning and problem-solving across text and image modalities.
  • It evaluates performance on physical commonsense in T2I generation, multi-step symbolic problem solving, and dynamic numerical variants using advanced scoring metrics.
  • The benchmarks highlight current model limitations and guide research towards integrating symbolic computation, chain-of-thought reasoning, and robust generalization strategies.

PHYBench refers to a set of recent evaluation benchmarks aimed at rigorously quantifying the physical reasoning and physics problem-solving abilities of artificial intelligence models, particularly large language models (LLMs) and text-to-image (T2I) systems. Several distinct benchmarks have adopted the name or close variants of it, each targeting different modalities and aspects of physics: physical commonsense in image generation, step-by-step symbolic or numerical problem solving, and multimodal physical reasoning focused on visual scenarios. The benchmarks span diverse domains, including mechanics, optics, thermodynamics, electromagnetism, and materials science, and collectively delineate the current limitations and research directions for AI systems aspiring to handle authentic physical modeling, reasoning, and perception.

1. Benchmark Variants and Their Scopes

Three major benchmarks (PhyBench for text-to-image generation, PHYBench for LLMs, and the closely allied TPBench), together with related efforts such as PhyX and ABench-Physics, structure the landscape:

  • "PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models" (Meng et al., 17 Jun 2024): Focuses on T2I systems' ability to reflect physical commonsense through visual outputs. It centers primarily on fundamental physical phenomena depicted in image synthesis.
  • "PHYBench: Holistic Evaluation of Physical Perception and Reasoning in LLMs" (Qiu et al., 22 Apr 2025): Measures LLMs' capacity for symbolic reasoning and physical perception via curated textual physics problems, with a systematic scoring approach that rewards partial correctness.
  • "Theoretical Physics Benchmark (TPBench)" (Chung et al., 19 Feb 2025): Targets the ability of AI models to solve high-level theoretical physics problems, measuring performance across a graded spectrum from undergraduate to research-level, with automatic symbolic verification.
  • Related Benchmarks: PhyX (Shen et al., 21 May 2025) and ABench-Physics (Zhang et al., 7 Jul 2025), while separate in nomenclature, share the foundational goal of assessing physical reasoning, particularly in multimodal or more dynamic problem domains.

Each benchmark adopts a domain-specific design, testing for aspects such as commonsense, perception, robust reasoning, and the ability to generalize to new or perturbed problems.

2. Dataset Construction and Problem Coverage

Commonsense and T2I (PhyBench (Meng et al., 17 Jun 2024)):

  • 700 prompts drawn from fundamental physical science: mechanics (e.g., gravity, buoyancy), optics (reflection, refraction), thermodynamics (e.g., state changes), and material properties (e.g., viscosity, combustibility).
  • Each prompt describes a scenario intended to test whether a model’s generated image aligns with physical expectations, without making the relevant physics explicit in the phrasing.
  • 31 distinct physical scenarios, including classic examples such as the imbalance of a seesaw with an elephant and a mouse, or the deformation of a plastic bottle under pressure.

Textual Physics Problem Solving (PHYBench (Qiu et al., 22 Apr 2025), TPBench (Chung et al., 19 Feb 2025)):

  • PHYBench (Qiu et al., 22 Apr 2025): 500 original, carefully curated physics problems ranging from high school to Olympiad level, spanning mechanics, electromagnetism, optics, thermodynamics, and modern physics. Each problem requires stepwise derivation and yields a unique, symbolic answer, usually in LaTeX format.
  • TPBench (Chung et al., 19 Feb 2025): 57 novel, graduate/research-level theoretical physics problems, categorized into five difficulty levels (from “Easy Undergrad” to “Research”). Each problem includes an expert solution and Python-based auto-verification for the symbolic answer’s correctness.
  • Both benchmarks ensure that problem phrasing is unambiguous and require that models not only retrieve answers but demonstrate authentic reasoning capabilities.

Multimodal and Generalization Benchmarks (PhyX (Shen et al., 21 May 2025), ABench-Physics (Zhang et al., 7 Jul 2025)):

  • PhyX: 3,000 multimodal physics questions spanning six core domains and 25 subdomains, structured for both open-ended and multiple-choice evaluation. Each instance includes schematic images and tests model integration of visual and textual information.
  • ABench-Physics: Divided between a static set (400 graduate/Olympiad problems) and a dynamic set (100 problems, each with numerical variants), challenging models to generalize reasoning to perturbed scenarios and to achieve numerical precision within stringent tolerances.

3. Evaluation Methodologies and Metrics

Physical Commonsense in Images (T2I):

  • Dual-score system: Scene score (0–2) for visual fidelity; Physical correctness score (0–3) for the accuracy of depicted physics.
  • Evaluation is automated using GPT-4o with item-specific, scenario-targeted grading instructions (the PhyEvaler framework). In scenarios where object location is critical, the spatial arrangement of objects in generated images is annotated using tools such as GroundingDINO to reduce hallucinations and support more reliable judgement.
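
To illustrate how the two axes of this protocol combine, the sketch below shows a per-prompt score record in Python; the field names, the simple sum aggregation, and the example values are illustrative assumptions rather than PhyBench's published schema.

```python
# Minimal sketch of a per-prompt record in a dual-score T2I evaluation.
# Assumptions: field names and the sum aggregation are illustrative,
# not PhyBench's exact implementation.
from dataclasses import dataclass

@dataclass
class DualScoreResult:
    prompt: str
    scene_score: int     # 0-2: does the image faithfully depict the described scene?
    physics_score: int   # 0-3: does the depicted outcome obey the relevant physics?

    def total(self) -> int:
        # Overall score out of 5; the benchmark may also report the two axes separately.
        return self.scene_score + self.physics_score

# Example: the image shows the seesaw but tips it the wrong way.
result = DualScoreResult(
    prompt="An elephant and a mouse sit on opposite ends of a seesaw.",
    scene_score=2,
    physics_score=1,
)
print(result.total())  # 3 out of 5
```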

Symbolic and Numerical Answer Scoring (LLMs):

  • PHYBench (Qiu et al., 22 Apr 2025): Introduces the Expression Edit Distance (EED) score, which compares the model’s generated LaTeX expression with the ground truth, after parsing into expression trees. EED rewards partial matches, not just binary correctness:

r = \frac{\mathrm{EditDistance}(T_{\text{gt}},\, T_{\text{gen}})}{|T_{\text{gt}}|}

where $T_{\text{gt}}$ and $T_{\text{gen}}$ are the expression trees of the ground-truth and generated answers. A piecewise scoring function awards 100 for a perfect match ($r = 0$), decreases linearly for $0 < r < 0.6$, and gives zero otherwise (a minimal sketch of this scoring appears after this list).

  • TPBench (Chung et al., 19 Feb 2025): Relies on auto-verifiability through executable Python code that checks answers for mathematical equivalence, alongside holistic grading of the full reasoning chain.
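
As a concrete illustration of EED-style scoring, the Python sketch below parses two answers into expression trees and maps their relative edit distance to a 0–100 score. It is a minimal sketch under stated assumptions: plain sympy expressions stand in for parsed LaTeX, the tree edit distance is a naive recursive approximation rather than PHYBench's exact algorithm, and the linear decay is assumed to reach zero at r = 0.6.

```python
# Minimal sketch of an EED-style score. Assumptions: sympy expressions stand in
# for parsed LaTeX, the tree edit distance is a naive recursive approximation
# (not PHYBench's exact algorithm), and the linear decay maps r = 0.6 to zero.
import sympy as sp

def tree_size(expr) -> int:
    """Number of nodes in a sympy expression tree."""
    return 1 + sum(tree_size(arg) for arg in expr.args)

def node_label(expr):
    """Leaves (symbols, numbers) are labelled by themselves; internal nodes by their operator."""
    return expr if not expr.args else expr.func

def tree_edit_distance(a, b) -> int:
    """Crude edit distance: a mismatched node costs 1, children are aligned
    positionally, and unmatched subtrees cost their full size."""
    if a == b:
        return 0
    cost = 0 if node_label(a) == node_label(b) else 1
    for i in range(max(len(a.args), len(b.args))):
        if i >= len(a.args):
            cost += tree_size(b.args[i])          # insertion of a missing subtree
        elif i >= len(b.args):
            cost += tree_size(a.args[i])          # deletion of an extra subtree
        else:
            cost += tree_edit_distance(a.args[i], b.args[i])
    return cost

def eed_score(gt_str: str, gen_str: str) -> float:
    """Map the relative edit distance r to a 0-100 score: 100 at r = 0,
    linear decay for 0 < r < 0.6, and 0 otherwise."""
    gt, gen = sp.sympify(gt_str), sp.sympify(gen_str)
    r = tree_edit_distance(gt, gen) / tree_size(gt)
    if r == 0:
        return 100.0
    return 100.0 * (1 - r / 0.6) if r < 0.6 else 0.0

print(eed_score("m*g*sin(theta)", "m*g*sin(theta)"))  # 100.0 (exact match)
print(eed_score("m*g*sin(theta)", "m*g*cos(theta)"))  # ~66.7 (one wrong node, partial credit)
```

TPBench-style auto-verification targets the stricter case of exact mathematical equivalence; with sympy, a basic version of such a check is `sp.simplify(gt - gen) == 0`, although the benchmark's actual checkers are problem-specific Python routines.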

Multimodal and Robustness Testing:

  • PhyX and ABench-Physics employ multi-step evaluation protocols, often with chain-of-thought prompting, rule-based answer extraction, and LLM-based judging for open-ended scenarios.
  • ABench-Physics explicitly tests generalization by requiring correct solutions across all numerical variants for dynamic problems, with answers accepted only within a tight relative error margin (≤ 1%).
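
To make the acceptance rule concrete, the sketch below implements a ≤ 1% relative-error check and the all-variants-must-pass criterion for dynamic problems; the function names and data layout are illustrative assumptions, not ABench-Physics's released code.

```python
# Minimal sketch of a strict-tolerance numerical check and the
# all-variants-must-pass rule for dynamic problems. Assumptions: function
# names and data layout are illustrative, not ABench-Physics's actual code.
def within_tolerance(predicted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Accept an answer only if its relative error is at most rel_tol (1% by default)."""
    return abs(predicted - reference) <= rel_tol * abs(reference)

def solves_dynamic_problem(predictions: list[float], references: list[float]) -> bool:
    """A dynamic problem counts as solved only if every numerical variant is correct."""
    return all(within_tolerance(p, r) for p, r in zip(predictions, references))

# Three perturbed variants of the same underlying problem:
# the last prediction is off by roughly 2%, so the whole problem is marked unsolved.
print(solves_dynamic_problem([9.83, 4.91, 19.20], [9.81, 4.90, 19.62]))  # False
```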

4. Model Performance and Observed Limitations

Commonsense in T2I:

  • Most state-of-the-art T2I models adequately replicate the superficial features of scenes but frequently fail to apply correct physics; performance is strongest in optics, where extensive training data is available, and weaker in other domains.
  • Proprietary models, such as DALL-E 3, surpass open-source systems in both scene and physical correctness scores, but even advanced models lack genuine reasoning about physical scenarios.

LLMs on Symbolic Physics:

  • On PHYBench (Qiu et al., 22 Apr 2025), even strong LLMs such as Gemini 2.5 Pro achieve only 36.9% accuracy (EED ≈ 49.5), whereas human experts attain 61.9% accuracy (EED ≈ 70.4). On TPBench, state-of-the-art models perform near 100% on easy undergraduate tasks, with success rates dropping to 15–20% on research-level problems (Chung et al., 19 Feb 2025).
  • Main failure modes include algebraic errors, logical missteps, over-reliance on memorized answers, and breakdown of multi-step reasoning chains.

Multimodal and Generative Generalization:

  • On PhyX (Shen et al., 21 May 2025), GPT-4o and comparable models achieve 32.5–45.8% on open-ended questions, with human experts exceeding 75%; errors are roughly split between visual misunderstanding, misinterpretation of physics, and computational mistakes.
  • ABench-Physics (Zhang et al., 7 Jul 2025) reveals a 22.5% performance drop from static to dynamic (numerically perturbed) variants, exposing how current models depend on memorized patterns rather than robust physical modeling.

5. Benchmark Impact and Comparison

These benchmarks collectively establish robust new baselines for evaluating both the literal and conceptual understanding of physics by AI models:

| Benchmark | Modalities | Problem Types | Evaluation Specifics |
|---|---|---|---|
| PhyBench [2406] | T2I | Physical commonsense in images | Scene + physical correctness scores |
| PHYBench [2504] | LLMs | Symbolic, multi-step textual | EED scoring, multi-step reasoning |
| TPBench [2502] | LLMs | Theoretical/research-level | Auto-verification + holistic grading |
| PhyX [2505] | Multimodal | Visual + textual, diverse tasks | MC/open-ended, multimodal, error analysis |
| ABench-Physics | LLMs | Static + dynamic numerical | Numerical answers with strict tolerance |

Compared to prior efforts (e.g., math benchmarks like AIME or OlympiadBench), PHYBench and its counterparts foreground real-world physics, stepwise reasoning, symbolic fidelity, and the ability to generalize or adapt to altered scenarios.

6. Key Findings and Research Directions

  • Current AI models, despite substantial advances, show persistent gaps in genuine physical reasoning, especially in the face of implicit instructions, multi-step symbolic derivations, visual-physical integration, and problem generalization.
  • Prompt rewriting (making implicit physics explicit) dramatically improves T2I correspondence to expected outcomes, indicating an absence of deeply internalized physical commonsense (Meng et al., 17 Jun 2024).
  • Innovations like EED scoring allow finer discrimination of model errors, moving beyond binary correctness to reward partial step fidelity (Qiu et al., 22 Apr 2025).
  • The repeated observation that model performance drops on dynamic or out-of-distribution variants (as in ABench-Physics) underscores the need for new training paradigms and the potential of reinforcement learning or hybrid systems that integrate symbolic computation tools.
  • The benchmarks encourage expansion toward more open-ended, real-world, and multimodal problems and advocate for systematic integration with computer algebra systems and rigorous chain-of-thought frameworks.

7. Accessibility and Community Engagement

The major PHYBench-family datasets and results are publicly available or slated for release:

  • PhyBench for T2I: To be released (Meng et al., 17 Jun 2024)
  • PHYBench (LLMs, EED scoring): [phybench-official.github.io/phybench-demo/]
  • TPBench: [tpbench.org]
  • PhyX: [phyx-bench.github.io]
  • ABench-Physics: Upon publication

These resources include sample problems, expert solutions, evaluation scripts, and dashboards for comparing model results. Protocols are established to avoid data leakage into model training and promote transparent reporting and standardized evaluation. Continuous community feedback and problem contribution are supported, particularly in TPBench.

Conclusion

PHYBench and its related benchmarks set the contemporary standard for holistic, multi-level, and multimodal evaluation of physical reasoning in AI systems. By combining scenario-based commonsense, symbolic manipulation, rigorous scoring, generative problem design, and domain coverage from introductory to research-level physics, these tools both diagnose present limitations and direct future research. They emphasize the need for AI models to go beyond proficient pattern replication toward robust, interpretable, and scientifically grounded physical reasoning and world modeling.
