PhysicsEval: LLM Physics Benchmark
- PhysicsEval is a systematically curated benchmark comprising nearly 20,000 physics problems from high-school to graduate levels designed to evaluate LLM reasoning.
- The benchmark employs a six-dimensional rubric and multi-agent verification techniques to rigorously assess solution correctness and reasoning fidelity.
- Results show that inference-time verification yields measurable improvements in LLM performance, particularly in mathematical accuracy, logical consistency, and principled application of physics formulas.
PhysicsEval is a large-scale, systematically curated benchmark and evaluation protocol designed to assess and improve the ability of LLMs to solve physics problems. It serves both as a dataset comprising thousands of high-school through graduate-level physics problems and as a testbed for a taxonomy of inference-time verification techniques, including agentic multi-LLM frameworks. By enforcing fine-grained, rubric-based evaluation and introducing cumulative solution-checking protocols, PhysicsEval rigorously quantifies both solution correctness and reasoning fidelity in LLMs tasked with physics problem solving (Siddique et al., 31 Jul 2025).
1. Benchmark Construction and Dataset Structure
PhysicsEval consists of 19,609 open-ended physics problems, split into a training set of 17,647 problems (about 90%) and a reserved evaluation set of 1,962 problems (about 10%). Each problem carries a difficulty label from 1 to 10 (mean ≈ 5.72), and solutions average about 3,830 tokens and 3.9 logical steps.
Topical Breadth and Provenance
Problems encompass 19 high-level physics domains, including but not limited to:
- Classical Mechanics and Dynamics (443)
- Electromagnetism & Circuit Theory (259)
- Thermodynamics & Heat Transfer (234)
- Optics & Wave Phenomena (132)
- Relativity & Gravitation (68)
- Quantum Mechanics & Atomic Physics (155)
- Astrophysics, Cosmology & Space Science (105)
- Statistical Mechanics & Kinetic Theory (18)
- Eleven additional categories (e.g., Food Physics, Culinary Science)
Problem sources include online physics-education forums (notably Physics Stack Exchange), university/exam-preparation sites, and 20 canonical physics textbooks (e.g., Halliday & Resnick, Giancoli, Griffiths, Taylor, Carroll).
Data Format
Each instance comprises the following fields:
- problem_id (UUID)
- problem (raw text)
- simplified_problem_statement
- category, key tags (e.g., “steady state”)
- elaborated_solution_steps (LaTeX-annotated)
- alternative_solutions (if present)
- problem_difficulty ∈ [1,10]
- final_answers_in_brief
- steps (logical step count)
- source (textbook or URL)
Solution formats are exclusively open-ended and require either numeric (with units) or symbolic answers, accompanied by stepwise LaTeX-encoded derivations.
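The record layout above can be sketched as a typed container. This is a minimal illustration, not the dataset's official loader: the field types, the `key_tags` name for the "key tags" field, the integer difficulty, and the example values are all assumptions.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhysicsProblem:
    """One benchmark record; field names follow the list above, types are assumed."""
    problem_id: str                    # UUID, stored here as a string
    problem: str                       # raw problem text
    simplified_problem_statement: str
    category: str
    key_tags: List[str]                # hypothetical name for the "key tags" field
    elaborated_solution_steps: str     # LaTeX-annotated derivation
    problem_difficulty: int            # assumed to be an integer in [1, 10]
    final_answers_in_brief: str
    steps: int                         # logical step count
    source: str                        # textbook or URL
    alternative_solutions: Optional[str] = None  # only present for some problems

    def __post_init__(self) -> None:
        uuid.UUID(self.problem_id)     # raises ValueError if not a valid UUID
        if not 1 <= self.problem_difficulty <= 10:
            raise ValueError("problem_difficulty must be in [1, 10]")
        if self.steps < 0:
            raise ValueError("steps must be non-negative")


# Hypothetical record, invented purely for illustration.
example = PhysicsProblem(
    problem_id="123e4567-e89b-12d3-a456-426614174000",
    problem="A rod conducts heat between two reservoirs. Find the heat current.",
    simplified_problem_statement="Steady-state heat conduction through a rod.",
    category="Thermodynamics & Heat Transfer",
    key_tags=["steady state"],
    elaborated_solution_steps=r"Step 1: $H = kA\,\Delta T / L$ ...",
    problem_difficulty=4,
    final_answers_in_brief=r"$H = kA\,\Delta T / L$",
    steps=3,
    source="(hypothetical textbook reference)",
)
```

Validating the identifier and difficulty range at construction time catches malformed records early, before they reach an evaluation run.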
2. Evaluation Rubric and Physics Proficiency Score
LLM-generated solutions are holistically rated via a six-dimensional ordinal rubric on the following aspects:
- Mathematical Accuracy (MA): Correctness of calculations, units, and numerics.
- Logical Consistency (LC): Stepwise logical flow and argument validity.
- Completeness (C): Coverage of all sub-questions or components.
- Clarity & Coherence (CC): Readability and organization of the solution.
- Formulas & Principles (FP): Correct identification and application of physical principles.
- Assumptions Made (A): Explicitness and validity of assumed conditions.
These dimensions are aggregated into a single, normalized Physics Proficiency Score (PPS). Statistical significance of improvements is established via paired t-tests per metric, with p-values below a fixed threshold taken as significant (Siddique et al., 31 Jul 2025).
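As a rough illustration of the paired t-test mentioned here, the statistic can be computed from per-problem score differences using only the standard library. The function name and the sample scores are hypothetical; this sketch is not the authors' analysis code.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    """Paired t statistic for the mean per-problem difference (after - before)."""
    if len(before) != len(after):
        raise ValueError("both samples must score the same problems")
    if len(before) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(after, before)]
    sd = stdev(diffs)                  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical rubric scores for four problems, before and after a review step.
baseline_scores = [3.0, 4.0, 2.0, 5.0]
reviewed_scores = [4.0, 4.0, 3.0, 5.0]
t_value = paired_t_statistic(baseline_scores, reviewed_scores)
```

The statistic is compared against a Student's t distribution with n − 1 degrees of freedom to obtain a p-value. If SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value.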
3. Inference-Time Verification Techniques
Several inference-time methodologies are investigated, ranging from naive prompting to multi-agent verification. All techniques use only inference-time interventions; no fine-tuning is performed.
- Baseline: The LLM is prompted directly with the problem plus requests for step-by-step solution and LaTeX equations; no validation or feedback is applied.
- Chain-of-Thought (CoT): Standard instruction to articulate reasoning before providing the answer, intended to elicit more systematic computation.
- Self-Refinement: The solver LLM is prompted to review its initial answer by role-play (“You are a physics professor... check for mistakes”), then re-solves. This is a self-critique loop with no external input.
- Single-Agent Review: A verifier LLM (Qwen3-32B) is presented with the problem and the initial solution and returns a list of errors; the original solver is then prompted with this critique to generate a refined answer.
- Multi-Agent Review Framework: The initial solution is reviewed sequentially by several independent verifier LLMs (Phi-4, DeepSeek-R1, Qwen3-14B). Each verifier outputs both a scalar rating and a structured list of mistakes. A MetaVerifier aggregates the mistakes, and the individual ratings are combined into a single aggregate rating.
If any errors are flagged, the solver is re-prompted with all critiques to generate a final solution. All feedback is exchanged in structured, traceable formats (JSON, bullet lists).
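The control flow of the multi-agent review can be sketched with plain callables standing in for the LLMs. Everything named here is hypothetical: the `Review` type, the function signatures, the toy solver and verifiers, and in particular the use of a simple mean to combine ratings, which is an assumption rather than the paper's rule.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple


@dataclass
class Review:
    rating: float          # scalar rating from one verifier
    mistakes: List[str]    # structured list of flagged mistakes


Solver = Callable[[str, List[str]], str]    # (problem, critiques) -> solution
Verifier = Callable[[str, str], Review]     # (problem, solution) -> review


def multi_agent_review(
    problem: str, solver: Solver, verifiers: Sequence[Verifier]
) -> Tuple[str, float, List[str]]:
    """Solve, collect reviews one verifier at a time, re-solve if anything is flagged."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    initial = solver(problem, [])
    reviews = [verify(problem, initial) for verify in verifiers]
    # Meta-verification stand-in: merge mistakes, dropping duplicates, keeping order.
    merged = list(dict.fromkeys(m for r in reviews for m in r.mistakes))
    # ASSUMPTION: ratings are combined with a simple mean.
    combined_rating = mean(r.rating for r in reviews)
    final = solver(problem, merged) if merged else initial
    return final, combined_rating, merged


# Toy stand-ins for LLM calls, for illustration only.
def toy_solver(problem: str, critiques: List[str]) -> str:
    if not critiques:
        return "v1"
    return "v2 (addressed: " + "; ".join(critiques) + ")"


def unit_checker(problem: str, solution: str) -> Review:
    return Review(rating=3.0, mistakes=["missing units"])


def assumption_checker(problem: str, solution: str) -> Review:
    return Review(rating=4.0, mistakes=["missing units", "unstated assumption"])


final, rating, issues = multi_agent_review(
    "toy problem", toy_solver, [unit_checker, assumption_checker]
)
```

In a real pipeline each callable would wrap a model call that returns structured output, such as JSON, so that every critique stays traceable to the verifier that produced it.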
4. Quantitative Results and Statistical Insights
The PhysicsEval study provides comparative results across methods and difficulty tiers:
| Method | Easy (1–4) | Medium (5–7) | Hard (8–10) |
|---|---|---|---|
| Baseline | 88.96 | 79.21 | 66.20 |
| Self-Refine | 89.98 | 80.38 | 67.06 |
| SingleAgent | 90.83 | 81.10 | 67.18 |
| MultiAgent | 91.78 | 81.48 | 68.23 |
Multi-agent verification produces the largest gains over the baseline in every difficulty tier (Easy: +2.82 pp, Medium: +2.27 pp, Hard: +2.03 pp), with the absolute gain shrinking slightly as difficulty increases. Paired t-test analyses confirm the statistical significance of these improvements in overall PPS, mathematical accuracy (MA), and use of formulas and principles (FP) for MultiAgent versus Baseline.
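The per-tier gains can be recomputed directly from the table. The short sketch below copies the table values and maps a difficulty label to its tier using the ranges in the column headers; the helper names are hypothetical.

```python
from typing import Dict

# Scores copied from the table above, keyed by method and difficulty tier.
SCORES: Dict[str, Dict[str, float]] = {
    "Baseline":    {"Easy": 88.96, "Medium": 79.21, "Hard": 66.20},
    "Self-Refine": {"Easy": 89.98, "Medium": 80.38, "Hard": 67.06},
    "SingleAgent": {"Easy": 90.83, "Medium": 81.10, "Hard": 67.18},
    "MultiAgent":  {"Easy": 91.78, "Medium": 81.48, "Hard": 68.23},
}


def tier(difficulty: int) -> str:
    """Map a 1-10 difficulty label to the tier names used in the table."""
    if not 1 <= difficulty <= 10:
        raise ValueError("difficulty must be in [1, 10]")
    if difficulty <= 4:
        return "Easy"
    if difficulty <= 7:
        return "Medium"
    return "Hard"


def gain_over_baseline(method: str) -> Dict[str, float]:
    """Absolute gain in percentage points for each tier, rounded to 2 decimals."""
    base = SCORES["Baseline"]
    return {t: round(SCORES[method][t] - base[t], 2) for t in base}


multi_agent_gain = gain_over_baseline("MultiAgent")
# multi_agent_gain == {"Easy": 2.82, "Medium": 2.27, "Hard": 2.03}
```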
Ablation studies indicate that:
- Self-refinement alone often does not correct, and sometimes reinforces, initial errors.
- Single-agent review addresses some completeness issues but can reduce solution clarity if critique is inadequate.
- Multi-agent frameworks yield the most robust, statistically significant improvements in both overall proficiency and sub-dimensions of accuracy and principled reasoning.
5. Qualitative Examples and Persistent Error Modes
Sample problems illustrate the framework’s capacities and known limitations:
- Photon Emission Rate (Astrophysics): All verifiers concurred on the baseline LLM’s correct answer; the multi-agent protocol affirmed both numerics and unit handling.
- Toroidal Inductor: The initial solution omitted mention of non-uniform fields; the meta-verifier identified this, prompting an explicit note on the field’s uniformity approximation in the assumptions.
- Residual Error Modes: Even with multi-agent verification, dimensional inconsistency remains in ~5% of hard problems; over-simplified assumptions persist if not caught by all verifiers; complex, diagram-based reasoning (e.g., in multi-body mechanics) often results in logical gaps not fully mitigated by the review process.
6. Relation to Other Physics Reasoning Benchmarks
PhysicsEval advances over prior physical reasoning benchmarks (e.g., PHYRE (Bakhtin et al., 2019)) by focusing on open-ended physics problems that require analytic derivation and symbolic manipulation, as opposed to action-oriented, simulation-based task-solving. PHYRE centers on sample efficiency, policy generalization, and one-step interventions, whereas PhysicsEval specifically targets the textual reasoning and derivational proficiency of LLMs and scores physics problem-solving capacity along six evaluation axes.
7. Significance and Future Directions
PhysicsEval establishes a blueprint for rigorous, fine-grained evaluation of LLM physics reasoning via a large, multi-domain corpus and a sophisticated agentic review protocol. The combination of the six-dimensional rubric, multi-agent verification, and extensive empirical validation enables detailed error analysis and systematic benchmarking across problem types and model families. As LLMs continue to serve as tools in scientific analysis and education, the PhysicsEval protocol provides a reference framework for measuring and improving automated physics reasoning at scale (Siddique et al., 31 Jul 2025).