PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models (2504.16074v2)

Published 22 Apr 2025 in cs.CL

Abstract: Current benchmarks for evaluating the reasoning capabilities of LLMs face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.

PHYBench is a new benchmark and evaluation framework designed to rigorously assess the capabilities of LLMs in understanding and reasoning about physical phenomena (Qiu et al., 22 Apr 2025). The paper argues that existing benchmarks for LLM reasoning often suffer from limitations such as oversimplified tasks, overly abstract problems detached from real-world physics, and inadequate evaluation metrics that fail to capture nuanced understanding.

To address these limitations, PHYBench introduces:

  1. A High-Quality Benchmark Dataset: PHYBench consists of 500 physics problems carefully curated from real-world physical scenarios. These problems span various domains including mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, covering a wide range of difficulty levels from high school to undergraduate and Physics Olympiad challenges. The questions are designed to require models to understand the physical situation from text, construct spatial and interaction relationships, apply relevant physics laws, and perform complex symbolic calculations to derive an expression for a specific physical quantity. The problems are designed to be solvable purely from the text description and require a single, unambiguous symbolic expression as the answer in LaTeX format. The curation process involved multiple rounds of contributions, review, refinement, and testing against both LLMs and human experts to ensure high quality and prevent data leakage.
  2. A Novel Evaluation Metric: Expression Edit Distance (EED) Score: Recognizing the limitations of traditional binary (correct/incorrect) scoring, PHYBench proposes the EED Score. This metric measures the similarity between the model-generated symbolic expression and the ground truth expression by calculating the edit distance between their tree representations.
    • The process involves converting LaTeX expressions to SymPy-compatible forms.
    • A basic binary score (100 for exact equivalence after simplification, 0 otherwise) is calculated first.
    • For detailed scoring, expressions are converted into tree structures (using SymPy's internal representation).
    • An extended Zhang-Shasha algorithm is used to compute the minimum number of node-level edits (insertions, deletions, updates, including discounted costs for subtree operations) required to transform the generated expression tree into the ground truth tree.
    • The final EED Score is derived from the relative edit distance (edit distance normalized by the size of the ground truth tree), providing a continuous score that rewards partial correctness and reflects the degree of similarity in the expression's structure.
    • The EED Score allows for more granular evaluation, capturing nuanced differences in reasoning that might lead to minor errors (e.g., coefficient errors) rather than complete failures.
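The tree-edit computation behind the EED Score can be prototyped with SymPy plus an off-the-shelf Zhang-Shasha implementation. The sketch below is an illustration under assumptions, not the authors' code: it uses the third-party `zss` package, invented helper names (`to_tree`, `tree_size`, `eed_score`), and a simple linear mapping from relative edit distance to a 0-100 score, omitting the paper's discounted costs for subtree operations.

```python
# Minimal EED-style scoring sketch (assumes: pip install sympy zss).
import sympy as sp
from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

def to_tree(expr: sp.Basic) -> Node:
    """Convert a SymPy expression into a zss tree (node label = head function/atom)."""
    node = Node(expr.func.__name__ if expr.args else str(expr))
    for arg in expr.args:
        node.addkid(to_tree(arg))
    return node

def tree_size(expr: sp.Basic) -> int:
    """Number of nodes in the expression tree."""
    return 1 + sum(tree_size(arg) for arg in expr.args)

def eed_score(generated: str, ground_truth: str) -> float:
    """Return 100 for symbolic equivalence, otherwise a score that decays
    with the relative tree edit distance (the linear falloff is an assumption)."""
    gen, gt = sp.sympify(generated), sp.sympify(ground_truth)
    if sp.simplify(gen - gt) == 0:            # binary equivalence check first
        return 100.0
    dist = simple_distance(to_tree(gen), to_tree(gt))
    rel = dist / tree_size(gt)                # normalize by ground-truth tree size
    return max(0.0, 100.0 * (1.0 - rel))

# A coefficient error earns partial credit instead of a flat zero:
print(eed_score("m*g*h", "m*g*h/2"))
```

The design point the sketch preserves is that small structural slips (a wrong coefficient, a dropped factor) are penalized in proportion to how much of the expression tree they affect, rather than collapsing to zero.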

Practical Implications and Implementation:

Implementing a system evaluated or trained on PHYBench would involve several practical considerations:

  • Problem Input: The system needs to process physics problems provided as text. This requires robust natural language understanding capabilities to identify physical objects, quantities, initial conditions, and the target variable for which an expression is needed.
  • Physical Perception Module: A core component would be a module capable of building an internal representation of the physical scenario described in the text. This involves spatial reasoning (understanding relative positions and geometry), identifying interactions (forces, fields, energy transfer), and making qualitative judgments about significant effects. This aligns with the paper's concept of "Physical Perception" (PP).
  • Reasoning Engine: A robust reasoning engine is needed to apply relevant physics principles and laws based on the perceived physical scenario. This involves selecting appropriate equations, setting up systems of equations, and performing symbolic manipulation and algebraic problem-solving. This corresponds to the paper's "Robust Reasoning" (RR) phase.
  • Symbolic Computation: The system must be able to handle symbolic mathematical expressions accurately. Integration with symbolic math libraries like SymPy (as used in the EED metric) would be essential for manipulation and simplification. Errors in this phase (RR errors) are a significant challenge highlighted by the benchmark results.
  • Output Formatting: The final output must be a single symbolic expression in correct LaTeX format, enclosed in \boxed{}. Parsing and formatting capabilities are crucial for successful evaluation on PHYBench.
  • Evaluation Pipeline: To evaluate models using PHYBench, an automated pipeline is necessary to:

    1. Feed the problem text to the model.
    2. Extract the LaTeX expression from the model's output (specifically from \boxed{}).
    3. Process the generated expression (e.g., normalize LaTeX, convert to SymPy).
    4. Calculate the binary accuracy and EED Score against the ground truth expression using the described method (SymPy equivalence check and tree edit distance).
    5. Aggregate scores across the dataset.
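Steps 2-4 of this pipeline can be sketched with SymPy's LaTeX parser, as below. The function names (`extract_boxed`, `score_answer`) are illustrative rather than part of any official harness, `parse_latex` requires SymPy's optional antlr4-python3-runtime dependency, and only the binary equivalence check is shown; the EED Score (see the earlier sketch) would be layered on top of it.

```python
# Minimal answer-extraction and equivalence-check sketch (assumes SymPy
# with the optional LaTeX parsing backend installed).
import sympy as sp
from sympy.parsing.latex import parse_latex

def extract_boxed(model_output: str) -> str:
    """Return the contents of the last \\boxed{...}, respecting nested braces."""
    start = model_output.rfind(r"\boxed{")
    if start == -1:
        raise ValueError("no \\boxed{} expression found in model output")
    i = start + len(r"\boxed{")
    begin, depth = i, 1
    while i < len(model_output) and depth > 0:
        depth += {"{": 1, "}": -1}.get(model_output[i], 0)
        i += 1
    return model_output[begin:i - 1]

def score_answer(model_output: str, ground_truth_latex: str) -> int:
    """Binary accuracy: 1 if the boxed expression is symbolically equivalent
    to the ground truth, else 0."""
    gen = parse_latex(extract_boxed(model_output))
    gt = parse_latex(ground_truth_latex)
    return int(sp.simplify(gen - gt) == 0)

output = r"... balancing energy, the landing speed is \boxed{\sqrt{2 g h}}."
print(score_answer(output, r"\sqrt{2 g h}"))  # -> 1
```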

Key Findings and Challenges for Implementation:

The experiments conducted on PHYBench showed that even state-of-the-art LLMs significantly lag behind human experts. The best LLM achieved only 36.9% accuracy and a 49.5 EED Score, compared to human baselines of 61.9% accuracy and a 70.4 EED Score (Qiu et al., 22 Apr 2025). This performance gap highlights significant challenges in building AI systems capable of complex physical reasoning:

  • Physical Perception Limitations: Models struggle to accurately interpret and model realistic physical scenarios, especially those involving complex kinematics or multi-body interactions, leading to fundamental errors early in the reasoning process (PP errors).
  • Robust Reasoning Limitations: LLMs face difficulties maintaining consistency and accuracy during long, intricate symbolic derivations, resulting in the accumulation of errors and divergence from the correct solution (RR errors).
  • Integration of Perception and Reasoning: Successfully solving PHYBench problems requires seamlessly integrating the understanding of the physical situation with the application of mathematical principles. Current models appear to struggle with this integrated process.

The EED Score provides a valuable diagnostic tool during development. A low EED Score with many small differences might indicate issues with robust symbolic manipulation (RR), while a low score due to missing large sub-expressions might point towards a failure to perceive or model a key physical component (PP).

Computational Requirements and Deployment:

Evaluating models on PHYBench requires computational resources for running the LLMs and for the EED calculation pipeline. While the EED calculation itself (based on the extended Zhang-Shasha algorithm) has polynomial time complexity, O(n_1 n_2 d_1 d_2), the bottleneck is typically the LLM inference, especially for large models and tasks requiring potentially long chains of thought to arrive at the solution. Deploying AI systems capable of this level of physical reasoning in real-world applications (e.g., robotics, engineering simulations, scientific discovery) would necessitate efficient inference for complex reasoning processes and potentially integration with external physics engines or symbolic solvers to enhance robustness.

In summary, PHYBench provides a valuable tool for advancing AI capabilities in physical reasoning by offering a challenging, realistic benchmark and a nuanced evaluation metric. The results demonstrate that achieving human-level performance in this domain requires significant progress in both physical perception and robust symbolic reasoning, representing key areas for future research and development efforts in building AI systems for real-world scientific and engineering tasks.

Authors (54)
  1. Shi Qiu (42 papers)
  2. Shaoyang Guo (1 paper)
  3. Zhuo-Yang Song (5 papers)
  4. Yunbo Sun (1 paper)
  5. Zeyu Cai (13 papers)
  6. Jiashen Wei (1 paper)
  7. Tianyu Luo (1 paper)
  8. Yixuan Yin (3 papers)
  9. Haoxu Zhang (2 papers)
  10. Yi Hu (129 papers)
  11. Chenyang Wang (40 papers)
  12. Chencheng Tang (3 papers)
  13. Haoling Chang (1 paper)
  14. Qi Liu (485 papers)
  15. Ziheng Zhou (17 papers)
  16. Tianyu Zhang (110 papers)
  17. Jingtian Zhang (1 paper)
  18. Zhangyi Liu (1 paper)
  19. Minghao Li (44 papers)
  20. Yuku Zhang (1 paper)