- The paper introduces PhySense, a benchmark that evaluates LLMs' ability to solve physics problems through concise, principle-based reasoning.
- It systematically tests 7 state-of-the-art LLMs on 380 novel physics problems using diverse prompting strategies to assess accuracy and token efficiency.
- Findings reveal that even reasoning-optimized models underperform compared to human experts, emphasizing the need for deeper integration of physical principles.
PhySense: Principle-Based Physics Reasoning Benchmarking for LLMs
Introduction
The paper addresses a significant gap in the ability of LLMs to emulate the concise, principle-based reasoning characteristic of human experts, particularly in physics. Despite rapid advances, LLMs often fail to invoke core physical principles for efficient problem-solving, instead generating solutions that are lengthy and opaque. The paper introduces PhySense, a novel benchmark that tests LLMs' capacity to solve physics problems through principle-first reasoning, an approach that is typically straightforward for human physicists but challenging for LLMs.
Figure 1: LLMs produce lengthy, complex reasoning for physics problems that are intuitively straightforward to scientists applying core physical principles.
Benchmark Design and Methodology
Principle-Based Reasoning
PhySense is designed to systematically evaluate LLMs' use of physical principles such as symmetries, conservation laws, and dimensional analysis. Unlike benchmarks that stress domain-specific knowledge or complex calculations, PhySense presents problems that yield quickly to simple principle-based reasoning. The dataset comprises 380 novel physics problems spanning undergraduate to research-level difficulty, all deliberately constructed to require no advanced mathematical techniques or computationally intensive solutions.
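To give a concrete sense of the style of reasoning PhySense targets, consider a textbook dimensional-analysis argument (an illustrative example, not a problem drawn from the dataset):

```latex
% Period of a simple pendulum from dimensional analysis alone.
% The only available quantities are the length L ([L] = m) and the
% gravitational acceleration g ([g] = m s^{-2}); the mass cannot
% enter, since no other quantity carries a mass dimension.
% The unique combination with dimensions of time is
T \sim \sqrt{\frac{L}{g}},
% fixing the answer up to a dimensionless prefactor
% (the exact result is T = 2\pi\sqrt{L/g}) with no differential
% equation ever being solved.
```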
Problem Categories and Models
The benchmark tests LLMs on 19 different problem models that encompass a wide range of physics principles. These include symmetry reasoning in two- and three-dimensional fields, infinite resistive lattice circuits, quantum spin chains, dimensional analysis, and topological phenomena in condensed matter physics. The selection of these models ensures a comprehensive assessment of an LLM's understanding and correct application of fundamental physics principles.
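The infinite resistive lattice category illustrates how symmetry reasoning can collapse an otherwise daunting calculation. A standard result of this kind, given here as context rather than as a problem from the benchmark:

```latex
% Effective resistance between adjacent nodes of an infinite square
% lattice of identical resistors R, via superposition and symmetry.
% Inject current I at node A: by four-fold symmetry, I/4 leaves
% through each of A's four edges. Separately, extract I at the
% adjacent node B: by the same symmetry, I/4 arrives through each
% of B's edges. Superposing the two configurations, the edge AB
% carries I/4 + I/4 = I/2, so
V_{AB} = \frac{I}{2}\,R
\quad\Longrightarrow\quad
R_{\mathrm{eff}} = \frac{V_{AB}}{I} = \frac{R}{2}.
```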
Experimental Setup
Models and Prompting Strategies
The experiment evaluates seven state-of-the-art LLMs, including both reasoning-optimized models and standard models, using three prompting strategies: zero-shot, hint, and no-computation prompts. These different approaches test the LLMs' innate problem-solving abilities, their capacity to utilize explicit guidance, and their ability to prioritize simpler, principle-driven solutions over complex computations.
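The paper's exact prompt wording is not reproduced here; the sketch below only illustrates how the three strategies might be operationalized, with hypothetical template text:

```python
# Hypothetical prompt templates for the three strategies; the exact
# wording used in the PhySense evaluation is an assumption here.
PROMPT_TEMPLATES = {
    "zero_shot": "{problem}",
    "hint": (
        "{problem}\n\n"
        "Hint: identify the core physical principle (a symmetry, a "
        "conservation law, or dimensional analysis) that applies."
    ),
    "no_computation": (
        "{problem}\n\n"
        "Answer using physical principles alone; do not carry out "
        "lengthy symbolic or numerical computation."
    ),
}


def build_prompt(problem: str, strategy: str) -> str:
    """Render one physics problem under the chosen prompting strategy."""
    return PROMPT_TEMPLATES[strategy].format(problem=problem)


if __name__ == "__main__":
    question = "A point charge sits at the center of a grounded sphere..."
    print(build_prompt(question, "no_computation"))
```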

Figure 2: Average accuracy across models.
Results and Analysis
Accuracy and Efficiency
The LLMs' performance is measured by accuracy and token usage, reflecting both the correctness of solutions and the computational resources expended. While reasoning models generally outperform their non-reasoning counterparts, all models fall short of human physicists in both accuracy and efficiency. The results highlight significant deficiencies in token efficiency and in the principled application of physical laws across all evaluated models.
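A minimal sketch of how these two metrics can be computed from per-problem results, assuming each record carries a grading flag and a completion token count (the field names and values are hypothetical):

```python
from statistics import mean

# Toy per-problem records; "correct" stands in for a grading flag and
# "tokens" for the completion token count reported by the model API.
records = [
    {"correct": True, "tokens": 812},
    {"correct": False, "tokens": 2450},
    {"correct": True, "tokens": 530},
]

# Accuracy: fraction of problems answered correctly (bools count as 0/1).
accuracy = mean(r["correct"] for r in records)

# Token efficiency: average tokens spent per problem; lower is better.
avg_tokens = mean(r["tokens"] for r in records)

print(f"accuracy = {accuracy:.2%}, avg tokens = {avg_tokens:.0f}")
```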
Prompt Impact and Model Comparison
The paper reveals minimal improvement in accuracy from auxiliary prompts, suggesting that LLMs' errors arise more from misapplication of physical principles than from a lack of awareness of them. Additionally, reasoning models demonstrate a better, albeit still insufficient, grasp of applying these principles than non-reasoning models, which often default to superficial pattern recognition.
Conclusion and Future Directions
The introduction of PhySense represents a pivotal step in evaluating and improving principle-based reasoning in LLMs. Although reasoning models hold an edge over their non-reasoning counterparts, a substantial gap remains relative to expert human reasoning. The paper suggests that future work should focus on deeper integration of principle-based thinking into LLMs, possibly through supervised fine-tuning or reinforcement learning, and offers critical insights for designing LLMs capable of efficient, robust, and interpretable scientific reasoning, essential for advancing AI's role in scientific discovery.