Reliability of AI reasoning models for physics problem solving

Determine whether large language model (LLM) reasoning systems, such as OpenAI’s o3-mini, can be considered reliable for physics problem solving under the study’s working definition of reliability: repeatedly producing correct answers across introductory physics story problems and topics.

Background

The paper evaluates OpenAI’s o3-mini on 408 text-only end-of-chapter problems from Halliday and Resnick’s Fundamentals of Physics Vol. 1, sampling five solutions per problem and counting a problem as solved only if all five runs produce the correct textbook answer. The model achieves an overall success rate of 94%, with notably lower performance on later chapters (waves and thermodynamics).
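The all-runs-correct scoring rule can be sketched in Python. This is a minimal illustration of the protocol described above; the function names, numeric-comparison approach, and 1% tolerance are assumptions for the sketch, not the paper's actual implementation:

```python
from math import isclose

def solved(run_answers, textbook_answer, rel_tol=1e-2):
    """A problem counts as solved only if ALL sampled runs match the
    textbook answer (tolerance is an illustrative assumption)."""
    return all(isclose(a, textbook_answer, rel_tol=rel_tol)
               for a in run_answers)

def success_rate(results):
    """Fraction of problems whose runs all produced the correct answer.
    `results` is a list of (run_answers, textbook_answer) pairs."""
    return sum(solved(runs, ans) for runs, ans in results) / len(results)

# Hypothetical example: two problems, five sampled runs each.
example = [
    ([9.8, 9.81, 9.8, 9.79, 9.8], 9.8),  # all five within tolerance -> solved
    ([3.2, 3.2, 4.0, 3.2, 3.2], 3.2),    # one run wrong -> not solved
]
```

Under this strict criterion a single incorrect run among the five marks the whole problem as unsolved, which makes the reported 94% success rate a repeatability measure rather than a best-of-N measure.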

In discussing error sources, the authors identify two inherent limitations: the lack of mechanisms for evaluating intermediate reasoning steps (e.g., via physics simulation) and susceptibility to calculation and rounding mistakes (e.g., because the model has no access to external mathematical computation). They note that these might be mitigated by augmenting the model with dedicated tools, but emphasize that such comprehensive augmentation has not yet been undertaken, leaving the broader question of model reliability unresolved.

References

Until then, the question of reasoning models' reliability for the purposes of problem solving will remain open.

AI Reasoning Models for Problem Solving in Physics (2508.20941 - Bralin et al., 28 Aug 2025) in Discussion and Conclusions