Theoretical Physics Benchmark (TPBench)

Updated 14 October 2025
  • TPBench is a benchmark that evaluates AI’s reasoning, calculation, and problem-solving skills in theoretical physics.
  • It features fifty-seven problems across five difficulty levels, covering areas from high-energy physics to quantum mechanics.
  • The benchmark employs automated verification and holistic grading to provide precise, actionable feedback on model performance.

The Theoretical Physics Benchmark (TPBench) refers to a structured set of problems, datasets, and evaluation protocols designed to measure the ability of AI systems—particularly LLMs—to perform reasoning, calculation, and problem solving in theoretical physics. The initial TPBench dataset is curated to target high-energy theory and cosmology, spans fifty-seven problems across five rigorous difficulty levels, and implements both automated verification and holistic grading to interrogate models’ real reasoning capabilities. TPBench thus provides a reference platform for monitoring, comparing, and accelerating progress in AI-assisted theoretical physics research.

1. Benchmark Scope, Problem Diversity, and Difficulty Levels

TPBench is constructed to probe a broad range of theoretical physics competency. Its problems range from advanced undergraduate calculations to unsolved research-level steps. Domains covered include high-energy physics, general relativity, electromagnetism, astrophysics, quantum mechanics, and statistical mechanics. The dataset’s five-level difficulty system is designed so that:

  • Levels 1–2 correspond to standard undergraduate or graduate exercises.
  • Levels 3–5 progressively increase abstraction, requiring novel derivations or the solution of open-ended research questions.

Problems are intentionally novel and avoid overlap with public repositories in order to minimize training data leakage and to ensure that model performance reflects genuine reasoning instead of memorization or pattern recognition.

2. Evaluation Protocols and Scoring

TPBench employs two complementary strategies to evaluate AI models:

  • Auto-Verifiable Answer-Only Protocol: Each problem specifies a standardized Python function signature, and the model must produce a function that computes the final answer. The model-generated code is executed against multiple test cases to assess correctness numerically or, where applicable, symbolically.
  • Holistic Grading: Solutions are graded using an AI-based system that mimics partial credit assignment, emphasizing logical flow and correctness of intermediate steps. Letter grades (A–D) reflect solution completeness and reasoning quality.

Both approaches allow for the quantitative monitoring of model progress and facilitate a detailed diagnosis of model failure modes. Technical limitations are observed when verifying answers involving tensor algebra, derivatives, or integrals—domains with high representational ambiguity and where many equivalent forms exist.
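
For concreteness, the answer-only track can be pictured as a small harness that loads the model-generated function and runs it over reference test cases. The sketch below is a minimal illustration under assumed conventions (the function name `verify_answer_only`, the test-case format, and the tolerance are not part of TPBench's published interface):

```python
import math

def verify_answer_only(candidate_fn, test_cases, rel_tol=1e-6):
    """Check a model-generated answer function against reference test cases.

    candidate_fn -- the Python function extracted from the model's output
    test_cases   -- iterable of (kwargs, expected) pairs shipped with the problem
    rel_tol      -- relative tolerance for the numerical comparison
    """
    for kwargs, expected in test_cases:
        try:
            result = candidate_fn(**kwargs)
        except Exception:
            return False  # crashes or malformed outputs count as failures
        if not math.isclose(result, expected, rel_tol=rel_tol):
            return False
    return True

# Illustrative use with a toy problem whose reference answer is x**2 + 1.
def model_answer(x):
    return x * x + 1

print(verify_answer_only(model_answer, [({"x": 2.0}, 5.0), ({"x": -3.0}, 10.0)]))  # True
```

Symbolic answers are harder to compare in this way; the equivalence issue is taken up in Sections 5 and 6.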

3. Empirical Performance on TPBench

Results reveal persistent gaps in AI performance at higher difficulty levels. On levels 1–2, state-of-the-art models (GPT-4o, DeepSeek-R1, Llama, Qwen) attain near-perfect accuracy (95–100%). Level 3 sees a decline to approximately 80%. At level 4 (challenging graduate level), models achieve ~50%, and at level 5 (research level), performance drops further to 15%. These rates are measured both by average scores over multiple attempts and by best-of-five accuracy.

Key obstacles include:

  • Algebraic errors, such as sign mistakes and dropped factors (e.g., missing factors of i)
  • Logical mistakes, such as misapplied expansions or incorrect identification of integration domains
  • Propensity to "retrieve" literature-style answers without stepwise derivation
  • Hallucination of spurious rules

These observations imply that current LLMs excel at routine symbolic manipulations but remain unreliable for sustained, multi-stage derivations characteristic of advanced research.

4. Verification, Data Leakage Control, and Ongoing Dataset Management

To mitigate the risks of overfitting and data leakage, only a subset of problems is made publicly available (10 samples with broad coverage). Approximately half of the problems are held in reserve, and full dataset access requires direct contact with the TPBench authors. This protocol helps maintain the integrity of TPBench as a reference for genuinely new inference tasks.

The TPBench website (http://tpbench.org) serves as a hub for dataset releases, updated model scoreboards, and problem/solution contributions. The resource is maintained as a “living benchmark,” with regular updates and invitations for the community to contribute new problems and refinements.

5. Challenges and Prospective Strategies

Several technical hurdles limit current AI utility in theoretical physics:

  • Precise symbolic manipulation: Existing LLMs struggle to maintain algebraic rigor in long derivations and lack error-correction mechanisms for minor miscalculations.
  • Verification of complex expressions: Auto-verification for non-algebraic answers (especially tensors, integrals) is often nontrivial due to form ambiguity.
  • Integration with symbolic algebra tools: Deeper coupling with packages like SymPy or Mathematica is pursued as a means to enhance reasoning and error correction during inference.

Proposed strategies include:

  • Robust symbolic verification mechanisms to address equivalence in complex answer formats (a minimal sketch follows this list).
  • Reinforcement learning and chain-of-thought prompting schemes to foster improved stepwise reasoning.
  • Uncertainty quantification heuristics to guide grading and indicate loci of model uncertainty in intermediate results.
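
As an illustration of the first of these strategies, a symbolic verifier could attempt exact simplification of the difference between a model's answer and the reference, falling back to random numerical sampling when simplification is inconclusive. The sketch below, built on SymPy, is a hedged example of one such mechanism rather than TPBench's own verifier; the function name and sampling scheme are assumptions:

```python
import random
import sympy as sp

def symbolically_equivalent(candidate, reference, symbols, n_samples=20, tol=1e-9):
    """Heuristic equivalence check between two SymPy expressions.

    First try to simplify the difference to zero exactly; if that is
    inconclusive, compare the expressions at random numerical points.
    """
    if sp.simplify(candidate - reference) == 0:
        return True
    for _ in range(n_samples):
        point = {s: random.uniform(0.5, 2.0) for s in symbols}
        a = float(candidate.evalf(subs=point))
        b = float(reference.evalf(subs=point))
        if abs(a - b) > tol * max(1.0, abs(b)):
            return False
    return True

# Example: two equivalent forms of the same expression.
x = sp.symbols("x", positive=True)
print(symbolically_equivalent(sp.sin(x)**2, 1 - sp.cos(x)**2, [x]))  # True
```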

6. Mathematical and Theoretical Exemplars

Problems and solutions in TPBench frequently involve high-level mathematical constructs. Examples include:

  • Final formula for the angular frequency of scattered photons:

$$\omega = \frac{1}{\frac{\hbar}{E} + \frac{\hbar}{m c^2}(1-\cos\theta)}$$

  • Derivations of tensorial objects, such as Riemann curvature tensors, and application of steepest-descent methods for contour integrals

These exemplars illustrate the required degree of symbolic sophistication and the associated challenges in equivalence verification, especially for expressions admitting multiple equivalent forms.
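
As a concrete instance of the equivalence problem, the scattered-photon formula above can be rewritten as $\omega = \frac{E\,m c^2}{\hbar\,(m c^2 + E(1-\cos\theta))}$, and a verifier must recognize both forms as the same answer. The short check below, again using SymPy, is an illustrative sketch rather than part of the benchmark's tooling:

```python
import sympy as sp

E, m, c, hbar, theta = sp.symbols("E m c hbar theta", positive=True)

# Form quoted above.
omega_1 = 1 / (hbar / E + hbar / (m * c**2) * (1 - sp.cos(theta)))

# Algebraically rearranged form a model might return instead.
omega_2 = E * m * c**2 / (hbar * (m * c**2 + E * (1 - sp.cos(theta))))

# The difference simplifies to zero, so both forms are the same answer.
print(sp.simplify(omega_1 - omega_2))  # 0
```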

7. Impact and Future Directions

TPBench functions as both an evaluation framework and an instrument for identifying the shortcomings of existing AI approaches in theoretical physics. By exposing errors in algebraic and logical reasoning and by providing granular breakdowns of model strengths and weaknesses, TPBench directs research attention toward the key obstacles in AI-assisted scientific work. The benchmark’s ongoing expansion and openness to community contributions suggest its centrality as a reference for future progress towards reliable AI research assistants in theoretical physics. The benchmarking approach, combining automated verification and holistic grading, is positioned to become a standard for monitoring both incremental and qualitative developments in AI-for-Science.
