TheoremQA-T Benchmark
- TheoremQA-T is a theorem-driven question-answering benchmark designed to evaluate LLMs on applying formal theorems across multiple scientific and engineering domains.
- It features 800 expert-curated questions tied to 354 theorems from mathematics, physics, EE&CS, and finance, utilizing chain-of-thought and program-of-thought approaches.
- The benchmark also serves as a testbed for RL-based hierarchical proof decomposition, which complements prompting strategies and boosts overall scientific reasoning performance.
TheoremQA-T is a theorem-driven question-answering benchmark that rigorously evaluates the capacity of LLMs to apply formal theorems to solve complex science and mathematics problems. Spanning 800 expert-curated questions across mathematics, physics, electrical engineering, computer science, and finance, it tests models on their ability to invoke core theorems in explicit multi-step reasoning, with answer formats chosen for automatic evaluation and straightforward normalization. The benchmark functions both as a dataset and as a framework for empirical assessment of prompting strategies, algorithmic decomposition, and RL-based training for theorem-centric scientific reasoning (Chen et al., 2023, Dong et al., 4 Nov 2024).
1. Dataset Structure and Taxonomy
TheoremQA-T consists of (question, theorem, answer) triples: each question is a university-level problem requiring explicit use of a specified theorem, and each answer takes a simple form (integer, float, list, boolean, or multiple choice). The 354 theorems are classified into:
- Mathematics (199 theorems): Encompassing calculus (Taylor, Stokes, Divergence), algebra (Sylow, Lagrange), number theory (Fermat, CRT), combinatorics, probability, topology, optimization.
- Physics (52 theorems): Classical mechanics (Noether), electromagnetism (Maxwell), quantum mechanics, thermodynamics.
- EE&CS (48 theorems): Signal processing (Nyquist–Shannon), coding theory (Huffman), algorithms, control theory.
- Finance (55 theorems): Asset pricing (Black–Scholes), risk, annuities.
Questions are drawn from textbook sources and expert rewriting, focusing on problems where a named theorem must be applied in a nontrivial solution pipeline. Multimodal (image-based) questions account for 6% of the benchmark. No fixed train/test split is mandated; a typical practice is a random 80/10/10 split for reproducibility (Chen et al., 2023).
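For concreteness, the following hypothetical record illustrates the (question, theorem, answer) triple structure; the field names and the specific problem are illustrative assumptions, not the dataset's actual schema or contents.

```python
# Hypothetical TheoremQA-T style record (illustrative only; not the real schema).
from math import comb

example_record = {
    "question": "A fair coin is flipped 10 times. Using the binomial "
                "distribution, what is the probability of exactly 6 heads?",
    "theorem": "Binomial Distribution",
    "field": "Mathematics",
    "answer_type": "float",
    "answer": round(comb(10, 6) * 0.5 ** 10, 3),   # 210 / 1024 ≈ 0.205
}
```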
2. Theorem Statement Encoding and Example Problems
Every question in TheoremQA-T is anchored to a canonical theorem with a LaTeX-formatted statement. Examples include:
- Taylor’s Theorem: For $f$ that is $(n{+}1)$-times differentiable on an interval containing $a$ and $x$,
  $$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{n+1}$$
  for some $\xi$ between $a$ and $x$.
- Lagrange’s Theorem (Group Order): If $H$ is a subgroup of a finite group $G$, then $|H|$ divides $|G|$.
- Huffman Coding Theorem: $H(X) \le \bar{L} < H(X) + 1$ for the expected length $\bar{L}$ of an optimal prefix-free code on symbol set $\mathcal{X}$.
Example problem walkthroughs illustrate mapping the natural language question to formal theorem invocation and extracting an answer in a simple numeric format. For instance, using Stokes’ theorem to evaluate the circulation of a vector field, or Wiener process theorems for covariance computations.
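As a concrete instance of the latter, the standard Wiener-process covariance identity reduces such a question to a one-line computation (a minimal worked example, not drawn from the benchmark):

$$\operatorname{Cov}(W_s, W_t) = \mathbb{E}[W_s W_t] = \min(s, t), \qquad \text{e.g.}\ \operatorname{Cov}(W_2, W_5) = 2.$$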
3. Evaluation Protocols and Prompting Strategies
Evaluation proceeds via two principal modes:
- Chain-of-Thought (CoT): The model generates stepwise natural language reasoning culminating in a direct answer.
- Program-of-Thought (PoT): The model outputs an executable Python program implementing the solution pipeline, which is sandboxed to compute the final answer.
Post-processing includes span extraction and normalization through external APIs (e.g., WolframAlpha), ensuring strict exact-match scoring. Four main metrics are reported: overall accuracy, field-wise accuracy, type-wise accuracy, and program executability (for PoT).
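A minimal sketch of this PoT execution-and-normalization pipeline is shown below; the question, the generated program, and the rounding-based normalization are illustrative assumptions rather than the benchmark's exact harness.

```python
# Illustrative PoT-style evaluation: the model is assumed to emit a
# self-contained Python program that assigns its result to `answer`; the
# harness executes it in isolation, normalizes, and scores by exact match.
import math

generated_program = """
# Hypothetical model output for: "Using the compound-interest formula, what is
# the value of $1,000 after 2 years at 5% annual interest, compounded annually?"
principal, rate, years = 1000.0, 0.05, 2
answer = principal * (1 + rate) ** years
"""

def run_pot(program: str) -> float:
    """Execute the generated program in a restricted namespace and return `answer`."""
    namespace = {"math": math}   # expose only what the program may need
    exec(program, namespace)     # a real harness would sandbox this call
    return namespace["answer"]

def normalize(x: float, digits: int = 2) -> float:
    """Round to fixed precision so exact-match scoring tolerates float noise."""
    return round(float(x), digits)

prediction = normalize(run_pot(generated_program))   # 1102.5
gold = 1102.5
print("correct" if prediction == gold else "incorrect")
```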
| Model (PoT accuracy, %) | Math | CS/EE | Physics | Finance | Overall |
|---|---|---|---|---|---|
| GPT-4 | 52.0 | 51.4 | 45.8 | 66.7 | 52.4 |
| ChatGPT | 35.7 | 35.6 | 26.7 | 49.4 | 35.6 |
| Open-source LMs | <15 | <28 | <6 | <13 | <15 |
| Random Baseline | 10.0 | 24.7 | 0.0 | 4.9 | 10.5 |
GPT-4 achieves top accuracy with PoT prompting (52.4%), while open-source models perform marginally above random. Finance questions are the easiest; physics the hardest. Error breakdown attributes most GPT-4 failures to minor calculation mistakes, while open models largely fail due to lack of theorem knowledge (Chen et al., 2023).
4. RL-Based Proof Decomposition and Hierarchical Reasoning
Recent advancements (Dong et al., 4 Nov 2024) propose RL frameworks that enhance LLM performance on theorem-solving tasks by rewarding explicit hierarchical decomposition:
- Conditional Proofs: Models can insert lemma proposals using explicit tokens (e.g., `<invoke>…</invoke>`), triggering recursive proof-tree construction.
- Hierarchical Reward Design: Partial progress is rewarded; any correct sub-goal or lemma earns positive reinforcement even if the full theorem is not solved (see the sketch below).
- Value Function: A learned value function over proof states guides exploratory lemma proposals toward promising directions.
- Hindsight Augmentation: Novel, correct lemmas found during RL are recycled into the training buffer and augmented as ground-truth, massively expanding the effective sample space (with 37.7% of buffer lemmas being novel).
- Implementation: The policy is trained via REINFORCE on weighted samples, minimizing a reward-weighted negative log-likelihood over sampled proofs and proposed lemmas.
Empirically, RL with product-decomposition and lemma rewards (ProD-RL) improves pass@16 from 40.8% (SFT) to 45.5% on held-out AFP and from 36.5% to 39.5% out-of-distribution (Dong et al., 4 Nov 2024).
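To make the reward and hindsight mechanisms concrete, the following Python sketch shows one way partial credit and lemma recycling could be computed over a proof tree; the data structure, bonus values, and function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative hierarchical reward over a proof tree: any proved sub-lemma earns
# partial credit even if the root theorem is unproven, and proved lemmas are
# collected into a replay buffer for hindsight augmentation. (Sketch only.)
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProofNode:
    statement: str                                   # lemma or theorem statement
    proved: bool                                     # accepted by the verifier?
    children: List["ProofNode"] = field(default_factory=list)

def hierarchical_reward(root: ProofNode,
                        root_bonus: float = 1.0,
                        lemma_bonus: float = 0.2) -> float:
    """Full credit if the root is proved, plus partial credit per proved lemma."""
    reward = root_bonus if root.proved else 0.0
    stack = list(root.children)
    while stack:
        node = stack.pop()
        if node.proved:
            reward += lemma_bonus
        stack.extend(node.children)
    return reward

def hindsight_lemmas(root: ProofNode) -> List[str]:
    """Collect proved lemma statements to recycle as new training targets."""
    proved, stack = [], list(root.children)
    while stack:
        node = stack.pop()
        if node.proved:
            proved.append(node.statement)
        stack.extend(node.children)
    return proved
```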
5. Implications and Adaptation for TheoremQA-T
Applying RL-based hierarchical proof strategies to TheoremQA-T suggests incorporating:
- Intermediate Sub-goal Tokens: Allow models to explicitly propose and pursue lemma-like sub-questions in both natural language and formalized forms.
- Partial Success Rewards: Structure data generation, training, and evaluation to reward any correct sub-step (not just final answers).
- Hybrid Verification: Substitute full formal verification (e.g., Isabelle) with lightweight, automatic checkers or neural verifiers for natural language sub-questions.
- Depth Limiting: Impose a maximum proof-tree depth to manage recursion in problem decomposition for complex theorem-based tasks (a sketch follows this list).
- Replay Buffer Expansion: Retain novel lemma solutions for future sample augmentation, facilitating continual learning.
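A minimal sketch of the depth-limited decomposition loop implied by these points follows; `propose_subquestions` and `answer_directly` are hypothetical placeholders for model calls, not part of any existing API.

```python
# Illustrative depth-limited sub-goal decomposition for a TheoremQA-T question.
# Both helper functions are stubs standing in for LLM calls. (Sketch only.)
from typing import List, Optional

MAX_DEPTH = 3   # cap the proof-tree depth to keep recursion tractable

def propose_subquestions(question: str) -> List[str]:
    """Placeholder: ask the model for lemma-like sub-questions."""
    return []

def answer_directly(question: str) -> Optional[str]:
    """Placeholder: ask the model for a direct answer, or None if it abstains."""
    return None

def solve(question: str, depth: int = 0) -> Optional[str]:
    """Try a direct answer first; otherwise recurse on sub-questions up to MAX_DEPTH."""
    direct = answer_directly(question)
    if direct is not None or depth >= MAX_DEPTH:
        return direct
    sub_answers = [solve(sub, depth + 1) for sub in propose_subquestions(question)]
    # Re-prompt with solved sub-goals attached as known partial results.
    return answer_directly(f"{question}\nKnown sub-results: {sub_answers}")
```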
A plausible implication is that these methods, when adapted, can substantially enhance LLM pass rates on TheoremQA-T, particularly for open-source models currently bottlenecked by knowledge and reasoning granularity (Dong et al., 4 Nov 2024). Integration of structured RL with hierarchical decomposition transforms TheoremQA-T evaluations from atomic answer scoring to multi-step scientific reasoning with explicit theorem invocation.
6. Directions for Research and Benchmark Evolution
TheoremQA-T establishes best practices for rigorous evaluation, favoring Program-of-Thought prompting with normalization for robust automated scoring. Recommendations for researchers include reporting both CoT and PoT performance, full type/category breakdowns, and executability statistics. Promising extensions encompass:
- Symbolic solutions (formulas, matrices) beyond simple scalar answers.
- Deep integration of symbolic theorem statements in prompts.
- Advanced handling of multimodal (diagram) questions.
- Science-focused pretraining for open-source models to narrow the gap with closed-source leaders.
- Dynamic, multi-theorem reasoning or retrieval-based decomposition.
Ongoing work challenges models to generalize not only across domains but also across problem formats and reasoning types, making TheoremQA-T a principled arena for architectural, algorithmic, and dataset advances in theorem-driven question answering (Chen et al., 2023, Dong et al., 4 Nov 2024).