
TheoremQA-T Benchmark

Updated 3 December 2025
  • TheoremQA-T is a theorem-driven question-answering benchmark designed to evaluate LLMs on applying formal theorems across multiple scientific and engineering domains.
  • It features 800 expert-curated questions tied to 354 theorems from mathematics, physics, EE&CS, and finance, evaluated under both chain-of-thought and program-of-thought prompting.
  • It also serves as a testbed for RL-based hierarchical proof decomposition, which rewards partial sub-goal progress to strengthen theorem-centric scientific reasoning.

TheoremQA-T is a theorem-driven question-answering benchmark that rigorously evaluates the capacity of LLMs to apply formal theorems to solve complex science and mathematics problems. Spanning 800 expert-curated questions across mathematics, physics, electrical engineering, computer science, and finance, it tests models on their ability to invoke core theorems in explicit multi-step reasoning, with answer formats chosen for automatic evaluation and straightforward normalization. The benchmark functions both as a dataset and as a framework for empirical assessment of prompting strategies, algorithmic decomposition, and RL-based training for theorem-centric scientific reasoning (Chen et al., 2023; Dong et al., 4 Nov 2024).

1. Dataset Structure and Taxonomy

TheoremQA-T consists of triples $\mathcal{T} = \{(q_i, a_i, t_i) : i = 1, \ldots, 800\}$, where each $q_i$ is a university-level question requiring explicit use of a specified theorem $t_i$, and $a_i$ is a simple answer (integer, float, list, boolean, or MCQ option). The 354 theorems are classified into:

  • Mathematics (199 theorems): Encompassing calculus (Taylor, Stokes, Divergence), algebra (Sylow, Lagrange), number theory (Fermat, CRT), combinatorics, probability, topology, optimization.
  • Physics (52 theorems): Classical mechanics (Noether), electromagnetism (Maxwell), quantum mechanics, thermodynamics.
  • EE&CS (48 theorems): Signal processing (Nyquist–Shannon), coding theory (Huffman), algorithms, control theory.
  • Finance (55 theorems): Asset pricing (Black–Scholes), risk, annuities.

Questions are drawn from textbook sources and expert rewriting, focusing on problems where a named theorem must be applied in a nontrivial solution pipeline. Multimodal (image-based) questions account for about 6% of the benchmark. No fixed train/test split is mandated; a typical practice is a random 80/10/10 split for reproducibility (Chen et al., 2023).
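
For concreteness, here is a minimal sketch of how one $(q_i, a_i, t_i)$ triple might be represented in code; the field names are illustrative assumptions for this article, not an official schema.

```python
from dataclasses import dataclass
from typing import List, Union

# One (q_i, a_i, t_i) triple; the field names here are assumptions
# for illustration, not the official TheoremQA-T schema.
@dataclass
class TheoremQAItem:
    question: str                               # q_i: university-level problem statement
    answer: Union[int, float, bool, List, str]  # a_i: integer, float, list, boolean, or MCQ option
    theorem: str                                # t_i: named theorem the solution must invoke
    field: str                                  # "Math", "Physics", "EE&CS", or "Finance"

item = TheoremQAItem(
    question="Find x with x = 2 (mod 3) and x = 3 (mod 5), 0 <= x < 15.",
    answer=8,
    theorem="Chinese Remainder Theorem",
    field="Math",
)
```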

2. Theorem Statement Encoding and Example Problems

Every question in TheoremQA-T is anchored to a canonical theorem with a LaTeX-formatted statement. Examples include:

  • Taylor’s Theorem: For $f$ that is $(n+1)$-times differentiable on $[a,b]$,

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{n+1}$$

for some $\xi \in (a,x)$.

  • Lagrange’s Theorem (Group Order):

$$|G| = [G : H] \times |H|, \quad [G : H] = |\{gH : g \in G\}|$$

  • Huffman Coding Theorem:

$$L_{\min} = \sum_i p_i \ell_i$$

for an optimal prefix-free code on symbol set $\mathcal{S}$, where $p_i$ and $\ell_i$ are the probability and codeword length of symbol $i$.

Example problem walkthroughs illustrate how a natural-language question is mapped to a formal theorem invocation and how the answer is extracted in a simple numeric format: for instance, using Stokes’ theorem to evaluate the circulation of a vector field, or Wiener-process theorems for covariance computations.
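
As a concrete instance of the Huffman coding theorem above, the following sketch builds optimal prefix-free codeword lengths with the standard heap-based merge and evaluates $L_{\min} = \sum_i p_i \ell_i$; the probability values are made up for illustration.

```python
import heapq

def huffman_lengths(probs):
    """Return optimal prefix-free codeword lengths via the standard
    heap-based Huffman construction (repeatedly merge the two smallest weights)."""
    # Each heap entry: (subtree probability, unique tiebreaker, symbol indices in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, t2, s2 = heapq.heappop(heap)
        for sym in s1 + s2:          # every symbol under the merge gains one bit
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, t2, s1 + s2))
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]         # illustrative source distribution
lengths = huffman_lengths(probs)
L_min = sum(p * l for p, l in zip(probs, lengths))
print(lengths, L_min)                # [1, 2, 3, 3] -> L_min = 1.9
```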

3. Evaluation Protocols and Prompting Strategies

Evaluation proceeds via two principal modes:

  • Chain-of-Thought (CoT): The model generates stepwise natural language reasoning culminating in a direct answer.
  • Program-of-Thought (PoT): The model outputs an executable Python program implementing the solution pipeline, which is sandboxed to compute the final answer.

Post-processing includes span extraction and normalization through external APIs (e.g., WolframAlpha), ensuring strict exact-match scoring. Four main metrics are reported: overall accuracy, field-wise accuracy, type-wise accuracy, and program executability (for PoT).
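
Below is a minimal sketch of PoT scoring under stated assumptions: the generated program is expected to assign its result to a variable `answer`, execution is time-boxed in a subprocess as a stand-in for a real sandbox, and floats are compared with a relative tolerance rather than the WolframAlpha-backed normalization used in the paper.

```python
import math
import multiprocessing

def _run(program: str, queue) -> None:
    # Execute the model-generated program in a fresh namespace; by the
    # convention assumed here, the program assigns its result to `answer`.
    ns = {}
    try:
        exec(program, ns)
        queue.put(("ok", ns.get("answer")))
    except Exception as exc:
        queue.put(("error", repr(exc)))

def score_pot(program: str, gold, timeout: float = 5.0, rel_tol: float = 1e-2):
    """Return (executable, correct) for one Program-of-Thought sample."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(program, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():                  # runaway program: kill and count as non-executable
        proc.terminate()
        return False, False
    if queue.empty():                    # child died before reporting a result
        return False, False
    status, pred = queue.get()
    if status != "ok":
        return False, False
    if isinstance(gold, float):          # tolerant numeric match, not strict string equality
        correct = isinstance(pred, (int, float)) and math.isclose(pred, gold, rel_tol=rel_tol)
    else:
        correct = pred == gold
    return True, correct

if __name__ == "__main__":
    ok, right = score_pot("import math\nanswer = round(math.e ** 2, 2)", 7.39)
    print(ok, right)                     # True True
```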

Accuracy (%) by field under PoT prompting:

Model              Math   CS/EE   Physics   Finance   Overall
GPT-4              52.0   51.4    45.8      66.7      52.4
ChatGPT            35.7   35.6    26.7      49.4      35.6
Open-source LMs    <15    <28     <6        <13       <15
Random baseline    10.0   24.7    0.0       4.9       10.5

GPT-4 achieves top accuracy with PoT prompting (52.4%), while open-source models perform marginally above random. Finance questions are the easiest; physics the hardest. Error breakdown attributes most GPT-4 failures to minor calculation mistakes, while open models largely fail due to lack of theorem knowledge (Chen et al., 2023).

4. RL-Based Proof Decomposition and Hierarchical Reasoning

Recent advancements (Dong et al., 4 Nov 2024) propose RL frameworks that enhance LLM performance on theorem-solving tasks by rewarding explicit hierarchical decomposition:

  • Conditional Proofs: Models can insert lemma proposals using explicit tokens (e.g., <invoke>…</invoke>), triggering recursive proof-tree construction.
  • Hierarchical Reward Design: Partial progress is rewarded; any correct sub-goal or lemma earns positive reinforcement even if the full theorem is not solved.
  • Value Function: $V_\phi(c, s) \approx \mathbb{E}_{\pi_\theta}[r(c, s, \cdot)]$ guides exploratory lemma proposals toward promising directions.
  • Hindsight Augmentation: Novel, correct lemmas found during RL are recycled into the training buffer and augmented as ground-truth, massively expanding the effective sample space (with 37.7% of buffer lemmas being novel).
  • Implementation: The policy is trained via REINFORCE on weighted samples, minimizing:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y, w)}\left[ w \cdot \log \pi_\theta(y \mid x) \right]$$
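
Below is a minimal PyTorch sketch of this weighted REINFORCE update, assuming token-level logits from the policy and per-sample weights $w$ derived from the hierarchical rewards; it is an illustrative reduction, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def weighted_reinforce_loss(logits, target_ids, weights, pad_id=0):
    """L(theta) = -E[w * log pi_theta(y | x)] over a batch of sampled proofs.

    logits:     (batch, seq_len, vocab) policy outputs for the sampled sequences
    target_ids: (batch, seq_len) sampled token ids y
    weights:    (batch,) reward-derived sample weights w (e.g., hierarchical lemma rewards)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # log pi_theta(y | x): gather the log-prob of each sampled token
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    mask = (target_ids != pad_id).float()          # ignore padding positions
    seq_logp = (token_logp * mask).sum(dim=-1)     # token sum = sequence log-probability
    return -(weights * seq_logp).mean()            # negative weighted expectation

# Toy usage with random tensors standing in for a policy forward pass
logits = torch.randn(2, 5, 100, requires_grad=True)
targets = torch.randint(1, 100, (2, 5))
w = torch.tensor([1.0, 0.5])                       # e.g., full proof vs. partial lemma credit
loss = weighted_reinforce_loss(logits, targets, w)
loss.backward()
```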

Empirically, RL with hierarchical proof decomposition and lemma rewards (ProD-RL) improves pass@16 from 40.8% (SFT) to 45.5% on held-out AFP theorems and from 36.5% to 39.5% out of distribution (Dong et al., 4 Nov 2024).

5. Implications and Adaptation for TheoremQA-T

Applying RL-based hierarchical proof strategies to TheoremQA-T suggests incorporating:

  • Intermediate Sub-goal Tokens: Allow models to explicitly propose and pursue lemma-like sub-questions in both natural language and formalized forms (see the sketch after this list).
  • Partial Success Rewards: Structure data generation, training, and evaluation to reward any correct sub-step (not just final answers).
  • Hybrid Verification: Substitute full formal verification (e.g., Isabelle) with lightweight, automatic checkers or neural verifiers for natural language sub-questions.
  • Depth Limiting: Impose maximum proof-tree depth to manage recursion in problem decomposition for complex theorem-based tasks.
  • Replay Buffer Expansion: Retain novel lemma solutions for future sample augmentation, facilitating continual learning.
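
A minimal sketch of the depth-limited, partially rewarded decomposition loop described above, assuming hypothetical stubs `propose_subgoals` (model-proposed sub-questions) and `try_solve` (a lightweight checker); the reward constants are arbitrary.

```python
def solve(question, propose_subgoals, try_solve, depth=0, max_depth=3):
    """Return (answer, reward): full credit for a solved question,
    partial credit for correctly solved sub-goals."""
    answer = try_solve(question)
    if answer is not None:
        return answer, 1.0                       # full success reward
    if depth >= max_depth:
        return None, 0.0                         # recursion cap reached
    reward = 0.0
    for sub in propose_subgoals(question):       # model-proposed lemma-like sub-questions
        _, sub_reward = solve(sub, propose_subgoals, try_solve, depth + 1, max_depth)
        reward += 0.1 * sub_reward               # partial credit per solved sub-goal
    return None, min(reward, 0.9)                # partial success never exceeds full credit

# Toy usage with trivial stubs: only the sub-goal "base case" is solvable.
answer, reward = solve(
    "main goal",
    propose_subgoals=lambda q: ["base case"] if q == "main goal" else [],
    try_solve=lambda q: 42 if q == "base case" else None,
)
print(answer, reward)   # None 0.1  (partial credit for the solved sub-goal)
```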

A plausible implication is that these methods, when adapted, can substantially enhance LLM pass rates on TheoremQA-T, particularly for open-source models currently bottlenecked by knowledge and reasoning granularity (Dong et al., 4 Nov 2024). Integration of structured RL with hierarchical decomposition transforms TheoremQA-T evaluations from atomic answer scoring to multi-step scientific reasoning with explicit theorem invocation.

6. Directions for Research and Benchmark Evolution

TheoremQA-T establishes best practices for rigorous evaluation, favoring Program-of-Thought prompting with normalization for robust automated scoring. Recommendations for researchers include reporting both CoT and PoT performance, full type/category breakdowns, and executability statistics. Promising extensions encompass:

  • Symbolic solutions (formulas, matrices) beyond simple scalar answers (see the checker sketch after this list).
  • Deep integration of symbolic theorem statements in prompts.
  • Advanced handling of multimodal (diagram) questions.
  • Science-focused pretraining for open-source models to narrow the gap with closed-source leaders.
  • Dynamic, multi-theorem reasoning or retrieval-based decomposition.
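
For the first extension above, here is a minimal sketch of how symbolic answers could be graded automatically with sympy in place of exact string matching; the simplification-based equivalence test and function name are assumptions of this illustration, not part of the benchmark's tooling.

```python
import sympy as sp

def symbolically_equal(pred: str, gold: str) -> bool:
    """Check whether two formula strings denote the same expression,
    e.g. 'sin(x)**2' vs '1 - cos(x)**2'."""
    x = sp.symbols("x")
    try:
        # Equivalent expressions have a difference that simplifies to zero.
        diff = sp.simplify(sp.sympify(pred, locals={"x": x})
                           - sp.sympify(gold, locals={"x": x}))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False                     # unparsable prediction counts as wrong

assert symbolically_equal("sin(x)**2", "1 - cos(x)**2")
```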

Ongoing work challenges models to generalize not only across domains but also across problem formats and reasoning types, making TheoremQA-T a principled arena for architectural, algorithmic, and dataset advances in theorem-driven question answering (Chen et al., 2023; Dong et al., 4 Nov 2024).
