TheoremQA-T Benchmark
- TheoremQA-T is a theorem-driven question-answering benchmark designed to evaluate LLMs on applying formal theorems across multiple scientific and engineering domains.
- It features 800 expert-curated questions tied to 354 theorems from mathematics, physics, EE&CS, and finance, utilizing chain-of-thought and program-of-thought approaches.
- The benchmark also serves as a testbed for RL-based hierarchical proof decomposition, which complements prompting strategies and boosts overall scientific reasoning performance.
TheoremQA-T is a theorem-driven question-answering benchmark that rigorously evaluates the capacity of LLMs to apply formal theorems to solve complex science and mathematics problems. Spanning 800 expert-curated questions across mathematics, physics, electrical engineering, computer science, and finance, it tests models on their ability to invoke core theorems in explicit multi-step reasoning, with answer formats chosen for automatic evaluation and straightforward normalization. The benchmark functions both as a dataset and as a framework for empirical assessment of prompting strategies, algorithmic decomposition, and RL-based training for theorem-centric scientific reasoning (Chen et al., 2023, Dong et al., 4 Nov 2024).
1. Dataset Structure and Taxonomy
TheoremQA-T consists of (question, theorem, answer) triples: each question is a university-level problem requiring explicit use of a specified theorem, and each answer takes a simple form (integer, float, list, boolean, or multiple choice). The 354 theorems are classified into:
- Mathematics (199 theorems): Encompassing calculus (Taylor, Stokes, Divergence), algebra (Sylow, Lagrange), number theory (Fermat, CRT), combinatorics, probability, topology, optimization.
- Physics (52 theorems): Classical mechanics (Noether), electromagnetism (Maxwell), quantum mechanics, thermodynamics.
- EE&CS (48 theorems): Signal processing (Nyquist–Shannon), coding theory (Huffman), algorithms, control theory.
- Finance (55 theorems): Asset pricing (Black–Scholes), risk, annuities.
Questions are drawn from textbook sources and expert rewriting, focusing on problems where a named theorem must be applied in a nontrivial solution pipeline. Multimodal (image-based) questions account for 6% of the benchmark. No fixed train/test split is mandated; a typical practice is a random 80/10/10 split for reproducibility (Chen et al., 2023).
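For concreteness, the following hypothetical record illustrates the (question, theorem, answer) triple structure; the field names and the specific problem are illustrative assumptions, not the dataset's actual schema or contents.

```python
# Hypothetical TheoremQA-T style record (illustrative only; not the real schema).
from math import comb

example_record = {
    "question": "A fair coin is flipped 10 times. Using the binomial "
                "distribution, what is the probability of exactly 6 heads?",
    "theorem": "Binomial Distribution",
    "field": "Mathematics",
    "answer_type": "float",
    "answer": round(comb(10, 6) * 0.5 ** 10, 3),   # 210 / 1024 ≈ 0.205
}
```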
2. Theorem Statement Encoding and Example Problems
Every question in TheoremQA-T is anchored to a canonical theorem with a LaTeX-formatted statement. Examples include:
- Taylor’s Theorem: For $f$ that is $(n{+}1)$-times differentiable on an interval containing $a$ and $x$,
  $$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{n+1}$$
  for some $\xi$ between $a$ and $x$.
- Lagrange’s Theorem (Group Order): If $H$ is a subgroup of a finite group $G$, then $|H|$ divides $|G|$.
- Huffman Coding Theorem: $H(X) \le \bar{L} < H(X) + 1$ for the expected length $\bar{L}$ of an optimal prefix-free code on symbol set $\mathcal{X}$.
Example problem walkthroughs illustrate mapping the natural language question to formal theorem invocation and extracting an answer in a simple numeric format. For instance, using Stokes’ theorem to evaluate the circulation of a vector field, or Wiener process theorems for covariance computations.
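As a concrete instance of the latter, the standard Wiener-process covariance identity reduces such a question to a one-line computation (a minimal worked example, not drawn from the benchmark):

$$\operatorname{Cov}(W_s, W_t) = \mathbb{E}[W_s W_t] = \min(s, t), \qquad \text{e.g.}\ \operatorname{Cov}(W_2, W_5) = 2.$$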
3. Evaluation Protocols and Prompting Strategies
Evaluation proceeds via two principal modes:
- Chain-of-Thought (CoT): The model generates stepwise natural language reasoning culminating in a direct answer.
- Program-of-Thought (PoT): The model outputs an executable Python program implementing the solution pipeline, which is sandboxed to compute the final answer.
Post-processing includes span extraction and normalization through external APIs (e.g., WolframAlpha), ensuring strict exact-match scoring. Four main metrics are reported: overall accuracy, field-wise accuracy, type-wise accuracy, and program executability (for PoT).
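A minimal sketch of this PoT execution-and-normalization pipeline is shown below; the question, the generated program, and the rounding-based normalization are illustrative assumptions rather than the benchmark's exact harness.

```python
# Illustrative PoT-style evaluation: the model is assumed to emit a
# self-contained Python program that assigns its result to `answer`; the
# harness executes it in isolation, normalizes, and scores by exact match.
import math

generated_program = """
# Hypothetical model output for: "Using the compound-interest formula, what is
# the value of $1,000 after 2 years at 5% annual interest, compounded annually?"
principal, rate, years = 1000.0, 0.05, 2
answer = principal * (1 + rate) ** years
"""

def run_pot(program: str) -> float:
    """Execute the generated program in a restricted namespace and return `answer`."""
    namespace = {"math": math}   # expose only what the program may need
    exec(program, namespace)     # a real harness would sandbox this call
    return namespace["answer"]

def normalize(x: float, digits: int = 2) -> float:
    """Round to fixed precision so exact-match scoring tolerates float noise."""
    return round(float(x), digits)

prediction = normalize(run_pot(generated_program))   # 1102.5
gold = 1102.5
print("correct" if prediction == gold else "incorrect")
```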
| Model (PoT accuracy, %) | Math | CS/EE | Physics | Finance | Overall |
|---|---|---|---|---|---|
| GPT-4 | 52.0 | 51.4 | 45.8 | 66.7 | 52.4 |
| ChatGPT | 35.7 | 35.6 | 26.7 | 49.4 | 35.6 |
| Open-source LMs | <15 | <28 | <6 | <13 | <15 |
| Random Baseline | 10.0 | 24.7 | 0.0 | 4.9 | 10.5 |
GPT-4 achieves top accuracy with PoT prompting (52.4%), while open-source models perform marginally above random. Finance questions are the easiest; physics the hardest. Error breakdown attributes most GPT-4 failures to minor calculation mistakes, while open models largely fail due to lack of theorem knowledge (Chen et al., 2023).
4. RL-Based Proof Decomposition and Hierarchical Reasoning
Recent advancements (Dong et al., 4 Nov 2024) propose RL frameworks that enhance LLM performance on theorem-solving tasks by rewarding explicit hierarchical decomposition:
- Conditional Proofs: Models can insert lemma proposals using explicit tokens (e.g., `<invoke>…</invoke>`), triggering recursive proof-tree construction.
- Hierarchical Reward Design: Partial progress is rewarded; any correct sub-goal or lemma earns positive reinforcement even if the full theorem is not solved (see the sketch below).
- Value Function: A learned value function over proof states guides exploratory lemma proposals toward promising directions.
- Hindsight Augmentation: Novel, correct lemmas found during RL are recycled into the training buffer and augmented as ground-truth, massively expanding the effective sample space (with 37.7% of buffer lemmas being novel).
- Implementation: The policy is trained via REINFORCE on weighted samples, minimizing a reward-weighted negative log-likelihood over sampled proofs and proposed lemmas.
Empirically, RL with product-decomposition and lemma rewards (ProD-RL) improves pass@16 from 40.8% (SFT) to 45.5% on held-out AFP and from 36.5% to 39.5% out-of-distribution (Dong et al., 4 Nov 2024).
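To make the reward and hindsight mechanisms concrete, the following Python sketch shows one way partial credit and lemma recycling could be computed over a proof tree; the data structure, bonus values, and function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative hierarchical reward over a proof tree: any proved sub-lemma earns
# partial credit even if the root theorem is unproven, and proved lemmas are
# collected into a replay buffer for hindsight augmentation. (Sketch only.)
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProofNode:
    statement: str                                   # lemma or theorem statement
    proved: bool                                     # accepted by the verifier?
    children: List["ProofNode"] = field(default_factory=list)

def hierarchical_reward(root: ProofNode,
                        root_bonus: float = 1.0,
                        lemma_bonus: float = 0.2) -> float:
    """Full credit if the root is proved, plus partial credit per proved lemma."""
    reward = root_bonus if root.proved else 0.0
    stack = list(root.children)
    while stack:
        node = stack.pop()
        if node.proved:
            reward += lemma_bonus
        stack.extend(node.children)
    return reward

def hindsight_lemmas(root: ProofNode) -> List[str]:
    """Collect proved lemma statements to recycle as new training targets."""
    proved, stack = [], list(root.children)
    while stack:
        node = stack.pop()
        if node.proved:
            proved.append(node.statement)
        stack.extend(node.children)
    return proved
```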
5. Implications and Adaptation for TheoremQA-T
Applying RL-based hierarchical proof strategies to TheoremQA-T suggests incorporating:
- Intermediate Sub-goal Tokens: Allow models to explicitly propose and pursue lemma-like sub-questions in both natural language and formalized forms.
- Partial Success Rewards: Structure data generation, training, and evaluation to reward any correct sub-step (not just final answers).
- Hybrid Verification: Substitute full formal verification (e.g., Isabelle) with lightweight, automatic checkers or neural verifiers for natural language sub-questions.
- Depth Limiting: Impose a maximum proof-tree depth to manage recursion in problem decomposition for complex theorem-based tasks (a sketch follows this list).
- Replay Buffer Expansion: Retain novel lemma solutions for future sample augmentation, facilitating continual learning.
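A minimal sketch of the depth-limited decomposition loop implied by these points follows; `propose_subquestions` and `answer_directly` are hypothetical placeholders for model calls, not part of any existing API.

```python
# Illustrative depth-limited sub-goal decomposition for a TheoremQA-T question.
# Both helper functions are stubs standing in for LLM calls. (Sketch only.)
from typing import List, Optional

MAX_DEPTH = 3   # cap the proof-tree depth to keep recursion tractable

def propose_subquestions(question: str) -> List[str]:
    """Placeholder: ask the model for lemma-like sub-questions."""
    return []

def answer_directly(question: str) -> Optional[str]:
    """Placeholder: ask the model for a direct answer, or None if it abstains."""
    return None

def solve(question: str, depth: int = 0) -> Optional[str]:
    """Try a direct answer first; otherwise recurse on sub-questions up to MAX_DEPTH."""
    direct = answer_directly(question)
    if direct is not None or depth >= MAX_DEPTH:
        return direct
    sub_answers = [solve(sub, depth + 1) for sub in propose_subquestions(question)]
    # Re-prompt with solved sub-goals attached as known partial results.
    return answer_directly(f"{question}\nKnown sub-results: {sub_answers}")
```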
A plausible implication is that these methods, when adapted, can substantially enhance LLM pass rates on TheoremQA-T, particularly for open-source models currently bottlenecked by knowledge and reasoning granularity (Dong et al., 4 Nov 2024). Integration of structured RL with hierarchical decomposition transforms TheoremQA-T evaluations from atomic answer scoring to multi-step scientific reasoning with explicit theorem invocation.
6. Directions for Research and Benchmark Evolution
TheoremQA-T establishes best practices for rigorous evaluation, favoring Program-of-Thought prompting with normalization for robust automated scoring. Recommendations for researchers include reporting both CoT and PoT performance, full type/category breakdowns, and executability statistics. Promising extensions encompass:
- Symbolic solutions (formulas, matrices) beyond simple scalar answers.
- Deep integration of symbolic theorem statements in prompts.
- Advanced handling of multimodal (diagram) questions.
- Science-focused pretraining for open-source models to narrow the gap with closed-source leaders.
- Dynamic, multi-theorem reasoning or retrieval-based decomposition.
Ongoing work challenges models to generalize not only across domains but also across problem formats and reasoning types, making TheoremQA-T a principled arena for architectural, algorithmic, and dataset advances in theorem-driven question answering (Chen et al., 2023, Dong et al., 4 Nov 2024).