TheoremQA: A Theorem-driven Question Answering dataset

Published 21 May 2023 in cs.CL and cs.AI | (2305.12524v3)

Abstract: The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems (e.g. Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem, etc) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting. All the existing open-sourced models are below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can be used as a better benchmark to evaluate LLMs' capabilities to solve challenging science problems. The data and code are released in https://github.com/wenhuchen/TheoremQA.

Summary

The paper presents TheoremQA, a diverse dataset of 800 questions covering 350 theorems to benchmark complex scientific reasoning.
The paper evaluates 16 LLMs using Chain-of-Thoughts and Program-of-Thoughts prompting, with GPT-4 achieving 51% accuracy through Program-of-Thoughts prompting.
The paper highlights the need for improved pre-training and multimodal integration to enhance LLMs’ performance on theorem-driven tasks.

TheoremQA: A Theorem-Driven Question Answering Dataset

This paper introduces TheoremQA, a benchmark dataset developed to evaluate how well LLMs apply scientific theorems to solve complex problems in Mathematics, Physics, Electrical Engineering and Computer Science (EE&CS), and Finance. The authors curated 800 high-quality questions that span 350 theorems, aiming to address the limitations of existing question answering (QA) datasets, which often lack domain-specific knowledge and complexity. The paper provides a framework for measuring the efficacy of LLMs on theorem-driven questions and compares the performance of various models and prompting strategies.

Key Contributions

TheoremQA presents several significant contributions to the field of AI-driven QA systems:

Dataset Composition: The dataset fills a critical gap by incorporating university-level theorems from a broad range of scientific domains. This variety distinguishes TheoremQA from previous datasets focused on fundamental math skills.
Model Evaluation: The paper evaluates 16 large language and code models using two prompting strategies, Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT). GPT-4 achieved the highest accuracy, 51% with PoT prompting, while all open-source models scored below 15%, barely above the random-guess baseline. Even the strongest model therefore answers only about half of the questions correctly.
Implications for Open-Source Models: The stark performance discrepancy between GPT-4 and open-source models underscores the need for further advancements in pre-training and tuning methods, particularly in integrating scientific knowledge more deeply into model architectures.
Error Analysis and Theoretical Insights: Through error analysis, the authors identify areas where even advanced models like GPT-4 face challenges. Most errors were minor, suggesting that with improved prompting strategies, performance could be enhanced.
Multimodal Challenges: The study also probes the challenges of integrating multimodal inputs, revealing current limitations with visual data and indicating areas for future research.

Experimental Insights

The evaluation of LLMs on TheoremQA provided several notable insights:

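Prompting Strategy Efficacy: CoT and PoT lead to different performance outcomes. With CoT the model reasons step by step in natural language, whereas with PoT it writes a program whose execution produces the answer, so the arithmetic is delegated to an interpreter. PoT generally improved accuracy, and GPT-4 reached its best result (51%) with it, which points to the value of offloading computation to executable code in reasoning tasks.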
Performance Gap: A pronounced gap in performance was observed between proprietary models, like GPT-4, and open-source counterparts, illustrating the advanced capabilities of proprietary LLMs in reasoning and comprehension tasks.
Multimodal Processing: Models struggled substantially with multimodal queries, primarily due to the current limitations of visual transformers in handling complex scientific illustrations.
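To make the Program-of-Thoughts idea concrete, the following is a minimal sketch of the kind of program a model might emit for a theorem-driven question. The question, the function name huffman_expected_length, and the probabilities are hypothetical and chosen only for illustration; they are not taken from the dataset. Huffman coding is used because the abstract lists it among the covered theorems.

```python
import heapq

def huffman_expected_length(probs):
    """Return the expected codeword length (bits/symbol) of a binary Huffman code."""
    # Each merge of the two least likely subtrees adds one bit to every
    # symbol beneath them, so the expected length equals the sum of the
    # probabilities of all merged (internal) nodes.
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

# Hypothetical question (not from TheoremQA): "A source emits four symbols
# with probabilities 0.4, 0.3, 0.2, 0.1. What is the expected length of a
# binary Huffman code for this source?"
answer = huffman_expected_length([0.4, 0.3, 0.2, 0.1])
print(round(answer, 4))
```

By hand, the optimal code lengths here are 1, 2, 3, and 3 bits, giving 0.4 + 0.6 + 0.6 + 0.3 = 1.9 bits per symbol. The point of the sketch is that the model only has to choose the right theorem and encode it as a procedure, while the interpreter carries out the computation.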

Future Implications and Research Directions

TheoremQA serves as a foundational step towards improving AI systems' proficiency in handling theorem-based scientific inquiries. The paper underlines several future research directions:

Advanced Pre-Training Techniques: There is an opportunity to close the performance disparity between open-source models and GPT-4 through domain-specific pre-training and fine-tuning strategies.
Multimodal Development: Enhancements in processing visual data and integrating it with textual reasoning remain crucial. Developing sophisticated methodologies for visual input encoding could remove existing bottlenecks.
Refined Evaluation Metrics: As models increasingly engage in complex reasoning, more robust and nuanced evaluation metrics are necessary to accurately capture performance.
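As one illustration of why scoring needs care, the sketch below compares predicted and reference answers with a numeric tolerance rather than exact string equality. It is not the paper's evaluation code: the function names is_correct and accuracy, the 1% relative tolerance, and the sample answers are all assumptions made for this example.

```python
import math

def is_correct(predicted, reference, rel_tol=1e-2):
    """Compare a predicted answer with a reference answer.

    Numeric answers match within a relative tolerance (assumed 1% here);
    everything else must match after simple text normalization.
    """
    # bool is a subclass of int in Python, so handle it before numbers.
    if isinstance(predicted, bool) or isinstance(reference, bool):
        return predicted is reference
    numeric = (int, float)
    if isinstance(predicted, numeric) and isinstance(reference, numeric):
        return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=1e-9)
    return str(predicted).strip().lower() == str(reference).strip().lower()

def accuracy(predictions, references):
    """Fraction of predictions judged correct by is_correct."""
    pairs = list(zip(predictions, references))
    return sum(is_correct(p, r) for p, r in pairs) / len(pairs)

# Hypothetical answers: a float, an option letter, and a boolean.
score = accuracy([1.9, "a", True], [1.9001, "b", True])
print(score)
```

A stricter or looser tolerance changes which numeric answers count as correct, which is one reason reported accuracies depend on the details of the metric.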

TheoremQA represents a substantial contribution towards understanding and advancing LLMs' capabilities in tackling scientifically rigorous tasks. By establishing a benchmark for theorem-specific question answering, the paper lays the groundwork for subsequent improvements in both model sophistication and dataset development.
