TheoremQA: A Theorem-driven Question Answering dataset (2305.12524v3)

Published 21 May 2023 in cs.CL and cs.AI

Abstract: The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems (e.g. Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem, etc) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting. All the existing open-sourced models are below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can be used as a better benchmark to evaluate LLMs' capabilities to solve challenging science problems. The data and code are released in https://github.com/wenhuchen/TheoremQA.

Authors (9)
  1. Wenhu Chen (134 papers)
  2. Ming Yin (70 papers)
  3. Max Ku (11 papers)
  4. Pan Lu (42 papers)
  5. Yixin Wan (19 papers)
  6. Xueguang Ma (36 papers)
  7. Jianyu Xu (11 papers)
  8. Xinyi Wang (152 papers)
  9. Tony Xia (5 papers)
Citations (77)

Summary

  • The paper presents TheoremQA, a diverse dataset of 800 questions covering 350 theorems to benchmark complex scientific reasoning.
  • The paper evaluates 16 LLMs using advanced prompting strategies, with GPT-4 achieving 51% accuracy through Program-of-Thoughts prompting.
  • The paper highlights the need for improved pre-training and multimodal integration to enhance LLMs’ performance on theorem-driven tasks.

TheoremQA: A Theorem-Driven Question Answering Dataset

This paper introduces TheoremQA, a novel benchmark dataset developed to evaluate the capabilities of LLMs in applying scientific theorems to solve complex problems in Mathematics, Physics, Electrical Engineering & Computer Science, and Finance. The authors curated 800 high-quality questions spanning 350 theorems, addressing the limitations of existing question answering (QA) datasets, which often lack domain-specific knowledge and complexity. The paper provides a framework for assessing how LLMs handle theorem-driven questions, comparing the performance of various models and prompting strategies.
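
To make the dataset's shape concrete, the sketch below shows how one might iterate over TheoremQA records and tally coverage per scientific field. The file name and the field keys (`field`, and by extension `Question` and `Answer`) are assumptions for illustration, not the confirmed schema; the released format is documented in the repository at https://github.com/wenhuchen/TheoremQA.

```python
import json

# Hypothetical local copy of the released data; see the repository
# (https://github.com/wenhuchen/TheoremQA) for the actual file name and schema.
DATASET_PATH = "theoremqa.json"

with open(DATASET_PATH, encoding="utf-8") as f:
    records = json.load(f)

# Tally questions per scientific field to see the benchmark's coverage.
# The "field" key is an assumed name used only for this sketch.
counts = {}
for record in records:
    field = record.get("field", "unknown")
    counts[field] = counts.get(field, 0) + 1

for field, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{field}: {n} questions")
```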

Key Contributions

TheoremQA presents several significant contributions to the field of AI-driven QA systems:

  1. Dataset Composition: The dataset fills a critical gap by incorporating university-level theorems from a broad range of scientific domains. This variety distinguishes TheoremQA from previous datasets focused on fundamental math skills.
  2. Model Evaluation: The paper evaluates 16 large language and code models using advanced prompting strategies such as Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT). GPT-4 achieved 51% accuracy with PoT prompting, significantly outperforming all other models, yet that figure also marks how much headroom remains on theorem-driven problems (a minimal PoT-style sketch follows this list).
  3. Implications for Open-Source Models: The stark performance discrepancy between GPT-4 and open-source models underscores the need for further advancements in pre-training and tuning methods, particularly in integrating scientific knowledge more deeply into model architectures.
  4. Error Analysis and Theoretical Insights: Through error analysis, the authors identify areas where even advanced models like GPT-4 face challenges. Most errors were minor, suggesting that with improved prompting strategies, performance could be enhanced.
  5. Multimodal Challenges: The paper also probes the challenges of integrating multimodal inputs, revealing current limitations with visual data and indicating areas for future research.
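
As referenced in item 2 above, the following is a minimal sketch of Program-of-Thoughts-style prompting: the model is asked to emit Python code that computes the answer, the code is executed, and the printed result is taken as the prediction. The `query_llm` helper and the exact prompt wording are placeholders for illustration, not the authors' implementation.

```python
import subprocess
import sys
import textwrap


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; not part of the paper's released code."""
    raise NotImplementedError


def solve_with_pot(question: str) -> str:
    # Ask the model for executable code instead of a free-form chain of thought.
    prompt = textwrap.dedent(f"""\
        Solve the following problem by writing Python code.
        Print only the final numeric answer.

        Question: {question}

        # Python code:
        """)
    code = query_llm(prompt)

    # Run the generated code and capture its stdout as the prediction.
    # (A real harness would sandbox this and handle timeouts/errors explicitly.)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip()
```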

Experimental Insights

The evaluation of LLMs on TheoremQA provided several notable insights:

  • Prompting Strategy Efficacy: CoT and PoT lead to different performance outcomes, with PoT generally improving accuracy because it offloads numerical computation to executed code rather than token-by-token calculation. GPT-4 benefited most from PoT, underscoring the value of coupling reasoning with program execution.
  • Performance Gap: A pronounced gap in performance was observed between proprietary models, like GPT-4, and open-source counterparts, illustrating the advanced capabilities of proprietary LLMs in reasoning and comprehension tasks.
  • Multimodal Processing: Models struggled substantially with multimodal queries, primarily due to the current limitations of visual transformers in handling complex scientific illustrations.

Future Implications and Research Directions

TheoremQA serves as a foundational step towards improving AI systems' proficiency in handling theorem-based scientific inquiries. The paper underlines several future research directions:

  • Advanced Pre-Training Techniques: There is an opportunity to close the performance disparity between open-source models and GPT-4 through domain-specific pre-training and fine-tuning strategies.
  • Multimodal Development: Enhancements in processing visual data and integrating it with textual reasoning remain crucial. Developing sophisticated methodologies for visual input encoding could remove existing bottlenecks.
  • Refined Evaluation Metrics: As models increasingly engage in complex reasoning, more robust and nuanced evaluation metrics are necessary to accurately capture performance; a hedged sketch of tolerance-based numeric answer matching follows this list.
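
In that spirit, the snippet below sketches one common way to score numeric predictions with a relative tolerance instead of exact string matching. The 1% tolerance and the normalization steps are illustrative assumptions, not the benchmark's official grader.

```python
import math


def is_correct(prediction: str, reference: float, rel_tol: float = 1e-2) -> bool:
    """Score a numeric prediction against a reference answer.

    A relative tolerance absorbs small rounding differences that an exact
    string comparison would penalize; the 1% tolerance is an illustrative
    choice, not the benchmark's official setting.
    """
    try:
        value = float(prediction.replace(",", "").strip())
    except ValueError:
        return False  # non-numeric output counts as incorrect
    return math.isclose(value, reference, rel_tol=rel_tol)


# Example: a rounded model output still counts as correct.
print(is_correct("3.1416", math.pi))  # True
```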

TheoremQA represents a substantial contribution towards understanding and advancing LLMs' capabilities in tackling scientifically rigorous tasks. By establishing a benchmark for theorem-specific question answering, the paper lays the groundwork for subsequent improvements in both model sophistication and dataset development.
