
Large Language Models for Mathematical Reasoning: Progresses and Challenges (2402.00157v4)

Published 31 Jan 2024 in cs.CL

Abstract: Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of LLMs geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

Introduction

The landscape of mathematical reasoning has been substantially impacted by the rise of LLMs, which have demonstrated impressive capabilities in solving a range of mathematical problems. This paper provides a comprehensive survey of the current state of LLMs in mathematical problem-solving, laying out the diverse problem types and datasets that have been explored, as well as the techniques developed for this purpose.

Mathematical Problem Types and Datasets

The survey categorizes mathematical problems tackled by LLMs into several domains: Arithmetic, Math Word Problems (MWP), Geometry, Automated Theorem Proving (ATP), and Math in the Vision-Language Context. Each domain presents its own challenges and datasets. The paper details the characteristics of these problems, from straightforward arithmetic operations to intricate MWPs requiring textual comprehension and multi-step reasoning. Moreover, it outlines how MWPs can vary widely, offering examples and listing key datasets, such as SVAMP and MAWPS, which aid in training and benchmarking LLMs' mathematical abilities.
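To make the benchmarking setup concrete, the sketch below shows how an MWP-style evaluation is commonly structured: a question paired with a gold numeric answer, and a scorer that extracts the final number from a model's free-form response. The field names and the extraction heuristic are illustrative, not the actual schema of SVAMP or MAWPS.

```python
import re

# Illustrative MWP entry; real benchmark schemas differ.
problem = {
    "question": "Jack has 8 apples. He gives 3 to Jill. How many apples does Jack have left?",
    "answer": 5,
}

def check_prediction(predicted: str, gold: float) -> bool:
    """Extract the last number in a free-form answer and compare it to the gold value."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", predicted)
    return bool(nums) and float(nums[-1]) == float(gold)

print(check_prediction("Jack gives away 3, so 8 - 3 = 5.", problem["answer"]))  # True
```

Scorers like this are deliberately lenient about surrounding text, which is one reason reported accuracies can vary across evaluation setups.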

Methodologies for Enhancing LLMs’ Capabilities

The paper delineates the various methodologies deployed to augment LLMs for mathematical reasoning, ranging from simple prompting of pre-trained models to more intricate techniques such as fine-tuning on specialized datasets. The methodologies discussed include the use of external tools to verify answers; advanced prompting methods like Chain-of-Thought, which improve models' reasoning steps; and fine-tuning strategies that enhance intermediate step generation and learn from augmented datasets. Consideration is also given to teacher-student knowledge distillation, emphasizing the potential for building smaller models with high proficiency in solving math problems.
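Chain-of-Thought prompting, in its few-shot form, amounts to prepending worked exemplars so the model imitates step-by-step reasoning before stating an answer. The sketch below builds such a prompt; the exemplar wording and template are illustrative, and the actual model call (which depends on the chosen LLM API) is omitted.

```python
# One worked exemplar demonstrating explicit intermediate steps.
EXEMPLAR = (
    "Q: A baker made 24 rolls and sold 15. How many are left?\n"
    "A: The baker starts with 24 rolls. After selling 15, 24 - 15 = 9 remain. "
    "The answer is 9.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the worked exemplar so the model continues in the same step-by-step style."""
    return EXEMPLAR + f"Q: {question}\nA:"

prompt = build_cot_prompt("Tom has 3 boxes of 12 pencils each. How many pencils in total?")
print(prompt)
```

In practice, several exemplars are used, and zero-shot variants replace them with a trigger phrase such as "Let's think step by step."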

Analysis and Challenges

The robustness of LLMs in mathematics is particularly scrutinized, revealing a disparity in models' abilities to maintain performance under input variation. Factors influencing LLMs in math are also examined, such as prompt efficiency, tokenization methods, and model scale, contributing to a comprehensive understanding of LLMs' arithmetic capabilities. Despite notable advancements, challenges persist in the form of LLMs' brittleness in mathematical reasoning and their limited generalization beyond data-driven approaches. Furthermore, there is a salient need for human-centered design in LLMs to ensure usability in educational settings, addressing aspects of user comprehension and adaptive feedback.
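A common way to probe the brittleness described above is to perturb surface features of a problem (numbers, names) while keeping its structure, and check whether accuracy holds. The sketch below implements one such perturbation, a simplified probe in the spirit of SVAMP-style variations rather than any specific benchmark's procedure.

```python
import random
import re

def perturb_numbers(problem: str, rng: random.Random) -> str:
    """Replace each integer in a word problem with a nearby value,
    preserving the problem's structure while changing its surface form."""
    def repl(match: re.Match) -> str:
        n = int(match.group())
        return str(n + rng.randint(1, 3))  # shift each number by a small amount
    return re.sub(r"\d+", repl, problem)

rng = random.Random(0)  # seeded for reproducibility
original = "Sam has 5 marbles and buys 7 more."
print(perturb_numbers(original, rng))
```

A robust solver should answer the perturbed variant correctly whenever the gold answer is recomputed accordingly; large accuracy drops under such perturbations signal memorization rather than reasoning.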

Educational Impact and Outlook

The implications of utilizing LLMs for mathematics within educational contexts are multifaceted, with LLMs having the potential to serve as powerful tools for aiding in learning and instruction. However, the current approaches often do not address the uniqueness of individual student needs or learning styles, nor do they consider the complexity or practicality of responses in line with students’ cognitive abilities. This paper calls for a delicate balance between machine efficiency and human-centric design, to ensure that LLMs serve as effective educational supplements.

In conclusion, the survey presents an intricate tapestry of achievements and challenges in the interplay between LLMs and mathematical reasoning. LLMs have proven their worth in various mathematical domains, yet the quest for more robust, adaptive, and human-oriented solutions continues to be a dynamic area of research and development.

Authors (6)
  1. Janice Ahn
  2. Rishu Verma
  3. Renze Lou
  4. Di Liu
  5. Rui Zhang
  6. Wenpeng Yin
Citations (63)