
MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning (2402.17231v3)

Published 27 Feb 2024 in cs.CL

Abstract: Tool-augmented LLMs (TALMs) are known to enhance the skillset of LLMs, thereby improving their reasoning abilities across many tasks. While TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving, are open research questions. In this work, we present MathSensei, a tool-augmented LLM for mathematical reasoning. We study the complementary benefits of the tools - knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API) - through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning across diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on model performance. MathSensei achieves 13.5% better accuracy than gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and that the benefit increases as the complexity and required knowledge increase (progressively over AQuA, MMLU-Math, and higher-level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.
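The abstract describes chaining tools (retriever, program executor, symbolic solver) so that each tool's output feeds the next, with a planner deciding the ordering. A minimal sketch of that sequencing idea is below; all names (`Tool`, `run_pipeline`, the toy "extract" step) are hypothetical illustrations, not the paper's actual API. The real system drives Bing Web Search, a Python executor, and the Wolfram-Alpha API via an LLM, whereas this stub stands in a trivial expression extractor and evaluator:

```python
# Minimal sketch of tool sequencing in the spirit of a tool-augmented LLM.
# Each tool transforms a running context; the ordering of tools is the
# "plan" whose impact the paper studies.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    name: str
    run: Callable[[str, str], str]  # (question, context) -> new context


def python_executor(question: str, context: str) -> str:
    # Stand-in for the "program generator + executor" tool: here the
    # "generated program" is simply the arithmetic expression already
    # placed in the context by the previous tool.
    return str(eval(context, {"__builtins__": {}}, {}))


def run_pipeline(question: str, tools: List[Tool], context: str = "") -> str:
    # Tool sequencing: each tool enriches the running context in order.
    for tool in tools:
        context = tool.run(question, context)
    return context


answer = run_pipeline(
    "What is 13 * 7 + 2?",
    tools=[
        # Hypothetical first step: pull the expression out of the question
        # (a real system would use an LLM or a retriever here).
        Tool("extract", lambda q, c: q.split("is ")[1].rstrip("?")),
        Tool("python", python_executor),
    ],
)
print(answer)  # 93
```

Reordering or swapping the tools in the list changes what context the executor sees, which is the kind of sequencing effect the ablations in the paper measure.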

Authors (4)
  1. Debrup Das (3 papers)
  2. Debopriyo Banerjee (8 papers)
  3. Somak Aditya (25 papers)
  4. Ashish Kulkarni (8 papers)
Citations (5)
