ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving (2309.17452v4)

Published 29 Sep 2023 in cs.CL and cs.AI

Abstract: LLMs have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.

Authors (8)
  1. Zhibin Gou (15 papers)
  2. Zhihong Shao (20 papers)
  3. Yeyun Gong (78 papers)
  4. Yelong Shen (83 papers)
  5. Yujiu Yang (155 papers)
  6. Minlie Huang (226 papers)
  7. Nan Duan (172 papers)
  8. Weizhu Chen (128 papers)
Citations (103)

Summary

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

The paper "ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving" addresses the challenges faced by open-source LLMs in advanced mathematical reasoning tasks. The authors introduce ToRA, which stands for Tool-integrated Reasoning Agents, a series of models that integrate natural language reasoning with program-based tool use. This combination aims to leverage the semantic and abstract reasoning capabilities of LLMs alongside the precise computational abilities of external tools.

Approach

The authors developed ToRA by enhancing open-source models to interleave natural language reasoning with program-based tool use. This method draws on two primary approaches, which the sketch after the list below contrasts:

  1. Rationale-Based Methods: Step-by-step natural language reasoning.
  2. Program-Based Methods: Solving tasks by synthesizing and executing programs.
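To make the contrast between these two families concrete, the sketch below works a toy problem both ways. It is an illustration written for this summary, not code from the paper, and the program-based variant assumes SymPy is installed.

```python
# Toy problem: "What is the sum of the roots of x^2 - 5x + 6 = 0?"

# 1. Rationale-based (chain-of-thought): the model emits only natural-language
#    reasoning, so every intermediate computation is carried out by the LLM itself.
rationale = (
    "By Vieta's formulas, the sum of the roots of x^2 - 5x + 6 = 0 "
    "is -(-5)/1 = 5. The answer is 5."
)

# 2. Program-based (PAL/PoT style): the model writes a program and delegates the
#    computation to an interpreter or symbolic solver.
from sympy import solve, symbols

x = symbols("x")
roots = solve(x**2 - 5 * x + 6, x)  # -> [2, 3]
print(sum(roots))                   # -> 5
```

ToRA's tool-integrated format interleaves the two styles: a short rationale plans each step, a program performs the exact computation, and the executed output informs the next rationale.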

ToRA aims to synergize these methods by generating comprehensive annotations (interactive tool-use trajectories) for mathematical problems and applying imitation learning on these annotations. The key components of the approach include:

  • Curating Tool-Use Trajectories: Using GPT-4 to generate high-quality tool-use trajectories for mathematical problems from datasets like GSM8k and MATH; a simplified version of this generate-execute-continue loop is sketched after the list.
  • Imitation Learning: Training models on curated datasets to understand and utilize interactive tool-use trajectories.
  • Output Space Shaping: Enhancing the model's ability to explore diverse valid trajectories through additional training on sampled and corrected outputs.
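A simplified version of the generate-execute-continue loop behind these components is sketched below. This is a schematic reconstruction rather than the authors' code: `generate` is a hypothetical placeholder for any LLM completion call, and the paper's actual prompt templates, stop sequences, and sandboxing details may differ.

```python
import re
import subprocess


def run_python(code: str, timeout: int = 10) -> str:
    """Execute a model-written code block in a subprocess and return its output."""
    result = subprocess.run(
        ["python", "-c", code], capture_output=True, text=True, timeout=timeout
    )
    return (result.stdout or result.stderr).strip()


def tool_integrated_solve(problem: str, generate, max_rounds: int = 4) -> str:
    """Interleave model-written rationales/programs with execution feedback.

    `generate(prompt, stop)` stands in for an LLM call that continues the
    solution until it either finishes a ```python block or states the answer.
    """
    prompt = f"Question: {problem}\nSolution:\n"
    for _ in range(max_rounds):
        step = generate(prompt, stop=["```output"])
        prompt += step
        code_blocks = re.findall(r"```python\n(.*?)```", step, re.DOTALL)
        if not code_blocks:   # no program emitted: the rationale already ends
            break             # with the final answer, so the loop stops
        observation = run_python(code_blocks[-1])
        prompt += f"```output\n{observation}\n```\n"  # feed the result back
    return prompt
```

The same loop view applies to data curation as described above: trajectories generated this way with GPT-4 form the imitation-learning corpus, and output space shaping then adds further training signal from sampled trajectories whose invalid steps have been corrected.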

Experimental Results

ToRA models were evaluated on ten diverse mathematical reasoning datasets. The results indicated significant performance improvements over previous state-of-the-art models. Key findings include:

  • Significant Improvements: ToRA models showed 13%-19% absolute improvements on average compared to existing open-source models.
  • Exceptional Performance: ToRA-7B achieved 44.6% accuracy on the competition-level MATH dataset, which is a 22% absolute improvement over the best previous open-source model, WizardMath-70B.
  • Open-Source Achievements: ToRA-Code-34B became the first open-source model to exceed 50% accuracy on the MATH dataset, clearly outperforming GPT-4's chain-of-thought result and remaining competitive with GPT-4 when it solves problems with programs.

Implications

The results suggest several important implications for AI and mathematical problem solving:

  • Synergistic Reasoning: Integrating natural language reasoning with program-based tool use can significantly enhance the problem-solving capabilities of LLMs, especially in complex domains like mathematics.
  • Training Strategies: Imitation learning combined with output space shaping presents a promising approach to training more flexible and capable models.
  • Open-Source Advantages: Achieving state-of-the-art performance with open-source models opens new avenues for widespread access and research in mathematical reasoning.

Future Directions

This research paves the way for exploring several future directions in the field of AI and mathematical problem solving:

  1. Enhanced Tool Use: Expanding the range of external tools and improving the integration mechanism could further increase the models' performance.
  2. Generalization: Understanding and overcoming the remaining challenges in generalization to out-of-distribution tasks.
  3. Complex Reasoning: Developing methods to handle even more complex reasoning steps, including diagram understanding and multi-step problem solving.
  4. Interactive Learning: Introducing more dynamic interaction protocols during training to simulate more realistic problem-solving scenarios.

Overall, ToRA's development and the accompanying results highlight the substantial potential of combining various reasoning strategies to enhance the capabilities of AI models in specialized domains. This research sets a strong foundation for future advancements in AI-driven mathematical reasoning.
