From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting (2401.05384v1)
Abstract: This paper investigates the performance of LLMs and tool-augmented LLMs on complex mathematical reasoning tasks. We introduce IMP-TIP: Improving Math Reasoning with Tool-augmented Interleaf Prompting, a framework that combines the strengths of both LLMs and tool-augmented LLMs. IMP-TIP follows the "From Good to Great" concept: it collects multiple potential solutions to the same math problem from both LLMs and their tool-augmented counterparts, then selects or re-generates the most accurate answer after cross-checking these solutions via tool-augmented interleaf prompting. The framework incorporates two key components: self-prompt and tool-augmented interleaf prompting (TIP). The former allows LLMs to autonomously refine and improve an initial prompt related to tool usage, while the latter enables LLMs to derive the final answer by dynamically analyzing the problem, cross-checking potential solutions, and revising previous reasoning hints in an interleaved manner. Experiments show that IMP-TIP outperforms both plain LLMs and tool-augmented LLMs in accuracy and reasoning diversity on math reasoning tasks; for instance, it improves tool-augmented ChatGPT on GSM8K-Hard from 56.0% to 65.2%.
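Below is a minimal Python sketch of the IMP-TIP control flow described in the abstract. All callables (`llm_solve`, `tool_solve`, `cross_check`, `revise`) are hypothetical stand-ins for prompts issued to an LLM and its tool-augmented counterpart; this illustrates the sample / cross-check / revise loop under those assumptions and is not the authors' implementation.

```python
from collections import Counter

def imp_tip(problem, llm_solve, tool_solve, cross_check, revise,
            k=3, max_rounds=3):
    """Illustrative IMP-TIP loop: sample candidate (reasoning, answer)
    pairs from a plain LLM and a tool-augmented LLM, then cross-check
    and revise them in an interleaved fashion until answers agree."""
    # "From Good to Great": start from a pool of plausible solutions
    # drawn from both solvers, then improve them.
    candidates = [llm_solve(problem) for _ in range(k)]
    candidates += [tool_solve(problem) for _ in range(k)]

    best = None
    for _ in range(max_rounds):
        # Tally the final answers of the current candidates.
        tally = Counter(answer for _, answer in candidates)
        best, votes = tally.most_common(1)[0]
        if votes == len(candidates):  # unanimous: accept the answer
            return best
        # Disagreement: ask the tool-augmented LLM to analyze the problem
        # and the conflicting solutions, then revise each candidate with
        # the resulting hints (the interleaved analyze/check/revise step).
        hints = cross_check(problem, candidates)
        candidates = [revise(problem, c, hints) for c in candidates]
    return best  # round budget exhausted: fall back to the majority answer
```

In the paper, each of these steps is realized as a prompt (self-prompt for tool usage, TIP for the cross-checking loop) rather than a Python function; the sketch only fixes the order of operations.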
Authors: Nuo Chen, Hongguang Li, Baoyuan Wang, Jia Li