S$^3$c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners (2409.01524v1)
Abstract: Self-correction is a novel method that can stimulate the potential reasoning abilities of LLMs. It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not treat self-correction as a spontaneous, intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S$^3$c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs step-level sampling to construct step-wise self-correction data for achieving this ability. Additionally, we implement a training strategy that uses the constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
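The abstract describes constructing step-wise self-correction data via step-level sampling: a flawed intermediate step is kept, followed by a correction, so the model learns to detect and fix errors mid-inference. The paper does not specify the data format, so the following is a minimal sketch under invented assumptions; the correction marker, function names, and toy verifier are all hypothetical, not the authors' actual pipeline.

```python
# Hypothetical sketch of step-wise self-correction data construction.
# A step-level verifier flags a sampled step as wrong; the flawed step is
# retained, a correction marker is emitted, and a corrected step follows,
# yielding a training trace that demonstrates spontaneous self-correction.

CORRECTION_TAG = "[WAIT, let me re-check this step]"  # invented marker

def build_self_correction_example(candidate_steps, verify, resample):
    """Build one step-wise self-correction trace.

    candidate_steps: reasoning steps sampled from a model (given here).
    verify: callable(step) -> bool, a stand-in for a step-level checker.
    resample: callable(trace_prefix) -> str, a stand-in for re-sampling
              a corrected continuation of the trace so far.
    """
    trace = []
    for step in candidate_steps:
        trace.append(step)  # keep the step, right or wrong
        if not verify(step):
            # Append a spontaneous correction after the flawed step.
            trace.append(CORRECTION_TAG)
            trace.append(resample(trace))
    return trace

# Toy usage: arithmetic steps checked by evaluating each equation.
candidate_steps = ["2 + 3 = 5", "5 * 4 = 21"]  # second step is wrong

def verify(step):
    lhs, rhs = step.split("=")
    return eval(lhs) == int(rhs)

def resample(trace_prefix):
    return "5 * 4 = 20"  # stand-in for re-sampling a corrected step

trace = build_self_correction_example(candidate_steps, verify, resample)
print(trace)
```

In this sketch the flawed step is deliberately kept in the trace, since the abstract's goal is for the model to recognize an error in its *ongoing* inference and correct it, rather than to be trained only on clean solutions.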
- Step-Level Value Preference Optimization for Mathematical Reasoning.
- Training Verifiers to Solve Math Word Problems.
- LM vs LM: Detecting Factual Errors via Cross Examination. https://arxiv.org/abs/2305.13281v1.
- Improving Factuality and Reasoning in Language Models through Multiagent Debate.
- Measuring Mathematical Problem Solving With the MATH Dataset.
- Active Retrieval Augmented Generation. In Bouamor, H.; Pino, J.; and Bali, K., eds., Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7969–7992. Singapore: Association for Computational Linguistics.
- When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs.
- Efficient Memory Management for Large Language Model Serving with PagedAttention.
- DotaMath: Decomposition of Thought with Code Assistance and Self-Correction for Mathematical Reasoning.
- Common 7B Language Models Already Possess Strong Math Capabilities. arXiv preprint arXiv:2403.04706.
- PRD: Peer Rank and Discussion Improve Large Language Model Based Evaluations.
- Let’s Verify Step by Step.
- Large Language Models Have Intrinsic Self-Correction Ability.
- Augmenting Math Word Problems via Iterative Question Composing. arXiv preprint arXiv:2401.09003.
- Self-Refine: Iterative Refinement with Self-Feedback.
- Meta AI. 2024. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. https://ai.meta.com/blog/meta-llama-3/.
- Mistral AI team. 2023. Mistral 7B. https://mistral.ai/news/announcing-mistral-7b/.
- Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies. Transactions of the Association for Computational Linguistics, 12: 484–506.
- Are NLP Models Really Able to Solve Simple Math Word Problems?
- REFINER: Reasoning Feedback on Intermediate Representations. In Graham, Y.; and Purver, M., eds., Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 1100–1126. St. Julian’s, Malta: Association for Computational Linguistics.
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models.
- Qwen Team. 2024. Introducing Qwen2-Math | Qwen. https://qwenlm.github.io/blog/qwen2-math/.
- Analysing Mathematical Reasoning Abilities of Neural Models.
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
- A Survey of Reasoning with Foundation Models.
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them.
- Evidence-Based Factual Error Correction. In Zong, C.; Xia, F.; Li, W.; and Navigli, R., eds., Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 3298–3309. Online: Association for Computational Linguistics.
- LLMs Cannot Find Reasoning Errors, but Can Correct Them given the Error Location. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 13894–13908. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics.
- Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations.
- Self-Consistency Improves Chain of Thought Reasoning in Language Models.
- Chain-of-Thought Reasoning Without Prompting.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
- Can We Verify Step by Step for Incorrect Answer Detection? arXiv preprint arXiv:2402.10528.
- Can LLMs Solve Longer Math Word Problems Better? arXiv preprint arXiv:2405.14804.
- InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models.
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search.
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning.
- OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement.
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models. arXiv preprint arXiv:2405.14365.
- Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model. arXiv preprint arXiv:2407.10167.
- Yuchen Yan
- Jin Jiang
- Yang Liu
- Yixin Cao
- Xin Xu
- Mengdi Zhang
- Xunliang Cai
- Jian Shao