Learning From Mistakes Makes LLM Better Reasoner (2310.20689v4)

Published 31 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs recently exhibited remarkable reasoning capabilities in solving math problems. To further improve their reasoning capabilities, this work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process. Consider a human student who failed to solve a math problem; he will learn from the mistake he has made and how to correct it. Mimicking this error-driven learning process, LEMA incorporates mistake-correction data pairs during fine-tuning LLMs. Specifically, we first collect inaccurate reasoning paths from various LLMs, and then employ GPT-4 as a ''corrector'' to identify the mistake step, explain the reason for the mistake, correct the mistake and generate the final answer. In addition, we apply a correction-centric evolution strategy that effectively expands the question set for generating correction data. Experiments across various LLMs and reasoning tasks show that LEMA effectively improves CoT-alone fine-tuning. Our further ablations shed light on the non-homogeneous effectiveness between CoT data and correction data. These results suggest a significant potential for LLMs to improve through learning from their mistakes. Our code, models and prompts are publicly available at https://github.com/microsoft/LEMA.

Analysis of Learning From Mistakes in LLMs

The paper "Learning From Mistakes Makes LLM Better Reasoner" presents an innovative approach to improving the reasoning capabilities of LLMs in solving mathematical problems by mimicking a fundamental aspect of human learning: learning from errors. The authors introduce a fine-tuning protocol called LEarning from MistAkes (LEMA), where LLMs are trained on mistake-correction data pairs generated by GPT-4, enhancing the models' capacity for chain-of-thought (CoT) reasoning.

Methodology and Experiments

The core innovation of LEMA is the development of mistake-correction data, an auxiliary dataset complementing existing CoT data traditionally used for training. The process involves two stages:

  1. Correction Data Generation: Inaccurate reasoning paths are collected from the outputs of various LLMs, and GPT-4 is then prompted to perform three tasks for each error: identify the mistake step, explain why it is wrong, and produce a corrected solution. The resulting corrections are filtered so that only those reaching the correct final answer are retained (a minimal sketch of this pipeline follows the list).
  2. Fine-Tuning Framework: The LLMs are fine-tuned on a combination of CoT data and the newly generated correction data. The experimental evaluation spans several backbone LLMs, including LLaMA-2, WizardMath, and MetaMath, on mathematical reasoning benchmarks such as GSM8K and MATH.
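
To make the two-stage process concrete, here is a minimal Python sketch of the correction-data generation and filtering step. The prompt template, the `corrector` callable, and the `extract_final_answer` helper are illustrative assumptions, not the authors' exact implementation; the actual prompts are available in the linked repository.

```python
# Sketch of LEMA's correction-data pipeline (illustrative, not the
# authors' exact code). The corrector is any text-in/text-out LLM
# call; the paper uses GPT-4.
from typing import Callable, Optional

CORRECTOR_TEMPLATE = """Question: {question}

Incorrect reasoning path:
{wrong_path}

1. Identify the first incorrect step.
2. Explain why that step is wrong.
3. Correct the step and finish the solution, ending with "Final answer: ...".
"""

def extract_final_answer(text: str) -> Optional[str]:
    """Pull the value after 'Final answer:'; None if absent."""
    marker = "Final answer:"
    if marker not in text:
        return None
    return text.split(marker)[-1].strip().rstrip(".")

def build_correction_data(samples, corrector: Callable[[str], str]):
    """samples: iterable of dicts with 'question', 'wrong_path', 'gold_answer'.
    Keeps only corrections whose final answer matches the gold answer,
    mirroring the paper's filtering step."""
    kept = []
    for s in samples:
        prompt = CORRECTOR_TEMPLATE.format(
            question=s["question"], wrong_path=s["wrong_path"]
        )
        correction = corrector(prompt)
        if extract_final_answer(correction) == s["gold_answer"]:
            # Fine-tuning pair: input is the question plus the wrong path;
            # target is the mistake identification, explanation, and
            # corrected solution.
            kept.append({"input": prompt, "target": correction})
    return kept

# Stage 2 then fine-tunes with the standard next-token loss on the union
# of ordinary CoT pairs and these correction pairs, e.g.:
# train_set = cot_pairs + build_correction_data(wrong_samples, corrector)
```
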

Results indicate that LEMA consistently outperforms CoT-only fine-tuning across all tested models, including the specialized math models, with notable gains in pass@1 accuracy: 85.4% on GSM8K and 27.1% on MATH, exceeding prior state-of-the-art (SOTA) results among open-source models that do not rely on code execution. The authors attribute these improvements to the distinct information carried by mistake-correction data, which appears to offer a qualitatively different learning signal than CoT data alone.
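
For reference, pass@1 under greedy decoding reduces to exact-match accuracy over one generated solution per problem. A minimal sketch, assuming final answers have already been extracted from model outputs as strings (real GSM8K/MATH evaluation involves more careful answer normalization):

```python
def pass_at_1(predicted_answers, gold_answers):
    """Fraction of problems whose single predicted answer matches the
    gold answer exactly. Both arguments are lists of answer strings."""
    assert len(predicted_answers) == len(gold_answers)
    correct = sum(p == g for p, g in zip(predicted_answers, gold_answers))
    return correct / len(gold_answers)

# e.g. pass_at_1(["12", "7", "3"], ["12", "8", "3"]) -> 0.666...
```
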

Implications and Future Developments

The findings have significant implications for the design and augmentation of LLMs, suggesting that a mistake-driven learning framework can substantially enhance reasoning, much as error-driven techniques do in human education. The method also reinforces the role of iterative refinement and feedback in building more robust AI systems, particularly for tasks that require multi-step logical deduction.

The implications of this research extend beyond mathematical reasoning and suggest potential applications in other domains where structured reasoning is paramount. Future research might explore less computationally intensive methods than GPT-4 for generating corrections, which would democratize access to this technique and scale its application across different domains or model configurations.

Furthermore, the paper highlights that larger models benefit disproportionately from mistake-driven learning, suggesting a potential research avenue into why this disparity exists and how smaller models can be better adapted to learn from such augmented datasets.

In conclusion, this paper presents a credible advance in leveraging mistake-correction data to strengthen the reasoning capabilities of LLMs, showcasing the value of integrating human-like learning paradigms into AI development. Such advances point toward more autonomous AI systems capable of reliable, context-rich decision-making.

Authors (6)
  1. Shengnan An (12 papers)
  2. Zexiong Ma (7 papers)
  3. Zeqi Lin (25 papers)
  4. Nanning Zheng (146 papers)
  5. Jian-Guang Lou (69 papers)
  6. Weizhu Chen (128 papers)
Citations (58)