LEMMA: Learning from Errors for MatheMatical Advancement in LLMs (2503.17439v2)

Published 21 Mar 2025 in cs.LG and cs.AI

Abstract: LLMs have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model's reflective ability. Though some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs' reasoning ability by Learning from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are either from fixing the errors or generating a fresh start. Through a model-aware smooth reflection connection, the erroneous solution is transferred to the correct one. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.

Here is a summary of the paper "LEMMA: Learning from Errors for MatheMatical Advancement in LLMs" (Pan et al., 21 Mar 2025):

Rationale and Problem Solved

  • Problem: LLMs can solve math problems, but they often make mistakes and aren't good at recognizing or fixing their own errors. Existing methods mostly train LLMs on correct solutions, ignoring the valuable lessons that can be learned from mistakes.
  • Goal: The LEMMA framework aims to improve LLMs' mathematical reasoning by explicitly teaching them to identify and correct errors during problem solving. It trains models to develop a "reflective" ability to fix their own mistakes at inference time, without relying on an external critique model.

Data Used

  • Source Data: The research used standard math problem datasets like MATH and GSM8K.
  • Generated Data: The core of LEMMA involves creating a new training dataset. This dataset consists of examples where:

    1. An LLM generates a solution with a mistake.
    2. A more capable "teacher" model (like GPT-4o) identifies the mistake.
    3. The teacher model provides a correction, either by fixing the specific error and continuing ("Fix & Continue") or by starting the solution over correctly ("Fresh Restart").
    4. A "reflection phrase" connects the incorrect part to the corrected part, explaining the error.
  • Size: The base LEMMA dataset contains around 89,000 such error-correction examples. A larger version incorporating data from the MetaMath project contains about 404,000 examples.
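
To make this pipeline concrete, here is a minimal sketch of how one such training example could be assembled. The function name, field names, and the reflection wording are illustrative assumptions, not the paper's exact prompts or data format.

```python
# Illustrative sketch of assembling one LEMMA-style training example.
# All names and templates here are hypothetical, not the paper's exact format.

def build_lemma_example(question: str,
                        erroneous_solution: str,
                        reflection: str,
                        correction: str) -> dict:
    """Splice an erroneous solution into a corrected one via a reflection
    phrase. `correction` either continues from the fixed step
    ("Fix & Continue") or restarts the whole solution ("Fresh Restart")."""
    return {
        "prompt": question,
        # The target the model learns to generate: make a mistake,
        # notice it, and recover, all within a single completion.
        "completion": f"{erroneous_solution}\n{reflection}\n{correction}",
    }

example = build_lemma_example(
    question="Tom has 3 boxes of 12 pencils and gives away 10. How many are left?",
    erroneous_solution="3 boxes * 12 pencils = 24 pencils. 24 - 10 = 14.",  # calculation error
    reflection="Wait, 3 * 12 is 36, not 24. Let me redo this from that step.",
    correction="3 * 12 = 36 pencils. 36 - 10 = 26. The answer is 26.",
)
```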

Model Architecture

  • LEMMA is not a new model architecture itself. Instead, it's a fine-tuning technique.
  • It takes existing pre-trained LLMs (the paper tested LLaMA3-8B, DeepSeekMath-7B, Mistral-7B, Qwen2-Math-7B) and further trains them on the specially constructed error-correction dataset.
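
Below is a minimal supervised fine-tuning sketch on such a dataset, assuming a recent version of Hugging Face trl. The checkpoint name, file path, and hyperparameters are placeholders, not the paper's actual training setup.

```python
# Minimal SFT sketch; assumes examples saved as JSONL with "prompt"/"completion"
# fields, as built above. Checkpoint and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# One of the base models the paper fine-tunes; any causal LM works here.
model_name = "meta-llama/Meta-Llama-3-8B"

dataset = load_dataset("json", data_files="lemma_train.jsonl", split="train")
# Concatenate prompt and completion into the single "text" field SFTTrainer
# reads by default.
dataset = dataset.map(lambda ex: {"text": ex["prompt"] + "\n" + ex["completion"]})

trainer = SFTTrainer(
    model=model_name,  # SFTTrainer accepts a checkpoint name directly
    args=SFTConfig(
        output_dir="lemma-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
)
trainer.train()
```

Because the error, the reflection phrase, and the correction all sit in one target sequence, the fine-tuned model learns to produce the full error-reflection-correction trajectory in a single generation pass, which is what enables self-correction without an external critic.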

Performance on Benchmarks

  • Significant Improvements: Models fine-tuned with LEMMA showed substantial accuracy improvements on math benchmarks like MATH and GSM8K compared to models trained with standard methods or other self-correction techniques.
  • Better Generalization: LEMMA-trained models performed well not only on the datasets they were trained on but also on new, out-of-distribution math problem datasets (e.g., ASDiv, College Math).
  • Enhanced Reflection: The models showed improved abilities on tasks specifically designed to test error correction and follow-up reasoning (MathChat benchmark).
  • Error Reduction: LEMMA successfully reduced the frequency of various types of errors (calculation errors, misunderstanding the question, etc.).

Implications and Possible Applications

  • More Reliable AI Math Solvers: By learning to self-correct, LLMs can become more trustworthy and accurate when solving mathematical problems.
  • Improved AI Tutors: AI systems designed for education could use this technique to better guide students, potentially even explaining common mistakes.
  • Assistants for STEM: Engineers, scientists, and mathematicians could benefit from AI assistants that are less prone to errors in complex calculations and reasoning.
  • Autonomous Reasoning: This method pushes LLMs towards more autonomous reasoning, where they can identify and recover from their own flaws during complex tasks without external intervention.

In conclusion, LEMMA offers a practical way to make LLMs better at math by teaching them to learn directly from their errors, leading to improved accuracy, reliability, and self-correction capabilities.

Authors (10)
  1. Zhuoshi Pan
  2. Yu Li
  3. Honglin Lin
  4. Qizhi Pei
  5. Zinan Tang
  6. Wei Wu
  7. Chenlin Ming
  8. H. Vicky Zhao
  9. Conghui He
  10. Lijun Wu