Enhancing Mathematical Abilities of LLMs Through a Self-Critique Pipeline
Introduction to the Self-Critique Pipeline
Recent advances in large language models (LLMs) have spurred interest in applying them beyond standard linguistic tasks, in domains such as mathematics that demand complex reasoning and problem solving. Although LLMs are proficient at understanding and generating human language, they still struggle with tasks that require accurate mathematical reasoning. This paper introduces the Self-Critique pipeline, an approach designed to close this gap by improving both the language understanding and the mathematical problem-solving capabilities of an LLM without trading one for the other. At its core is a Math-Critique model, derived from the LLM itself, which evaluates the LLM's mathematical outputs. Its assessments then serve as the feedback signal for two successive fine-tuning stages, Rejective Fine-tuning (RFT) and Direct Preference Optimization (DPO), which progressively improve the model's accuracy in mathematical reasoning.
The Self-Critique Pipeline in Detail
The Self-Critique pipeline consists of two stages, Rejective Fine-tuning (RFT) and Direct Preference Optimization (DPO), both of which rely on the Math-Critique model's assessments:
- Rejective Fine-tuning (RFT) refines the LLM's responses using the Math-Critique evaluations. The model repeatedly generates candidate responses, and those that Math-Critique judges to be subpar are discarded. The aim is to keep the retained responses both diverse and high in quality for further fine-tuning.
- Direct Preference Optimization (DPO) builds on the RFT stage by learning directly from pairs of correct and incorrect responses, as identified by Math-Critique. This stage concentrates on the most challenging questions, those that the RFT stage did not adequately address.
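The data-selection logic behind the two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `generate` and `critique` callables stand in for the LLM and the Math-Critique model, and the names `Scored`, `sample_and_score`, `rft_filter`, `dpo_pairs`, the numeric score scale, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Scored:
    """One sampled answer with its critique score (hypothetical container)."""
    question: str
    answer: str
    score: float


def sample_and_score(
    questions: List[str],
    generate: Callable[[str], str],         # stand-in for the LLM
    critique: Callable[[str, str], float],  # stand-in for Math-Critique
    n_samples: int = 4,
) -> List[Scored]:
    """Sample several answers per question and score each one."""
    scored = []
    for q in questions:
        for _ in range(n_samples):
            a = generate(q)
            scored.append(Scored(q, a, critique(q, a)))
    return scored


def rft_filter(scored: List[Scored], threshold: float) -> List[Scored]:
    """RFT-style selection: discard answers scored below the threshold."""
    return [s for s in scored if s.score >= threshold]


def dpo_pairs(scored: List[Scored], threshold: float) -> List[Tuple[str, str, str]]:
    """DPO-style selection: one (question, chosen, rejected) triple per
    question that has both an accepted and a rejected answer."""
    by_question: Dict[str, List[Scored]] = {}
    for s in scored:
        by_question.setdefault(s.question, []).append(s)

    pairs = []
    for q, items in by_question.items():
        best = max(items, key=lambda s: s.score)
        worst = min(items, key=lambda s: s.score)
        # Keep the pair only if the best answer passes and the worst fails.
        if best.score >= threshold > worst.score:
            pairs.append((q, best.answer, worst.answer))
    return pairs
```

In this sketch, the output of `rft_filter` would feed a supervised fine-tuning step, while the triples from `dpo_pairs` would feed a preference-optimization step. Questions for which every sample passes or every sample fails yield no pair, which is a design choice of the sketch rather than something stated in the text above.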
Evaluation and Results
To validate the Self-Critique pipeline, the paper introduces the MathUserEval benchmark, which is designed to mirror real-world mathematical problem-solving needs more closely than traditional academic datasets. In the reported experiments, the enhanced LLM, based on ChatGLM3-32B, outperforms its counterparts, including models twice its size, on both mathematical reasoning and language understanding tasks.
The MathUserEval Benchmark
The MathUserEval benchmark assesses LLM performance on complex mathematical queries of the kind that arise in real-world use. Unlike conventional datasets that focus purely on academic mathematics, MathUserEval incorporates practical application scenarios. Responses on the benchmark are evaluated using both GPT-4-Turbo and the Math-Critique model, which makes for a more comprehensive test of practical mathematical reasoning.
Conclusion and Future Directions
The Self-Critique pipeline is a significant step toward LLMs that can improve their mathematical and linguistic capabilities using feedback derived from themselves. Using the Math-Critique model for internal evaluation allows for a more nuanced and effective fine-tuning approach. On the MathUserEval benchmark, the enhanced LLM shows a notable advance in tackling complex mathematical problems, which indicates the potential of this approach in real-world applications. Future research will likely explore extending the Self-Critique pipeline to other domains that require specialized reasoning, further broadening the utility of LLMs.