ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline (2404.02893v1)

Published 3 Apr 2024 in cs.CL

Abstract: LLMs have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets to enhance LLMs' mathematics are developed, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger. Related techniques have been deployed to ChatGLM (https://chatglm.cn), an online serving LLM. Related evaluation dataset and scripts are released at https://github.com/THUDM/ChatGLM-Math.


Enhancing Mathematical Abilities of LLMs Through a Self-Critique Pipeline

Introduction to the Self-Critique Pipeline

Recent advances in LLMs have drawn significant interest in applying them beyond standard linguistic tasks, in domains that require complex reasoning and problem-solving such as mathematics. Although LLMs are proficient at understanding and generating human language, they still struggle with applications that require accurate mathematical reasoning. The paper introduces the Self-Critique pipeline, designed to close this gap by improving mathematical problem-solving while maintaining, and even improving, language ability, rather than trading one for the other. At its core is a Math-Critique model, trained from the LLM itself, that evaluates the LLM's mathematical outputs and provides feedback signals. This feedback then drives two sequential training stages, Rejective Fine-tuning (RFT) and Direct Preference Optimization (DPO), both of which draw their training data from the LLM's own generations.

The Self-Critique Pipeline in Detail

The Self-Critique pipeline consists of two cornerstone stages, Rejective Fine-tuning (RFT) and Direct Preference Optimization (DPO), both leveraging the Math-Critique model's assessments:

Rejective Fine-tuning (RFT) filters the LLM's own responses using Math-Critique evaluations. The model generates responses over several iterations, and those that Math-Critique judges subpar are discarded. The surviving responses form the fine-tuning data, with the aim of keeping that data both diverse and high in quality.
Direct Preference Optimization (DPO) builds on the model produced by RFT and learns directly from pairs of correct and incorrect responses, as identified by Math-Critique. This stage concentrates on the most challenging questions, those the model still did not answer adequately after the RFT stage.
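
The two stages above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Critic` callable, the numeric score scale, the `threshold` and `min_gap` values, and all function names are hypothetical. The loss shown is the standard DPO objective for a single preference pair; the summary does not state whether the paper modifies it.

```python
import math
from typing import Callable, List, Optional, Tuple

# Hypothetical critic interface: (question, response) -> score, higher is better.
# The real Math-Critique model and its scoring scale are not specified here.
Critic = Callable[[str, str], float]


def rft_select(question: str, responses: List[str], critic: Critic,
               threshold: float = 8.0) -> List[str]:
    """Rejection step: keep distinct responses the critic scores >= threshold."""
    kept: List[str] = []
    seen = set()
    for response in responses:
        if response in seen:
            continue  # drop verbatim duplicates so the kept set stays diverse
        seen.add(response)
        if critic(question, response) >= threshold:
            kept.append(response)
    return kept


def dpo_pair(question: str, responses: List[str], critic: Critic,
             min_gap: float = 4.0) -> Optional[Tuple[str, str]]:
    """Build one (chosen, rejected) pair from the best and worst scored responses.

    Returns None when there are fewer than two responses or when the score gap
    is below min_gap (an assumed cutoff), i.e. no clear preference exists.
    """
    if len(responses) < 2:
        return None
    scored = sorted(((critic(question, r), r) for r in responses),
                    key=lambda pair: pair[0])
    low_score, rejected = scored[0]
    high_score, chosen = scored[-1]
    if high_score - low_score < min_gap:
        return None
    return chosen, rejected


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    computed as a numerically stable softplus of the negated margin.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_critic(question: str, response: str) -> float:
    """Stand-in critic for demonstration only: rewards responses containing '4'."""
    return 10.0 if "4" in response else 2.0


samples = ["The answer is 4.", "The answer is 5.", "The answer is 4."]
sft_data = rft_select("What is 2 + 2?", samples, toy_critic)
pair = dpo_pair("What is 2 + 2?", samples, toy_critic)
```

In this sketch the same critic drives both stages: RFT keeps only high-scoring responses for supervised fine-tuning, while DPO needs a clearly worse response as well, which is why questions the model still answers incorrectly are the ones that yield preference pairs.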

Evaluation and Results

To validate the Self-Critique pipeline, the paper introduces the MathUserEval benchmark, designed to mirror real-world mathematical problem-solving needs more closely than traditional academic datasets. Experiments are run on both academic datasets and MathUserEval. The resulting model, based on ChatGLM3-32B, shows significantly better mathematical problem-solving while its language ability also improves, and it outperforms LLMs that could be twice its size.

The MathUserEval Benchmark

The MathUserEval benchmark assesses LLM performance on challenging mathematical queries of the kind that arise in real-world use. Unlike conventional datasets focused purely on academic mathematics, MathUserEval incorporates practical application scenarios. Responses are scored by both GPT-4-Turbo and the Math-Critique model, giving a more comprehensive test of practical mathematical reasoning.
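
As a rough illustration of scoring with two judges, the sketch below averages per-question scores for each judge and measures how far the judges disagree. The judge labels, the score values, and both function names are hypothetical; the summary does not describe the benchmark's actual scoring scale or aggregation.

```python
from statistics import mean
from typing import Dict, List


def summarize_scores(scores_by_judge: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each judge's per-question scores into one benchmark-level number.

    Judges with no scores are skipped rather than reported as zero.
    """
    return {judge: mean(scores)
            for judge, scores in scores_by_judge.items() if scores}


def mean_abs_gap(scores_a: List[float], scores_b: List[float]) -> float:
    """Mean absolute difference between two judges on the same questions."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both judges must score the same non-empty question set")
    return mean(abs(a - b) for a, b in zip(scores_a, scores_b))


# Made-up scores for three questions, for demonstration only.
example_scores = {
    "gpt-4-turbo": [8.0, 6.0, 10.0],
    "math-critique": [7.0, 6.0, 9.0],
}
summary = summarize_scores(example_scores)
gap = mean_abs_gap(example_scores["gpt-4-turbo"], example_scores["math-critique"])
```

Reporting both averages alongside a disagreement measure makes it easier to see whether a trained critic tracks the stronger external judge closely enough to stand in for it.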

Conclusion and Future Directions

The Self-Critique pipeline lets an LLM improve its mathematical and linguistic capabilities together, using feedback derived from the model itself. Because the Math-Critique model supplies the evaluations, both fine-tuning stages can be built from the LLM's own generations. On the MathUserEval benchmark, the resulting model shows notable gains on complex mathematical problems, which suggests the approach is useful in real-world applications. A natural direction for future work is extending the Self-Critique pipeline to other domains that require specialized reasoning.


Authors

Yifan Xu
Xiao Liu
Xinghan Liu
Zhenyu Hou
Yueyan Li
Xiaohan Zhang
Zihan Wang
Aohan Zeng
Zhengxiao Du
Wenyi Zhao
Jie Tang
Yuxiao Dong

