- The paper’s main contribution is Critique Fine-Tuning (CFT), which trains models to critique noisy responses rather than imitate correct ones, improving their reasoning capabilities.
- On math benchmarks, models trained with CFT on only 50K samples achieve 4–10% higher accuracy than standard SFT.
- Ablation studies confirm CFT’s robustness across different datasets and critique sources, highlighting its practical value in STEM domains.
The paper "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" introduces a novel approach to training LLMs, named Critique Fine-Tuning (CFT), which focuses on teaching models to critique noisy responses rather than simply imitating correct ones. The authors posit that this method fosters deeper analysis and nuanced understanding, leading to improved reasoning capabilities.
To validate CFT, the authors created a 50K-sample dataset from WebInstruct, using GPT-4o to generate critiques for query-response pairs. Each sample has the format (input=[query; noisy response], output=critique). Experiments on six math benchmarks (including MATH and AIME24) with Qwen2.5, Qwen2.5-Math, and DeepSeek-Math as base models showed that CFT yields a consistent 4-10% improvement over SFT. Further experiments on the MetaMath and NuminaMath datasets yielded similar gains. The Qwen2.5-Math-7B-CFT model, trained on 50K samples, matches or outperforms AceMath and Qwen2.5-Math-Instruct on most benchmarks, even though the latter two were trained on over 2M samples. Ablation studies indicate that CFT is robust to the source of the noisy responses and to the teacher critique model.
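To make the data format concrete, the sketch below shows what one such (input, output) pair might look like; only the (input=[query; noisy response], output=critique) structure comes from the paper, while the field names and the example content are illustrative.

```python
# Hypothetical shape of a single CFT training sample. Only the
# (input=[query; noisy response], output=critique) structure comes from the paper;
# the field names and the example content are illustrative.
cft_sample = {
    "input": (
        "Question: What is the sum of the first 10 positive integers?\n"
        "Response: Using n(n+1)/2, the sum is 10 * 11 / 2 = 54."  # noisy response
    ),
    "output": (
        "The formula n(n+1)/2 is applied correctly, but the arithmetic is wrong: "
        "10 * 11 / 2 = 55, not 54. Conclusion: the response is incorrect."
    ),
}
```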
WebInstruct, the instruction dataset used to validate CFT, is sourced from the Internet and covers topics such as Mathematics (65%), Physics (8%), Chemistry (4%), Business (10%), and Humanities (4%). The authors curated several subsets from it:
- WebInstruct-SFT: A 50K subset sampled directly from the original WebInstruct dataset; its responses have an error ratio exceeding 50%.
- WebInstruct-verified: A 50K subset of WebInstruct samples where GPT-4o-1120 verified the correctness of the original answers.
- WebInstruct-GPT-4o: A 50K subset reusing questions from WebInstruct-SFT but replacing the original answers with those generated by GPT-4o-1120.
- WebInstruct-CFT: A 50K subset derived from WebInstruct-SFT in which GPT-4o-1120 critiques each original response; approximately 56% of the responses are deemed "correct" (a sketch of this annotation step follows the list).
- WebInstruct-CFT-Tiny: A 4K subset of WebInstruct-CFT designed for training a 32B model.
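A minimal sketch of that critique-annotation step is shown below, assuming the OpenAI Python client; the prompt wording and the helper name are illustrative, since the paper only states that the teacher model was asked to critique each (query, response) pair.

```python
# Minimal sketch of the critique-annotation step, assuming the OpenAI Python client.
# The prompt wording and helper name are illustrative; the paper only states that the
# teacher model was asked to critique each (query, noisy response) pair.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_critique(query: str, noisy_response: str,
                      model: str = "gpt-4o-2024-11-20") -> str:
    """Ask the teacher model to critique a candidate response to a query."""
    prompt = (
        f"Question:\n{query}\n\n"
        f"Candidate response:\n{noisy_response}\n\n"
        "Critique the response step by step, point out any errors, "
        "and end with a verdict: correct or incorrect."
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content  # becomes the training target c
```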
In addition to WebInstruct, the authors synthesized critiques for MetaMathQA and NuminaMath by randomly sampling 50K examples from each dataset and using GPT-4o to critique the original responses. The training objective is to maximize the likelihood P(c∣[x;y];θ), where c represents the critique, x the question, y the noisy response, and θ the parameters of the LLM.
$$\arg\max_{\theta} \; \log P(c \mid [x; y]; \theta)$$
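In implementation terms this is standard next-token cross-entropy restricted to the critique tokens, with the prompt [x; y] masked out of the loss. The sketch below uses Hugging Face Transformers; the separator formatting and the choice of base checkpoint are assumptions, not the authors' exact training code.

```python
# Sketch of the CFT objective: maximize log P(c | [x; y]; theta) by computing
# cross-entropy only over the critique tokens. The separator formatting and base
# checkpoint are assumptions, not the authors' exact training setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")

def cft_loss(query: str, noisy_response: str, critique: str) -> torch.Tensor:
    prompt = f"{query}\n{noisy_response}\n"                            # [x; y]
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    critique_ids = tokenizer(critique, return_tensors="pt",
                             add_special_tokens=False).input_ids       # c

    input_ids = torch.cat([prompt_ids, critique_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # mask prompt tokens out of the loss

    # model(...).loss is -log P(c | [x; y]; theta) averaged over critique tokens
    return model(input_ids=input_ids, labels=labels).loss
```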
The authors evaluated their method on mathematical reasoning benchmarks, including MATH, Minerva-Math, and GSM8K; on competition-level mathematics with AIME 2024, AMC 2023, and OlympiadBench; and on broader STEM reasoning with TheoremQA, MMLU-Pro, and GPQA. For training, they compared three SFT settings: training directly on the original noisy responses, training on responses verified by GPT-4o, and training on responses generated by GPT-4o. The CFT models were trained on the curated CFT datasets. Models were trained for one epoch on the full dataset, with MATH-500 serving as the validation set for selecting the best checkpoint. Hyperparameters were kept consistent across all experiments: a learning rate of 5e-6, a cosine-decay learning-rate schedule with a warm-up ratio of 0.1, and a global batch size of 512.
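These hyperparameters map directly onto standard trainer settings. Below is a hedged sketch using Hugging Face `TrainingArguments`; the output path, the per-device/accumulation split of the 512 global batch, and the bf16 flag are assumptions.

```python
# Hedged sketch of the reported hyperparameters as Hugging Face TrainingArguments.
# The output path, the 8-GPU x 8 x 8 split of the 512 global batch size, and bf16
# are assumptions, not details taken from the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-math-7b-cft",     # placeholder path
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,        # 8 GPUs x 8 x 8 = 512 global batch (assumed split)
    bf16=True,                            # assumed mixed-precision setting
)
```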
The results show that Qwen2.5-Math-7B is a strong base model, reaching 37.8% average accuracy across the benchmarks; with CFT, its average accuracy rises to 56.0%. CFT consistently outperformed all SFT baselines across base models: on DeepSeek-Math-7B it yields a 3.5% absolute improvement over SFT-GPT4o, on Qwen2.5-7B a 10.4% improvement over SFT-verified, and on Qwen2.5-Math-7B a 5.7% improvement over the GPT-4o SFT baseline. CFT also reaches these results with less training data and lifts individual benchmarks such as MATH (79.4%) and OlympiadBench (41.6%).
Compared to other models, Qwen2.5-Math-7B-CFT achieves the highest average performance (48.0%) among 7B-scale models while using only 50K training samples. The authors also compared Qwen2.5-32B-Instruct-CFT to Sky-T1-32B-Preview, showing that the CFT model reaches comparable performance with only 4K training samples, versus Sky-T1-32B-Preview's 17K.
The authors ablated the impact of different training datasets on model performance. When trained with SFT, both MetaMathQA and NuminaMath achieve better performance than WebInstruct (47.3% and 37.5% vs. 35.1% on average). However, when trained with CFT, WebInstruct achieves the best performance (56.0%), outperforming MetaMathQA and NuminaMath.
In an ablation of the solution source, the authors compared solutions generated by Qwen2.5-Math-7B itself versus reference solutions from the WebInstruct dataset. The results show that using reference solutions achieves slightly better performance (56.0% vs. 54.5% on average). In an ablation studying the teacher critique model, the authors compared the performance when using GPT-4o-mini and GPT-4o-1120 as critique models. Even with GPT-4o-mini, CFT significantly outperforms the SFT-verified baseline (51.5% vs. 40.4% on average).
The authors identified that approximately 20% of the critiques generated by GPT-4o-1120 on WebInstruct contained errors or inaccurate feedback. They also investigated self-critique mechanisms but found that these consistently underperform direct inference. In single-pass self-critique, the model solves the problem and critiques its own solution in one pass, generating a new solution if errors are detected. In two-stage self-critique, the model first generates a solution, then separately evaluates it; if issues are found, it iterates this process (up to 8 attempts).
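For illustration, the two-stage variant can be sketched as a simple generate-then-critique loop. The helper callables and the substring check on the verdict are assumptions; only the 8-attempt cap comes from the paper.

```python
# Sketch of the two-stage self-critique loop described above. The two callables stand
# in for separate inference calls to the same fine-tuned model, and the verdict check
# is a simplification; only the 8-attempt cap comes from the paper.
from typing import Callable

MAX_ATTEMPTS = 8

def two_stage_self_critique(
    problem: str,
    generate_solution: Callable[[str], str],       # call 1: propose a solution
    critique_solution: Callable[[str, str], str],  # call 2: critique it separately
) -> str:
    """Generate a solution, critique it separately, and regenerate while issues are found."""
    solution = generate_solution(problem)
    for _ in range(MAX_ATTEMPTS):
        verdict = critique_solution(problem, solution)
        if "incorrect" not in verdict.lower():     # crude acceptance check (assumption)
            break
        solution = generate_solution(problem)      # issues found: try again
    return solution
```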