- The paper’s main contribution is Critique Fine-Tuning (CFT), which trains models to critique noisy responses rather than imitate correct ones, improving their reasoning capabilities.
- On math benchmarks, models trained with CFT on only 50K samples achieve 4–10% higher accuracy than standard SFT.
- Ablation studies confirm CFT’s robustness across different datasets and critique sources, highlighting its practical value in STEM domains.
The paper "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" introduces a novel approach to training LLMs, named Critique Fine-Tuning (CFT), which focuses on teaching models to critique noisy responses rather than simply imitating correct ones. The authors posit that this method fosters deeper analysis and nuanced understanding, leading to improved reasoning capabilities.
To validate CFT, the authors created a 50K-sample dataset from WebInstruct, using GPT-4o to generate critiques for query-response pairs. Each sample has the format (input=[query; noisy response], output=critique). Experiments on six math benchmarks (including MATH and AIME24) with Qwen2.5, Qwen2.5-Math, and DeepSeek-Math as base models showed that CFT yields a consistent 4-10% improvement over SFT. Further experiments on the MetaMath and NuminaMath datasets yielded similar gains. The Qwen2.5-Math-7B-CFT model, trained on 50K samples, matches or outperforms AceMath and Qwen2.5-Math-Instruct on most benchmarks, even though the latter two were trained on over 2M samples. Ablation studies indicate that CFT is robust to the source of the noisy responses and to the teacher critique model.
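To make the data format concrete, the sketch below shows what one such (input, output) pair might look like; only the (input=[query; noisy response], output=critique) structure comes from the paper, while the field names and the example content are illustrative.

```python
# Hypothetical shape of a single CFT training sample. Only the
# (input=[query; noisy response], output=critique) structure comes from the paper;
# the field names and the example content are illustrative.
cft_sample = {
    "input": (
        "Question: What is the sum of the first 10 positive integers?\n"
        "Response: Using n(n+1)/2, the sum is 10 * 11 / 2 = 54."  # noisy response
    ),
    "output": (
        "The formula n(n+1)/2 is applied correctly, but the arithmetic is wrong: "
        "10 * 11 / 2 = 55, not 54. Conclusion: the response is incorrect."
    ),
}
```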
WebInstruct, the instruction dataset used to validate CFT, is sourced from the Internet and covers topics such as Mathematics (65%), Physics (8%), Chemistry (4%), Business (10%), and Humanities (4%). The authors curated several subsets from it:
- WebInstruct-SFT: A 50K subset sampled directly from the original WebInstruct dataset; its responses have an error ratio exceeding 50%.
- WebInstruct-verified: A 50K subset of WebInstruct samples where GPT-4o-1120 verified the correctness of the original answers.
- WebInstruct-GPT-4o: A 50K subset reusing questions from WebInstruct-SFT but replacing the original answers with those generated by GPT-4o-1120.
- WebInstruct-CFT: A 50K subset derived from WebInstruct-SFT in which GPT-4o-1120 critiques each original response; approximately 56% of the responses are deemed "correct" (a sketch of this annotation step follows the list).
- WebInstruct-CFT-Tiny: A 4K subset of WebInstruct-CFT designed for training a 32B model.
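A minimal sketch of that critique-annotation step is shown below, assuming the OpenAI Python client; the prompt wording and the helper name are illustrative, since the paper only states that the teacher model was asked to critique each (query, response) pair.

```python
# Minimal sketch of the critique-annotation step, assuming the OpenAI Python client.
# The prompt wording and helper name are illustrative; the paper only states that the
# teacher model was asked to critique each (query, noisy response) pair.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_critique(query: str, noisy_response: str,
                      model: str = "gpt-4o-2024-11-20") -> str:
    """Ask the teacher model to critique a candidate response to a query."""
    prompt = (
        f"Question:\n{query}\n\n"
        f"Candidate response:\n{noisy_response}\n\n"
        "Critique the response step by step, point out any errors, "
        "and end with a verdict: correct or incorrect."
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content  # becomes the training target c
```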
In addition to WebInstruct, the authors synthesized critiques for MetaMathQA and NuminaMath by randomly sampling 50K examples from each dataset and using GPT-4o to critique the original responses. The training objective is to maximize the likelihood P(c∣[x;y];θ), where c represents the critique, x the question, y the noisy response, and θ the parameters of the LLM.
$$\arg\max_{\theta} \; \log P(c \mid [x; y]; \theta)$$
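In implementation terms this is standard next-token cross-entropy restricted to the critique tokens, with the prompt [x; y] masked out of the loss. The sketch below uses Hugging Face Transformers; the separator formatting and the choice of base checkpoint are assumptions, not the authors' exact training code.

```python
# Sketch of the CFT objective: maximize log P(c | [x; y]; theta) by computing
# cross-entropy only over the critique tokens. The separator formatting and base
# checkpoint are assumptions, not the authors' exact training setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")

def cft_loss(query: str, noisy_response: str, critique: str) -> torch.Tensor:
    prompt = f"{query}\n{noisy_response}\n"                            # [x; y]
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    critique_ids = tokenizer(critique, return_tensors="pt",
                             add_special_tokens=False).input_ids       # c

    input_ids = torch.cat([prompt_ids, critique_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # mask prompt tokens out of the loss

    # model(...).loss is -log P(c | [x; y]; theta) averaged over critique tokens
    return model(input_ids=input_ids, labels=labels).loss
```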
The authors evaluated their method on mathematical reasoning benchmarks, including MATH, Minerva-Math, and GSM8K; on competition-level mathematics with AIME 2024, AMC 2023, and OlympiadBench; and on broader STEM reasoning with TheoremQA, MMLU-Pro, and GPQA. For training, they compared three SFT settings: training directly on the original noisy responses, training on responses verified by GPT-4o, and training on responses generated by GPT-4o. The CFT models were trained on the curated CFT datasets. Models were trained for one epoch on the full dataset, with MATH-500 serving as the validation set for selecting the best checkpoint. Hyperparameters were kept consistent across all experiments: a learning rate of 5e-6, a cosine-decay learning-rate schedule with a warm-up ratio of 0.1, and a global batch size of 512.
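These hyperparameters map directly onto standard trainer settings. Below is a hedged sketch using Hugging Face `TrainingArguments`; the output path, the per-device/accumulation split of the 512 global batch, and the bf16 flag are assumptions.

```python
# Hedged sketch of the reported hyperparameters as Hugging Face TrainingArguments.
# The output path, the 8-GPU x 8 x 8 split of the 512 global batch size, and bf16
# are assumptions, not details taken from the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-math-7b-cft",     # placeholder path
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,        # 8 GPUs x 8 x 8 = 512 global batch (assumed split)
    bf16=True,                            # assumed mixed-precision setting
)
```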
The results show that Qwen2.5-Math-7B is a strong base model, reaching 37.8% average accuracy across the benchmarks; with CFT, its average accuracy rises to 56.0%. CFT consistently outperformed all SFT baselines across base models: on DeepSeek-Math-7B it yields a 3.5% absolute improvement over SFT-GPT4o, on Qwen2.5-7B a 10.4% improvement over SFT-verified, and on Qwen2.5-Math-7B a 5.7% improvement over the GPT-4o SFT baseline. CFT also reaches these results with less training data and lifts individual benchmarks such as MATH (79.4%) and OlympiadBench (41.6%).
Compared to other models, Qwen2.5-Math-7B-CFT achieves the highest average performance (48.0%) among 7B-scale models while using only 50K training samples. The authors also compared Qwen2.5-32B-Instruct-CFT to Sky-T1-32B-Preview, showing that the CFT model reaches comparable performance with only 4K training samples, versus Sky-T1-32B-Preview's 17K.
The authors ablated the impact of different training datasets on model performance. When trained with SFT, both MetaMathQA and NuminaMath achieve better performance than WebInstruct (47.3% and 37.5% vs. 35.1% on average). However, when trained with CFT, WebInstruct achieves the best performance (56.0%), outperforming MetaMathQA and NuminaMath.
In an ablation of the solution source, the authors compared solutions generated by Qwen2.5-Math-7B itself versus reference solutions from the WebInstruct dataset. The results show that using reference solutions achieves slightly better performance (56.0% vs. 54.5% on average). In an ablation studying the teacher critique model, the authors compared the performance when using GPT-4o-mini and GPT-4o-1120 as critique models. Even with GPT-4o-mini, CFT significantly outperforms the SFT-verified baseline (51.5% vs. 40.4% on average).
The authors identified that approximately 20% of the critiques generated by GPT-4o-1120 on WebInstruct contained errors or inaccurate feedback. They also investigated self-critique mechanisms but found that these consistently underperform direct inference. In single-pass self-critique, the model solves the problem and critiques its own solution in one pass, generating a new solution if errors are detected. In two-stage self-critique, the model first generates a solution, then separately evaluates it; if issues are found, it iterates this process (up to 8 attempts).
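For illustration, the two-stage variant can be sketched as a simple generate-then-critique loop. The helper callables and the substring check on the verdict are assumptions; only the 8-attempt cap comes from the paper.

```python
# Sketch of the two-stage self-critique loop described above. The two callables stand
# in for separate inference calls to the same fine-tuned model, and the verdict check
# is a simplification; only the 8-attempt cap comes from the paper.
from typing import Callable

MAX_ATTEMPTS = 8

def two_stage_self_critique(
    problem: str,
    generate_solution: Callable[[str], str],       # call 1: propose a solution
    critique_solution: Callable[[str, str], str],  # call 2: critique it separately
) -> str:
    """Generate a solution, critique it separately, and regenerate while issues are found."""
    solution = generate_solution(problem)
    for _ in range(MAX_ATTEMPTS):
        verdict = critique_solution(problem, solution)
        if "incorrect" not in verdict.lower():     # crude acceptance check (assumption)
            break
        solution = generate_solution(problem)      # issues found: try again
    return solution
```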