Qwen-Math-7B-CFT: Critique Fine-Tuning for Math LLM

Updated 30 June 2025
  • Qwen-Math-7B-CFT is a math-specialized large language model that uses single-seed critique fine-tuning to enhance reasoning capabilities.
  • It leverages diverse critiques of candidate solutions to improve error detection and explanation in complex mathematical tasks.
  • The model achieves superior accuracy and efficiency over SFT and RL methods while using minimal compute resources.

Qwen-Math-7B-CFT is a math-specialized LLM that applies Critique Fine-Tuning (CFT) to Qwen2.5-Math-7B, yielding significant improvements in mathematical and logical reasoning with exceptionally high compute efficiency. CFT enables the model to surpass or rival reinforcement learning-based methods at a fraction of the computational cost, using critique data constructed from just a single representative problem. This approach leverages the latent reasoning potential inherited during pre-training and is robust across benchmarks, random seeds, and diverse solution/error types.

1. Foundations: Critique Fine-Tuning (CFT) on Qwen2.5-Math-7B

Critique Fine-Tuning (CFT) is a post-training paradigm in which the model learns from diverse, LLM-authored critiques of candidate solutions generated for a single seed problem. The process involves:

  • Seed problem selection: A math problem of moderate to high difficulty is chosen from a reputable benchmark such as DeepScaleR.
  • Candidate solution generation: Multiple open-source models (including variants of Qwen, MiMo, DeepSeek, and Phi-4) independently generate solutions, yielding roughly 100 diverse candidate solutions per seed.
  • Critique collection: For each candidate solution, a panel of high-performing teacher LLMs (e.g., Claude-3, GPT-4.1, GPT-4o, O3-Mini) produces detailed critiques, analyzing correctness, spotting errors, and, if needed, providing the correct solution with explanation.
  • Data filtering and formatting: Inaccurate or inconsistent critiques are removed, and the remaining are uniformly formatted, yielding ~600 high-quality (problem, solution)→critique training examples per seed.
  • Instruction fine-tuning: The Qwen2.5-Math-7B model is fine-tuned on this dataset (typically 5 GPU hours), learning to generate critiques when presented with a problem and candidate solution.

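A minimal sketch of this construction loop is shown below. It is illustrative only: `generate_solution`, `generate_critique`, and `keep_critique` are hypothetical callables standing in for the actual model-serving and filtering code, and the prompt wording, model lists, and counts are assumptions matching the description above rather than the exact recipe.

```python
from typing import Callable

# Illustrative critique prompt; the paper's exact template may differ.
CRITIQUE_PROMPT = (
    "Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
    "Critique this solution step by step: point out any errors, state whether the "
    "final answer is correct, and give a correct solution if it is not."
)

def build_cft_dataset(
    seed_problem: str,                              # one moderately difficult problem (e.g. from DeepScaleR)
    candidate_models: list[str],                    # e.g. Qwen, MiMo, DeepSeek, Phi-4 variants
    teacher_models: list[str],                      # e.g. Claude-3, GPT-4.1, GPT-4o, O3-Mini
    generate_solution: Callable[[str, str], str],   # (model, problem) -> solution text; hypothetical wrapper
    generate_critique: Callable[[str, str], str],   # (model, prompt) -> critique text; hypothetical wrapper
    keep_critique: Callable[[str, str], bool],      # drops inaccurate/ambiguous critiques; hypothetical filter
    solutions_per_model: int = 25,                  # ~100 candidate solutions in total
) -> list[dict]:
    """Collect diverse candidate solutions for one seed problem, critique each with
    every teacher model, filter, and return (problem, solution) -> critique records."""
    dataset = []
    for model in candidate_models:
        for _ in range(solutions_per_model):
            solution = generate_solution(model, seed_problem)
            prompt = CRITIQUE_PROMPT.format(problem=seed_problem, solution=solution)
            for teacher in teacher_models:
                critique = generate_critique(teacher, prompt)
                if keep_critique(critique, solution):
                    dataset.append(
                        {"problem": seed_problem, "solution": solution, "critique": critique}
                    )
    return dataset                                  # ~600 training examples after filtering
```
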
The fine-tuned model, referred to as Qwen-Math-7B-CFT, demonstrates enhanced error detection, generalization, and explanatory capabilities across diverse reasoning tasks.

2. Performance Improvements on Mathematical and Logical Benchmarks

Qwen-Math-7B-CFT achieves substantial and broad improvements, as quantified on six mathematical and three logical reasoning benchmarks:

Method       Math-500   Minerva   Olympiad   AIME24   AIME25   AMC23    AVG
Base             58.6      17.3       17.5     16.7     10.8     43.1   27.3
SFT (1 ex)       53.8      14.3       18.2     12.1      6.7     32.5   22.9
SFT (full)       58.6      24.6       27.6     10.0      7.1     45.3   28.9
RL (1 ex)        79.2      27.9       39.1     23.8     10.8     60.3   40.2
CFT (1 ex)       76.4      40.4       39.3     18.8     14.6     63.4   42.2
  • Math-500: accuracy rises from 58.6 to 76.4 over the base model; averaged across the six math benchmarks, the gain is +14.9 percentage points (27.3 → 42.2).
  • Logical reasoning (BBEH): +16% average improvement, with robust accuracy gains on Causal Understanding, DisambiguationQA, and Time Arithmetic subtasks.
  • CFT on a single problem matches or surpasses RL-based methods with only 5 GPU hours, compared to 120+ hours for RL.

3. Robustness, Generality, and Ablation Insights

Experiments demonstrate that one-shot CFT is:

  • Robust to seed selection: Any moderately difficult math problem works as the seed; seeds that elicit a diverse mix of candidate solutions and critiques give the strongest results.
  • Enhanced by solution diversity: Critiques covering a wider variety of solution strategies and errors enable the model to learn a generalized error-analysis strategy.
  • Effective across domains: CFT consistently improves both mathematical and logical reasoning, even when the seed is from an orthogonal domain.
  • Scalable with size: Larger models benefit more, and CFT remains effective across scales (4B, 7B, 14B models tested).
  • Superior to SFT: Even when compared to SFT on the full dataset, CFT with a single example achieves higher accuracy.
  • Comparable to RL: Matches RL with verifiable reward (RLVR) on both math and logic, but is faster and more stable.

4. Critique Data Construction and Training Protocol

The critique data used for CFT is carefully constructed for maximal generalization:

  • Candidate solutions include typical and atypical error types (computational, methodological, conceptual).
  • Teacher critiques follow a structured contract: analyze step correctness, identify logical/semantic flaws, state correctness in a standardized format, and present the correct solution when errors are found.
  • Data is filtered to remove invalid or ambiguous critiques.
  • Input–output format: For each training item,

(x, y) → c

where x = the problem, y = a candidate solution, and c = the critique.

The model is trained with full-parameter instruction tuning, using a learning rate of 5×10⁻⁶, a warmup ratio of 0.1, a batch size of 512, and a cosine learning-rate schedule.
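
A hedged sketch of this protocol follows: the records are flattened into instruction-tuning pairs and trained with the stated hyperparameters. The prompt wording, the use of Hugging Face TrainingArguments, the per-device batch split, and the epoch count are assumptions; only the learning rate, warmup ratio, effective batch size of 512, and cosine schedule come from the text.

```python
from transformers import TrainingArguments

def to_training_example(record: dict) -> dict:
    """Turn one (problem, solution, critique) record into an instruction-tuning pair
    where the target output c is the critique."""
    prompt = (
        f"Problem:\n{record['problem']}\n\n"
        f"Candidate solution:\n{record['solution']}\n\n"
        "Please critique this solution."
    )
    return {"prompt": prompt, "response": record["critique"]}

training_args = TrainingArguments(
    output_dir="qwen-math-7b-cft",
    learning_rate=5e-6,                # full-parameter tuning, as stated above
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,     # illustrative split of the effective batch size of 512
    gradient_accumulation_steps=8,     # 8 * 8 * 8 GPUs = 512; the GPU count is an assumption
    num_train_epochs=1,                # epoch count is not specified in the text
    bf16=True,
)
```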

5. Comparison to Reinforcement and Supervised Fine-Tuning

  • Against SFT: One-shot CFT is consistently superior even to SFT on full datasets of 40k+ examples.
  • Against RL: CFT matches or outperforms RLVR, with vastly reduced computation, greater overall stability, and without the reward attribution instability or credit assignment issues common in RL.
  • Error generalization: Unlike SFT and RL, CFT-trained models acquire reasoning strategies that transfer well between the math and logic domains and are effective at detecting and explaining varied error types.

6. Technical Impact and Broader Significance

Qwen-Math-7B-CFT demonstrates that:

  • One-shot critique fine-tuning is a highly practical, scalable mechanism for augmenting the reasoning skills of strong pre-trained math LLMs.
  • Compute and data efficiency are substantially increased: One seed problem, ~600 critiques, and 5 GPU hours deliver up to 15% accuracy gains across tasks.
  • The effectiveness is robust to model architecture, data domain, and the specific details of the seed problem or critique generators.
  • The methodology is extensible to other domains (logic, code), provided diverse candidate solutions and high-quality critiques are available.

7. Example: Impact on Error Detection and Solution Quality

After CFT, Qwen-Math-7B-CFT demonstrates new capabilities, such as effective application of algorithms (e.g., the Extended Euclidean Algorithm for modular arithmetic), with step-by-step checks and error correction. For example, it solves equations like 14u ≡ 46 (mod 100) with accurate modular-inversion reasoning and explicit intermediate verification.
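For reference, the arithmetic behind this example can be reproduced with a short extended-Euclidean solver; the function names below are illustrative, not the model's own output.

```python
def extended_gcd(a: int, b: int) -> tuple[int, int, int]:
    """Return (g, x, y) such that a*x + b*y = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def solve_linear_congruence(a: int, c: int, m: int) -> list[int]:
    """Solve a*u ≡ c (mod m); returns all solutions modulo m (empty list if none)."""
    g, x, _ = extended_gcd(a, m)
    if c % g != 0:
        return []                       # solvable only if gcd(a, m) divides c
    u0 = (x * (c // g)) % (m // g)      # one solution of (a/g)*u ≡ c/g (mod m/g)
    return [u0 + k * (m // g) for k in range(g)]

solutions = solve_linear_congruence(14, 46, 100)
print(solutions)                        # [39, 89]
assert all(14 * u % 100 == 46 for u in solutions)
```
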

Solutions generated by the model after CFT exhibit:

  • More explicit reasoning,
  • Clear recognition and explanation of errors,
  • Precise correction, and
  • Step delineation in line with mathematical best practices.

Summary Table: Critique Fine-Tuning vs. Baselines for Qwen2.5-Math-7B

Method        GPU Hours   Math Accuracy (%)   Logic Accuracy (%)   Generalization   Stability
SFT (1 ex)    5           22.9                11.4                 Poor             Stable
SFT (full)    120+        28.9                15–20                Poor–Fair        Stable
RLVR (1 ex)   120+        40.2                23.9                 Good             Unstable
CFT (1 ex)    5           42.2                24.7                 Excellent        Stable

Qwen-Math-7B-CFT thus represents a new paradigm for efficiently unlocking and expanding the reasoning abilities of LLMs in mathematics, combining critique-based data, minimal compute resources, and generalized, transferable improvements across domains.