DeepCritic Framework
- DeepCritic is a multi-stage framework that improves LLM math reasoning by enforcing detailed, stepwise critiques and multi-perspective verification.
- It uses a two-stage pipeline combining supervised fine-tuning and reinforcement learning to transition from shallow to deliberate critique.
- Empirical benchmarks show substantial gains in identifying the first erroneous step (F1) over standard LLM verifiers, supporting more scalable oversight of math reasoning.
The DeepCritic framework is a deliberate, multi-stage approach for enhancing the ability of LLMs to provide detailed, accurate, and actionable critiques of mathematical reasoning. DeepCritic is designed to overcome limitations of "shallow" critic models by enforcing stepwise deliberation, multi-perspective verification, and meta-reflection. Through a combination of supervised fine-tuning and reinforcement learning, DeepCritic achieves high judgment accuracy and enables scalable oversight in math reasoning tasks, establishing benchmarks for critique-driven LLM supervision (Yang et al., 1 May 2025).
1. Architecture and Core Principles
DeepCritic utilizes a two-stage pipeline atop an instruction-tuned LLM (Qwen2.5-Instruct). Its central objective is to transition from superficial math verification to deliberate, in-depth critique. At inference, DeepCritic processes a mathematical problem $x$ and a candidate solution $y = (y_1, \dots, y_n)$, generating for each reasoning step $y_i$:
- A chain-of-thought (CoT) critique
- A binary judgment (correct/incorrect)
- The index of the first incorrect step (or $-1$ if all steps are correct), as sketched in the example below
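A minimal sketch of this per-solution output in code, assuming a simple structured representation (the dataclass fields and the `to_verdict` helper are illustrative, not part of the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepCritique:
    step_index: int   # 1-based index of the reasoning step being critiqued
    critique: str     # chain-of-thought critique text for this step
    is_correct: bool  # binary judgment for this step

@dataclass
class SolutionVerdict:
    step_critiques: List[StepCritique]
    first_error_step: int  # index of the first incorrect step, or -1 if all correct

def to_verdict(step_critiques: List[StepCritique]) -> SolutionVerdict:
    """Aggregate per-step judgments into the solution-level verdict."""
    first_error = next(
        (c.step_index for c in step_critiques if not c.is_correct), -1
    )
    return SolutionVerdict(step_critiques=step_critiques, first_error_step=first_error)
```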
This transition enforces:
- Deliberation: Stepwise, detailed CoT critiques precede any judgment.
- Multi-perspective verification: Each step is validated via distinct reasoning paradigms.
- Meta-critiquing: The framework produces critiques of its own initial critiques.
- Scalability via LLM generation: LLMs are leveraged for critique data and supervision.
2. Supervised Fine-tuning of Critique Ability
In Stage 1, DeepCritic constructs a deliberate critique protocol using ~4.5K "seed" solution-level critiques. The protocol involves:
- Seed Generation: Qwen2.5-72B-Instruct (denoted $\pi_{\text{gen}}$) is employed to produce initial stepwise critiques, followed by in-depth meta-critiques for each solution, with step labels aligned to the human-labeled PRM800K dataset. For each step $y_i$ in solution $y$:
  - Initial critique: $c_i^{\text{init}} = \pi_{\text{gen}}(x, y, y_i)$
  - Deep critique: $c_i^{\text{deep}} = \pi_{\text{gen}}(x, y, y_i, c_i^{\text{init}})$, which re-verifies the step from a different perspective and critiques the initial critique itself
  - Final synthesis: the per-step critiques are merged into one long-form deliberate critique, $c = \pi_{\text{gen}}\big(x, y, \{c_i^{\text{init}}, c_i^{\text{deep}}\}_{i=1}^{n}\big)$
- Supervised Objective: The critic model $\pi_\theta$ (Qwen2.5-7B-Instruct) is trained to maximize the likelihood of the seed critiques:
  $$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y, c) \sim \mathcal{D}_{\text{seed}}}\big[\log \pi_\theta(c \mid x, y)\big]$$
This process promotes deliberate, multi-perspective critique capacity as an explicit protocol rather than shallow error flagging.
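A compact sketch of this seed-generation protocol, assuming a generic `generate` wrapper around the Qwen2.5-72B-Instruct generator; the prompt wording below is illustrative rather than the paper's exact templates:

```python
from typing import Callable, List

def build_seed_critique(
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> text
    problem: str,
    steps: List[str],
) -> str:
    """Produce one deliberate seed critique: initial -> deep -> synthesis."""
    solution = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    initial, deep = [], []
    for i, _ in enumerate(steps, start=1):
        init = generate(
            f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
            f"Critique step {i} in detail and judge whether it is correct."
        )
        initial.append(init)
        # Deep critique: re-verify the step from another angle and critique the
        # initial critique itself (meta-critique).
        deep.append(generate(
            f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
            f"Initial critique of step {i}:\n{init}\n\n"
            f"Re-verify step {i} from a different perspective and point out any "
            f"flaws in the initial critique."
        ))
    # Final synthesis: merge all critiques into one long-form deliberate critique.
    merged = "\n\n".join(
        f"[Step {i + 1}] initial: {a}\n[Step {i + 1}] deep: {b}"
        for i, (a, b) in enumerate(zip(initial, deep))
    )
    return generate(
        f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
        f"Per-step critiques:\n{merged}\n\n"
        f"Synthesize these into a single deliberate critique that states the "
        f"first incorrect step (or -1 if all steps are correct)."
    )
```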
3. Reinforcement Learning for Judgment Accuracy
Stage 2 applies reinforcement learning to amplify the critic’s judgment precision. The reward focuses on identifying the first erroneous step with maximal accuracy.
- Data Sources: Training utilizes either 40.7K instances from human-labeled PRM800K or 14.2K instances labeled automatically via Monte Carlo error detection, following the Math-Shepherd pipeline.
- Monte Carlo estimation repeatedly rolls out solution completions from each step-truncated prefix $y_{\le i}$ and computes the empirical success rate
  $$\hat{q}_i = \frac{1}{K}\sum_{k=1}^{K} \mathbb{1}\big[\text{rollout } k \text{ from } (x, y_{\le i}) \text{ reaches the reference answer}\big].$$
  The first erroneous step is then taken as $e = \min\{\, i : \hat{q}_i = 0 \,\}$, with $e = -1$ if no such step exists.
- Reward Assignment: the critic receives a binary reward, $R = 1$ if its predicted first-error index matches the label $e$ and $R = 0$ otherwise.
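The following sketch illustrates the Monte Carlo labeling and binary reward under the formulation above; `rollout_from_prefix` and `answers_match` are hypothetical helpers, and the paper's exact reward shaping may differ:

```python
from typing import Callable, List

def first_error_step(
    rollout_from_prefix: Callable[[str, List[str]], str],  # (problem, prefix steps) -> final answer
    answers_match: Callable[[str, str], bool],              # hypothetical answer checker
    problem: str,
    steps: List[str],
    reference_answer: str,
    k: int = 8,
) -> int:
    """Math-Shepherd-style hard estimation: a step is labeled erroneous if no
    completion sampled from its prefix reaches the reference answer."""
    for i in range(1, len(steps) + 1):
        q_i = sum(
            answers_match(rollout_from_prefix(problem, steps[:i]), reference_answer)
            for _ in range(k)
        ) / k
        if q_i == 0:
            return i   # first erroneous step (1-based)
    return -1          # all steps judged correct

def critique_reward(predicted_first_error: int, labeled_first_error: int) -> float:
    """Binary outcome reward: 1 if the critic's verdict matches the label."""
    return 1.0 if predicted_first_error == labeled_first_error else 0.0
```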
Optimization is performed with Group Relative Policy Optimization (GRPO), omitting explicit KL regularization.
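For context, a minimal sketch of the group-relative advantage GRPO computes from a group of sampled critiques, with no KL penalty term; this is the generic GRPO formulation rather than code from the paper:

```python
from typing import List

def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its group (no KL regularization term)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 sampled critiques for one problem, 3 of which match the label
print(grpo_advantages([1, 0, 0, 1, 0, 0, 1, 0]))
```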
4. Implementation Configurations
Models: The base critic is Qwen2.5-7B-Instruct; SFT uses 4.5K seeds; RL uses 40.7K PRM800K or 14.2K Numina-auto data.
Hyperparameters:
- SFT: LR = , batch = 64, epochs = 3, sequence length ≤16,384, warmup ratio = 0.1.
- RL: batch = 128, rollouts = 8, prompt len ≤2,048, response len ≤8,192, temperature = 1.0, top_p = 0.9, LR = , epochs = 2.
- Compute: Runs are feasible on standard multi-GPU clusters with response lengths truncated for throughput.
5. Empirical Benchmarking and Ablations
Performance is evaluated on MR-GSM8K, PRM800K Phase-2, and ProcessBench variants (GSM8K, MATH, OlympiadBench, Omni-Math), using an F1 score that combines error-step identification accuracy on incorrect solutions with "all correct" detection accuracy on correct solutions.
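A sketch of this ProcessBench-style F1, assuming it is the harmonic mean of accuracy on erroneous solutions and accuracy on correct solutions (the helper below is illustrative):

```python
from typing import List

def critique_f1(predictions: List[int], labels: List[int]) -> float:
    """F1 over critique verdicts: harmonic mean of accuracy on erroneous
    solutions (correct first-error index) and accuracy on correct solutions
    (predicting -1, i.e. 'all correct')."""
    err = [(p, l) for p, l in zip(predictions, labels) if l != -1]
    cor = [(p, l) for p, l in zip(predictions, labels) if l == -1]
    acc_err = sum(p == l for p, l in err) / len(err) if err else 0.0
    acc_cor = sum(p == l for p, l in cor) / len(cor) if cor else 0.0
    if acc_err + acc_cor == 0:
        return 0.0
    return 2 * acc_err * acc_cor / (acc_err + acc_cor)
```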
| Model | F1 (avg across 6 sets) |
|---|---|
| Qwen2.5-7B-Instruct | 34.1 |
| GPT-4o | 58.2 |
| DeepSeek-R1-Distill-Qwen-7B | 63.4 |
| DeepCritic-7B-SFT | 54.1 |
| DeepCritic-7B-RL-Numina | 63.5 |
| DeepCritic-7B-RL-PRM800K | 67.1 |
Ablation studies indicate:
- Supervised fine-tuning alone yields roughly +20 F1 over the base instruct model.
- RL with PRM800K adds +3.6 F1 over auto-annotation.
- Majority voting over 8 sampled critiques yields a further +3–6 F1 gain (see the sketch after this list).
- When used for refinement, solution generators whose outputs are filtered or guided by DeepCritic's feedback achieve improved problem-solving accuracy.
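A minimal majority-vote sketch, assuming votes are taken over the predicted first-error index; the paper's exact aggregation may differ:

```python
from collections import Counter
from typing import List

def majority_vote(first_error_preds: List[int]) -> int:
    """Aggregate N sampled critiques (e.g., N = 8) by voting on the predicted
    first-error index; -1 means 'all steps correct'."""
    return Counter(first_error_preds).most_common(1)[0][0]

# Example: 8 sampled critiques for one solution
print(majority_vote([3, 3, -1, 3, 5, 3, -1, 3]))  # -> 3
```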
6. Comparative Context and Limitations
Deliberate multi-stage stepwise critique and meta-critique—central to DeepCritic—surpass the capabilities of shallow LLM verifiers and establish new approaches for automated reasoning supervision. Comparison with Critic-CoT and CRITIC highlights that automatic distant supervision and external feedback mechanisms complement DeepCritic's deliberate protocol (Zheng et al., 29 Aug 2024, Gou et al., 2023). A distinctive insight is that critique-skill and task-solving ability are mutually reinforcing, consistent with monotonic positive correlation observed in related frameworks (Zheng et al., 29 Aug 2024).
Limitations include:
- Small seed dataset and computational cost for SFT generation.
- Response-length truncation and step-count constraints during RL, imposed by the compute budget.
- Noisier Monte Carlo auto-annotations compared to expert human-labeled data.
Future directions proposed:
- Scaling seed-critique generation on more diverse problems.
- Incorporating fine-grained reward signals (e.g., partial step credit).
- Extension to multi-step critique loops and broader domains (code, commonsense).
- Exploring advanced RL methods with KL regularization for stability.
A plausible implication is that deliberate critique frameworks like DeepCritic will generalize beyond math reasoning, enabling scalable, high-fidelity oversight across various reasoning and generative domains in LLMs.