DeepCritic: Neural Critique & Calibration

Updated 17 June 2026

DeepCritic is a dual-domain framework for automated critique in language and vision, leveraging dedicated critic networks to evaluate and guide model outputs.
It employs a two-stage method in LLMs, combining supervised fine-tuning with reinforcement learning to generate precise, step-level critiques.
In image classification, DeepCritic formalizes training as a generator–critic game, boosting active learning and significantly reducing calibration errors.

DeepCritic refers to two lines of research that advance automated critique and supervision with neural networks, one in the language domain and one in the vision domain. In the context of mathematical reasoning with LLMs, DeepCritic denotes a two-stage framework for step-level deliberate critique. In image classification, DeepCritic (also referred to as CrtCl, for “Critic Loss for Image Classification”) formalizes classifier training as a generator–critic game, enhancing model calibration and active learning. Both approaches rely on a dedicated critic network (or LLM subsystem) trained to make reliable step-level or sample-level judgments, and both report state-of-the-art improvements in their respective domains (Yang et al., 1 May 2025; Rappazzo et al., 2024).

1. Deliberate Critique in LLMs: System Architecture

DeepCritic for mathematical reasoning in LLMs is a two-stage pipeline designed to cultivate a critic model capable of deliberate, multi-perspective stepwise critique. The pipeline consists of:

Stage 1: Supervised Fine-Tuning (SFT). A base LLM is fine-tuned on a seed corpus of in-depth, step-by-step critiques, each containing restatements of reasoning goals, multi-perspective logical checks, meta-critiques, and explicit correctness judgments.
Stage 2: Reinforcement Learning (RL). The SFT model is optimized using reward signals from either human-labeled (PRM800K) or automatically annotated (Monte Carlo correctness estimation) datasets, via Group Relative Policy Optimization (GRPO).

At inference, DeepCritic evaluates each solution step in sequence, producing a reflective critique per step. It stops at the first step judged incorrect and reports that step's index, or reports "-1" if all steps are deemed correct (Yang et al., 1 May 2025).
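
The sequential procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `StepCritic` callable, the `toy_critic` stand-in, and the zero-based step indexing are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical critic interface (an assumption for this sketch): given the
# problem, the steps already accepted, and the step under review, return
# (critique_text, is_correct).
StepCritic = Callable[[str, List[str], str], Tuple[str, bool]]


def first_error_index(
    problem: str, steps: List[str], critic: StepCritic
) -> Tuple[int, List[str]]:
    """Critique steps in order; return (first wrong index or -1, critiques)."""
    critiques: List[str] = []
    for i, step in enumerate(steps):
        text, ok = critic(problem, steps[:i], step)
        critiques.append(text)
        if not ok:
            # Stop at the first step judged incorrect (zero-based index here).
            return i, critiques
    return -1, critiques  # every step was judged correct


def toy_critic(problem: str, previous: List[str], step: str) -> Tuple[str, bool]:
    # Stand-in for an LLM critic: flags any step containing "2 + 2 = 5".
    ok = "2 + 2 = 5" not in step
    return f"Checked: {step!r}", ok


steps = ["Let x = 2 + 2.", "Then 2 + 2 = 5, so x = 5.", "Hence x + 1 = 6."]
idx, notes = first_error_index("Compute x + 1.", steps, toy_critic)
```

With the toy critic, the loop halts at the second step, so the third step is never critiqued.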

2. Supervised Fine-Tuning and Critique Generation (LLM Domain)

Seed data for SFT is generated through a multi-prompting and selection process leveraging the Qwen2.5-72B-Instruct model:

Data Construction:
- Draw approximately 4,500 problem–solution pairs from PRM800K.
- For each step, generate an initial critique, followed by up to 16 in-depth critiques assessing either the step from an alternate perspective or the initial critique itself.
- Select candidate critiques whose correctness judgment aligns with the ground truth label.
- Merge critiques into a comprehensive, step-wise format.
Critique Structure:
- Each step critique consists of goal restatement, diverse logical verifications, critique of the critique (meta-level), and a final explicit judgment $\boxed{1}$ or $\boxed{-1}$.
Label Conventions:
- PRM800K steps originally labeled ‘0’ (neutral) are mapped to correct.
- Solutions are truncated at the first detected wrong step.
SFT Objective:

$\theta_{\text{SFT}} = \arg\min_{\theta}\; \mathbb{E}_{(P,S,C)\sim\mathcal{D}_{\text{SFT}}}\bigl[-\log P_{\theta}(C\mid P,S)\bigr]$

Seed SFT raises the base model's average step-error F1 from 34.1% to 54.1% (Yang et al., 1 May 2025).
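
The label conventions and the selection rule above can be illustrated with a short sketch. It assumes PRM800K-style step ratings in {1, 0, -1} and assumes that truncation keeps the first wrong step itself; the function names are hypothetical.

```python
from typing import List, Tuple


def normalize_labels(raw_labels: List[int]) -> List[int]:
    """Map step ratings to {1, -1}: 1 and 0 (neutral) count as correct."""
    return [-1 if label == -1 else 1 for label in raw_labels]


def truncate_at_first_error(
    steps: List[str], labels: List[int]
) -> Tuple[List[str], List[int]]:
    """Keep steps up to and including the first wrong one (an assumption)."""
    for i, label in enumerate(labels):
        if label == -1:
            return steps[: i + 1], labels[: i + 1]
    return steps, labels


def select_aligned(candidates: List[Tuple[str, int]], gold: int) -> List[str]:
    """Keep critiques whose final judgment matches the ground-truth label."""
    return [text for text, judgment in candidates if judgment == gold]


labels = normalize_labels([1, 0, -1, 1])
kept_steps, kept_labels = truncate_at_first_error(["s1", "s2", "s3", "s4"], labels)
aligned = select_aligned([("critique A", 1), ("critique B", -1)], gold=-1)
```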

3. Reinforcement Learning and Data Sources (LLM Domain)

Subsequent RL sharpens the critic’s discrimination ability:

Data Sources:
- Approximately 40.7K supervised PRM800K examples.
- 14.2K “Numina” auto-annotated problems derived via Monte Carlo correctness estimation over GSM8K, MATH, and Olympiad datasets.
Labeling for Auto-Annotation:
- The first error step is the earliest index from which every Monte Carlo rollout (from that step and all later steps) reaches a wrong final answer, while more than half of the rollouts from the preceding steps succeed.
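
One way to read this labeling rule is sketched below. The input layout (one list of rollout outcomes per step), the per-step reading of "more than half", and returning `None` when no index qualifies are assumptions of this example rather than details confirmed by the source.

```python
from typing import List, Optional


def mc_first_error(rollouts: List[List[bool]]) -> Optional[int]:
    """Return the first-error index under the Monte Carlo rule, else None.

    rollouts[i] holds the outcomes (True = correct final answer) of the
    continuations sampled after step i.
    """
    for t in range(len(rollouts)):
        # Every rollout from step t onward must fail.
        later_all_wrong = all(not any(r) for r in rollouts[t:])
        # Each earlier step must succeed in more than half of its rollouts.
        earlier_mostly_right = all(sum(r) > len(r) / 2 for r in rollouts[:t])
        if later_all_wrong and earlier_mostly_right:
            return t
    return None  # ambiguous sample; this sketch leaves it unlabeled


clear_case = [
    [True, True, True, False],
    [True, True, False, True],
    [False, False, False, False],
    [False, False, False, False],
]
ambiguous_case = [[True, False, False, False], [False, False, False, False]]
label_clear = mc_first_error(clear_case)
label_ambiguous = mc_first_error(ambiguous_case)
```
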
Optimization:
- Start from $\theta_{\text{SFT}}$ and update via policy gradients to maximize the match between the predicted error index $a$ and the true index $l$:
$r(a; l)= \begin{cases} 1.0 & \text{if } a=l \\ 0.0 & \text{otherwise} \end{cases}$

$R(\pi_\theta;P,S,l) = \sum_{c_{1:k},j_{1:k},a}\pi_\theta(c_{1:k},j_{1:k},a\mid P,S)\,r(a;l)$
Hyperparameters: learning rate $1\times10^{-6}$ (RL), batch size [value missing], epochs [value missing] (RL), with Group Relative Policy Optimization.
Model Variants: SFT and RL models exist for both PRM800K and Numina data; all derive from Qwen2.5-7B-Instruct.
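
The exact-match reward and the group-relative normalization that GRPO applies to it can be sketched as follows. This is a simplified illustration: it shows only the advantage computation, not the policy update, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def reward(predicted_index: int, true_index: int) -> float:
    """r(a; l): 1.0 when the predicted first-error index matches the label."""
    return 1.0 if predicted_index == true_index else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its sampling group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four critiques sampled for the same (problem, solution) pair whose true
# first-error index is 3.
predictions = [3, 3, -1, 2]
rewards = [reward(a, 3) for a in predictions]
advantages = group_relative_advantages(rewards)
```

Critiques that localize the error correctly receive positive advantages and the rest negative ones; when every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.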

4. Critic Loss for Image Classification (“DeepCritic” / CrtCl)

In image classification, DeepCritic or CrtCl implements a generator–critic paradigm:

Base Classifier (Generator):
- Input: an RGB image.
- Backbone: ResNet-18, producing both softmax class probabilities and the feature maps of its last 4 blocks.
Correctness Critic:
- Inputs: the base classifier's outputs described above; output: a scalar score.
- Each input undergoes global average pooling (GAP) and an FC+ReLU layer that maps it to a fixed-dimensional embedding; the embeddings are then concatenated and passed through a final linear layer with a scalar output (approximate probability of being correct).
Losses:
- Labeled: standard cross-entropy.
- Critic loss (Wasserstein GAN style).
- Generator loss on unlabeled data.
- Combined: the generator objective joins the labeled and unlabeled terms.
Optimization: The critic is trained adversarially (like WGAN), and generator updates combine both labeled and unlabeled loss signals (Rappazzo et al., 2024).
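
The critic head described above (GAP, then FC+ReLU per input, then concatenation and a final linear layer) can be sketched in plain Python to make the data flow concrete. This is an illustration under assumptions: it feeds only feature maps to the critic, uses random untrained weights, and the `CriticHead` class and its parameters are hypothetical names, not the authors' code.

```python
import random
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # channels x height x width


def global_average_pool(fmap: FeatureMap) -> Vector:
    """Average each channel over its spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class CriticHead:
    def __init__(self, channels: List[int], embed_dim: int, seed: int = 0):
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[Vector]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One FC layer per input feature map, then one linear output layer.
        self.branches = [(init(embed_dim, c), [0.0] * embed_dim) for c in channels]
        self.out_w = init(1, embed_dim * len(channels))
        self.out_b = [0.0]

    def score(self, fmaps: List[FeatureMap]) -> float:
        embedded: Vector = []
        for fmap, (w, b) in zip(fmaps, self.branches):
            embedded.extend(relu(linear(global_average_pool(fmap), w, b)))
        return linear(embedded, self.out_w, self.out_b)[0]  # scalar critic output


fmaps = [
    [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]],  # 2 channels, 2x2
    [[[1.0]], [[2.0]], [[3.0]]],                            # 3 channels, 1x1
]
critic_score = CriticHead(channels=[2, 3], embed_dim=4).score(fmaps)
```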

5. Empirical Performance and Benchmarks

LLM Critique Domain

Benchmarks: MR-GSM8K, PRM800K, GSM8K, MATH, OlympiadBench, Omni-Math.
Primary Metric: F1 score on both correct and incorrect solution discrimination.

Model	MR-GSM8K	PRM800K	GSM8K	MATH	Olympiad	Omni-Math	Avg
Qwen2.5-7B-Instruct	48.1	25.6	42.9	36.6	25.5	25.9	34.1
DeepSeek-R1-Distill-Qwen-7B	77.9	57.4	71.9	69.9	56.4	46.8	63.4
GPT-4o	69.7	45.9	72.1	57.3	50.5	53.4	58.2
DeepCritic-7B-SFT	67.1	48.0	59.2	61.2	46.0	43.0	54.1
DeepCritic-7B-RL-Numina	77.2	55.9	70.7	65.9	57.6	53.5	63.5
DeepCritic-7B-RL-PRM800K	77.3	60.1	74.0	72.9	60.9	57.2	67.1
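
As a quick arithmetic check, the Avg column can be recomputed from the six benchmark scores in each row. The snippet below only re-derives numbers already in the table; the tolerance of 0.06 is an assumption that allows for one-decimal rounding.

```python
# (six benchmark scores, reported Avg) copied from the table above.
rows = {
    "Qwen2.5-7B-Instruct": ([48.1, 25.6, 42.9, 36.6, 25.5, 25.9], 34.1),
    "DeepSeek-R1-Distill-Qwen-7B": ([77.9, 57.4, 71.9, 69.9, 56.4, 46.8], 63.4),
    "GPT-4o": ([69.7, 45.9, 72.1, 57.3, 50.5, 53.4], 58.2),
    "DeepCritic-7B-SFT": ([67.1, 48.0, 59.2, 61.2, 46.0, 43.0], 54.1),
    "DeepCritic-7B-RL-Numina": ([77.2, 55.9, 70.7, 65.9, 57.6, 53.5], 63.5),
    "DeepCritic-7B-RL-PRM800K": ([77.3, 60.1, 74.0, 72.9, 60.9, 57.2], 67.1),
}

# A row is consistent when its unweighted mean matches the reported average
# up to rounding.
consistent = {
    name: abs(sum(scores) / len(scores) - reported) < 0.06
    for name, (scores, reported) in rows.items()
}

best_model = max(rows, key=lambda name: rows[name][1])
```

Every row passes the check, which indicates that Avg is the unweighted mean over the six benchmarks.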

Key ablation: Multi-perspective in-depth critiques correct both initial misses and false positives, and RL with human-labeled data outperforms auto-annotated alternatives.

Image Classification Domain

Datasets: CIFAR-10, CIFAR-100, SVHN
Findings:
- 2–5% improvement in test accuracy over best prior learned-loss (e.g., Learning Loss, TOD) and uncertainty-sampling baselines, particularly in low-label regimes.
- Expected Calibration Error is halved (for example, on CIFAR-10).
- In semi-supervised ablations, accuracy and calibration both improve over training with cross-entropy alone.
- Active learning cycles select the datapoints with the lowest critic-predicted correctness for labeling, confirming CrtCl's effectiveness in uncertainty estimation (Rappazzo et al., 2024).
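
For reference, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses that standard equal-width-bin definition; the bin count of 10 is an assumption, and the source does not state which binning the reported numbers use.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Four predictions made at 90% confidence, three of them right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Here the single occupied bin has mean confidence 0.9 and accuracy 0.75, so the model is overconfident by 0.15.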

6. Algorithmic Workflows

LLM DeepCritic Training (Two-Stage)

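Step 1 (seed data): prompt Qwen2.5-72B-Instruct for an initial critique and in-depth critiques of each step in roughly 4,500 PRM800K problem–solution pairs, keep the critiques whose judgment matches the ground-truth label, and merge them into step-wise critiques.
Step 2 (SFT): fine-tune Qwen2.5-7B-Instruct on the merged critiques by minimizing $-\log P_{\theta}(C\mid P,S)$, which yields $\theta_{\text{SFT}}$.
Step 3 (RL): starting from $\theta_{\text{SFT}}$, apply GRPO on PRM800K or Numina data with reward 1 when the predicted first-error index matches the label and 0 otherwise.
Step 4 (inference): critique the steps in order and stop at the first step judged incorrect, or report -1 if none is.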

Image Domain: Generator–Critic Training Loop

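Step 1 (supervised update): train the ResNet-18 classifier on labeled images with cross-entropy.
Step 2 (critic update): train the critic adversarially, WGAN style, to score whether the classifier's prediction is correct.
Step 3 (generator update): update the classifier with the labeled cross-entropy loss together with the critic-based loss on unlabeled images.
Step 4 (active learning): send the unlabeled images with the lowest critic scores for labeling, add them to the labeled set, and repeat.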
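
The active-learning selection within this loop amounts to ranking the unlabeled pool by critic score and labeling the lowest-scoring items. A minimal sketch, assuming scores are kept in a dictionary keyed by a hypothetical image identifier:

```python
from typing import Dict, List


def select_for_labeling(critic_scores: Dict[str, float], budget: int) -> List[str]:
    """Return the `budget` pool items the critic trusts least."""
    ranked = sorted(critic_scores, key=critic_scores.get)  # ascending score
    return ranked[:budget]


# Critic scores for an unlabeled pool (illustrative values only).
pool_scores = {"img_a": 0.9, "img_b": 0.1, "img_c": 0.5, "img_d": 0.3}
to_label = select_for_labeling(pool_scores, budget=2)

# After annotation, the selected items leave the pool for the next cycle.
remaining_pool = {k: v for k, v in pool_scores.items() if k not in to_label}
```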

7. Significance and Implications

DeepCritic’s two-stage, reflective critique framework in LLMs establishes a new state of the art for automated error localization and actionable feedback in mathematical reasoning, outperforming comparably sized models and even proprietary systems such as GPT-4o on several benchmarks. Multi-perspective, meta-level critique is critical for refining solution steps and providing sufficient guidance for downstream correction (Yang et al., 1 May 2025).

In image classification, DeepCritic/CrtCl delivers improved generalization and, crucially, dramatically better calibration in both purely supervised and semi-supervised/active learning settings. Critic-driven label selection in active learning is empirically superior to competing uncertainty and learned-loss approaches, especially under severe label budget constraints. A plausible implication is broader applicability to semi-supervised and reliability-critical vision tasks (Rappazzo et al., 2024).

The DeepCritic paradigm, whether in language or vision, demonstrates the advantages of explicit, trainable supervision signals provided by critic networks, especially when these are trained to reflect on model outputs via multi-perspective, adversarial, or uncertainty-driven mechanisms.

References

Yang et al. (2025). DeepCritic: Deliberate Critique with Large Language Models.

Rappazzo et al. (2024). Critic Loss for Image Classification.
