DeepCritic: Neural Critique & Calibration
- DeepCritic is a dual-domain framework for automated critique in language and vision, leveraging dedicated critic networks to evaluate and guide model outputs.
- It employs a two-stage method in LLMs, combining supervised fine-tuning with reinforcement learning to generate precise, step-level critiques.
- In image classification, DeepCritic formalizes training as a generator–critic game, boosting active learning and significantly reducing calibration errors.
DeepCritic refers to two lines of research that advance automated critique and supervision with neural networks, one in the language domain and one in vision. In the context of mathematical reasoning with LLMs, DeepCritic denotes a two-stage framework for deliberate, step-level critique. In image classification, DeepCritic (also referenced as CrtCl, for “Critic Loss for Image Classification”) formalizes classifier training as a generator–critic game, improving model calibration and active learning. Both approaches rely on a dedicated critic network (or LLM subsystem) trained to make reliable step-level or sample-level judgments, and both report state-of-the-art improvements in their respective domains (Yang et al., 1 May 2025; Rappazzo et al., 2024).
1. Deliberate Critique in LLMs: System Architecture
DeepCritic for mathematical reasoning in LLMs is a two-stage pipeline designed to cultivate a critic model capable of deliberate, multi-perspective stepwise critique. The pipeline consists of:
- Stage 1: Supervised Fine-Tuning (SFT). A base LLM is fine-tuned on a seed corpus of in-depth, step-by-step critiques, each containing restatements of reasoning goals, multi-perspective logical checks, meta-critiques, and explicit correctness judgments.
- Stage 2: Reinforcement Learning (RL). The SFT model is optimized using reward signals from either human-labeled (PRM800K) or automatically annotated (Monte Carlo correctness estimation) datasets, via Group Relative Policy Optimization (GRPO).
At inference, DeepCritic evaluates each solution step in sequence, producing a reflective critique per step and stopping at the first incorrect segment (or reporting “–1” if all steps are deemed correct) (Yang et al., 1 May 2025).
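The inference procedure above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the `StepCritic` signature, the `toy_critic` stand-in, and the use of 0-based step indices are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a critic receives the problem, the steps accepted so far,
# and the current step, and returns (critique_text, step_is_correct).
StepCritic = Callable[[str, List[str], str], Tuple[str, bool]]


def locate_first_error(
    problem: str, steps: List[str], critic: StepCritic
) -> Tuple[int, List[str]]:
    """Critique steps in order and stop at the first step judged incorrect.

    Returns (first_error_index, critiques); the index is -1 if every step passes.
    Indices are 0-based here, which is an assumption of this sketch.
    """
    critiques: List[str] = []
    for i, step in enumerate(steps):
        critique, is_correct = critic(problem, steps[:i], step)
        critiques.append(critique)
        if not is_correct:
            return i, critiques  # later steps are never examined
    return -1, critiques


def toy_critic(problem: str, previous: List[str], step: str) -> Tuple[str, bool]:
    """Stand-in for the critic model: flags one hard-coded arithmetic slip."""
    ok = "= 11" not in step
    return (f"Checked '{step}': {'ok' if ok else 'arithmetic error'}", ok)


index, notes = locate_first_error(
    "Compute (2 + 3) * 2 - 1.",
    ["2 + 3 = 5", "5 * 2 = 11", "11 - 1 = 10"],
    toy_critic,
)
```

In this toy run the loop stops at the second step, so the third step receives no critique.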
2. Supervised Fine-Tuning and Critique Generation (LLM Domain)
Seed data for SFT is generated through a multi-prompting and selection process leveraging the Qwen2.5-72B-Instruct model:
- Data Construction:
  - Draw approximately 4,500 problem–solution pairs from PRM800K.
  - For each step, generate an initial critique, followed by up to 16 in-depth critiques that either assess the step from an alternate perspective or critique the initial critique itself.
  - Keep only candidate critiques whose correctness judgment agrees with the ground-truth label.
  - Merge the retained critiques into a comprehensive, step-wise format.
- Critique Structure:
  - Each step critique consists of a goal restatement, diverse logical verifications, a critique of the critique (meta-level), and a final explicit judgment of correct or incorrect.
- Label Conventions:
  - PRM800K steps originally labeled ‘0’ (neutral) are mapped to correct.
  - Solutions are truncated at the first detected wrong step.
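- SFT Objective: the base model is fine-tuned on the merged critiques with a standard supervised (next-token likelihood) objective.
Seed SFT raises the step-error F1 of the base model, averaged over the six benchmarks, from 34.1% to 54.1% (Yang et al., 1 May 2025).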
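The label conventions and the critique selection rule can be illustrated with a small sketch. It assumes PRM800K step ratings take the values -1 (incorrect), 0 (neutral), and 1 (correct), and that truncation keeps the first wrong step so the critic can still judge it; the function names are hypothetical.

```python
from typing import List, Tuple


def prepare_solution(steps: List[str], ratings: List[int]) -> Tuple[List[str], int]:
    """Map ratings to binary correctness and truncate at the first wrong step.

    Ratings of 0 (neutral) and 1 count as correct; -1 counts as incorrect.
    Returns (kept_steps, first_error_index), with -1 when no step is wrong.
    Keeping the wrong step itself is an assumption of this sketch.
    """
    if len(steps) != len(ratings):
        raise ValueError("steps and ratings must have the same length")
    for i, rating in enumerate(ratings):
        if rating < 0:
            return steps[: i + 1], i
    return list(steps), -1


def select_consistent(
    candidates: List[Tuple[str, bool]], step_is_correct: bool
) -> List[str]:
    """Keep critiques whose verdict agrees with the ground-truth step label."""
    return [text for text, judged_correct in candidates if judged_correct == step_is_correct]


kept, first_error = prepare_solution(["a", "b", "c", "d"], [1, 0, -1, 1])
agreeing = select_consistent([("looks fine", True), ("sign error", False)], False)
```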
3. Reinforcement Learning and Data Sources (LLM Domain)
Subsequent RL sharpens the critic’s discrimination ability:
- Data Sources:
  - Approximately 40.7K supervised PRM800K examples.
  - 14.2K “Numina” auto-annotated problems derived via Monte Carlo correctness estimation over GSM8K, MATH, and Olympiad datasets.
- Labeling for Auto-Annotation:
  - The first error step is set as the earliest index such that all later rollouts are wrong, and more than half of previous steps succeed.
- Optimization:
  - Start from the SFT model and update it via policy gradients to maximize agreement between the predicted first-error index and the true index.
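The auto-annotation rule can be sketched as follows, under one possible reading of it: each step has a set of Monte Carlo rollouts (continuations from the prefix ending at that step), a step is the first error if all of its rollouts fail, and every earlier step must have more than half of its rollouts succeed. That reading, the function name, and the choice to return `None` when the rule does not fire are assumptions of this sketch rather than details confirmed by the source.

```python
from typing import Optional, Sequence


def first_error_from_rollouts(rollouts: Sequence[Sequence[bool]]) -> Optional[int]:
    """Return the first-error step index implied by per-step rollout outcomes.

    rollouts[i][k] is True when the k-th continuation from step i reaches the
    correct final answer. Returns None when no step qualifies, so the caller
    can decide whether to discard the sample or treat it as fully correct.
    """
    for i, outcomes in enumerate(rollouts):
        if not outcomes:
            raise ValueError(f"step {i} has no rollouts")
        if not any(outcomes):  # every rollout from this step fails
            earlier_ok = all(2 * sum(prev) > len(prev) for prev in rollouts[:i])
            return i if earlier_ok else None
    return None


clean = [[True, True, True, False], [True, True, False, True],
         [False, False, False, False], [False, False, False, False]]
ambiguous = [[True, False, False, False], [False, False, False, False]]
```

Only the earliest all-fail step can satisfy the rule, because any later candidate would have an earlier step whose rollouts all fail.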
Hyperparameters: the RL stage has its own learning rate, batch size, and epoch count, and is trained with Group Relative Policy Optimization (GRPO).
Model Variants: SFT and RL models exist for both PRM800K and Numina data; all derive from Qwen2.5-7B-Instruct.
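As a rough illustration of the optimization signal, the sketch below pairs a binary match reward with the group-relative normalization that gives GRPO its name. The reward definition (1 when the predicted first-error index equals the true index, else 0), the use of the population standard deviation, and the `eps` constant are assumptions of this example, not values taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def match_reward(predicted_index: int, true_index: int) -> float:
    """Assumed reward: 1.0 for an exact match on the first-error index, else 0.0."""
    return 1.0 if predicted_index == true_index else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of critiques sampled for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critiques for one solution whose true first error is step 2.
predictions = [2, 2, -1, 3]
rewards = [match_reward(p, 2) for p in predictions]
advantages = group_relative_advantages(rewards)
```

When every critique in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.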
4. Critic Loss for Image Classification (“DeepCritic” / CrtCl)
In image classification, DeepCritic or CrtCl implements a generator–critic paradigm:
- Base Classifier (Generator):
  - Input: an RGB image.
  - Backbone: ResNet-18, producing both softmax class probabilities and the feature maps of its last 4 blocks.
- Correctness Critic:
  - Inputs: the classifier's softmax probabilities and block feature maps; output: a scalar correctness score.
  - Each input undergoes global average pooling (GAP) and an FC+ReLU layer that maps it to a fixed-size embedding; the embeddings are concatenated and passed through a final linear layer that outputs a scalar (an approximate probability of being correct).
- Losses:
  - Labeled: standard cross-entropy.
  - Critic loss (Wasserstein GAN style).
  - Generator loss on unlabeled data.
  - Combined: the labeled and unlabeled loss terms.
Optimization: The critic is trained adversarially (like WGAN), and generator updates combine both labeled and unlabeled loss signals (Rappazzo et al., 2024).
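Since the loss formulas are not reproduced here, the following is only a generic sketch of how Wasserstein-style terms could be wired into such a generator–critic setup. It assumes the critic is pushed to score correctly classified labeled samples above incorrectly classified ones, and that the generator is pushed to raise the critic's score on unlabeled samples; the function names and the `weight` parameter are hypothetical, and none of this should be read as the paper's exact formulation.

```python
from statistics import mean
from typing import Sequence


def critic_loss(scores_correct: Sequence[float], scores_incorrect: Sequence[float]) -> float:
    """Wasserstein-style critic objective (assumed form).

    Minimizing it raises scores for correct predictions and lowers them for
    incorrect ones. Both groups must be non-empty, or mean() raises an error.
    """
    return mean(scores_incorrect) - mean(scores_correct)


def generator_unlabeled_loss(scores_unlabeled: Sequence[float]) -> float:
    """Generator term on unlabeled data: raise the critic's score (assumed form)."""
    return -mean(scores_unlabeled)


def combined_generator_loss(cross_entropy: float, unlabeled_loss: float, weight: float = 1.0) -> float:
    """Labeled cross-entropy plus a weighted unlabeled term; weight is hypothetical."""
    return cross_entropy + weight * unlabeled_loss


c_loss = critic_loss([0.9, 0.8], [0.2, 0.1])
g_loss = combined_generator_loss(1.2, generator_unlabeled_loss([0.5, 0.7]), weight=0.5)
```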
5. Empirical Performance and Benchmarks
LLM Critique Domain
Benchmarks: MR-GSM8K, PRM800K, GSM8K, MATH, OlympiadBench, Omni-Math. Primary Metric: F1 score over both correct and incorrect solutions.
| Model | MR-GSM8K | PRM800K | GSM8K | MATH | Olympiad | Omni-Math | Avg |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 48.1 | 25.6 | 42.9 | 36.6 | 25.5 | 25.9 | 34.1 |
| DeepSeek-R1-Distill-Qwen-7B | 77.9 | 57.4 | 71.9 | 69.9 | 56.4 | 46.8 | 63.4 |
| GPT-4o | 69.7 | 45.9 | 72.1 | 57.3 | 50.5 | 53.4 | 58.2 |
| DeepCritic-7B-SFT | 67.1 | 48.0 | 59.2 | 61.2 | 46.0 | 43.0 | 54.1 |
| DeepCritic-7B-RL-Numina | 77.2 | 55.9 | 70.7 | 65.9 | 57.6 | 53.5 | 63.5 |
| DeepCritic-7B-RL-PRM800K | 77.3 | 60.1 | 74.0 | 72.9 | 60.9 | 57.2 | 67.1 |
Key ablation: Multi-perspective in-depth critiques correct both initial misses and false positives, and RL with human-labeled PRM800K data outperforms the auto-annotated Numina variant (67.1 vs. 63.5 average F1).
Image Classification Domain
Datasets: CIFAR-10, CIFAR-100, SVHN
Findings:
- 2–5% improvement in test accuracy over the best prior learned-loss baselines (e.g., Learning Loss, TOD) and uncertainty-sampling baselines, particularly in low-label regimes.
- Expected Calibration Error (ECE) is halved, for example on CIFAR-10.
- In semi-supervised ablations, both accuracy and calibration improve over cross-entropy alone.
- Active learning cycles select the datapoints with the lowest predicted correctness score for labeling, confirming CrtCl's effectiveness in uncertainty estimation (Rappazzo et al., 2024).
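For readers unfamiliar with the calibration metric, the sketch below computes Expected Calibration Error in its common equal-width-bin form: the sample-weighted average gap between accuracy and mean confidence per bin. The bin count of 10 and the function name are choices made for this example; the source does not state which binning the reported numbers use.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE over predicted confidences in [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [True, True, True, False])
```

In this toy case each occupied bin is off by 0.05, so the overall error is 0.05; halving ECE means halving this weighted gap.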
6. Algorithmic Workflows
LLM DeepCritic Training (Two-Stage)
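- Generate seed critiques for PRM800K solution steps with Qwen2.5-72B-Instruct, keep those that agree with the ground-truth labels, and merge them into deliberate step-wise critiques.
- Fine-tune Qwen2.5-7B-Instruct on the seed critiques (SFT).
- Optimize the SFT model with GRPO on PRM800K or auto-annotated Numina data, rewarding agreement between the predicted and true first-error index.
- At inference, critique each step in order and report the first incorrect step, or –1 if none is found.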
Image Domain: Generator–Critic Training Loop
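- Train the ResNet-18 classifier on labeled images with cross-entropy.
- Train the critic, Wasserstein GAN style, to score whether the classifier's predictions are correct, using its softmax outputs and block feature maps.
- Update the classifier on unlabeled images with the generator loss, combined with the labeled loss.
- In each active learning cycle, send the unlabeled samples with the lowest critic scores for labeling and repeat.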
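The labeling step of the active learning cycle reduces to ranking unlabeled samples by critic score. The sketch below assumes scores are already available as a mapping from sample id to predicted correctness; the function name, the id-based tie-break, and the example ids are hypothetical.

```python
from typing import Dict, List


def select_for_labeling(critic_scores: Dict[str, float], budget: int) -> List[str]:
    """Pick the `budget` unlabeled samples the critic trusts least.

    Ties are broken by sample id so the selection is deterministic.
    """
    ranked = sorted(critic_scores, key=lambda sample_id: (critic_scores[sample_id], sample_id))
    return ranked[: max(budget, 0)]


scores = {"img_a": 0.91, "img_b": 0.12, "img_c": 0.47, "img_d": 0.05}
to_label = select_for_labeling(scores, budget=2)
```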
7. Significance and Implications
DeepCritic’s two-stage, reflective critique framework in LLMs establishes a new state of the art for automated error localization and actionable feedback in mathematical reasoning: the RL model trained on PRM800K achieves the highest average F1 among the comparably sized models listed and outperforms the proprietary GPT-4o on all six benchmarks. Multi-perspective, meta-level critique is critical for refining solution steps and providing sufficient guidance for downstream correction (Yang et al., 1 May 2025).
In image classification, DeepCritic/CrtCl delivers improved generalization and, crucially, dramatically better calibration in both purely supervised and semi-supervised/active learning settings. Critic-driven label selection in active learning is empirically superior to competing uncertainty and learned-loss approaches, especially under severe label budget constraints. A plausible implication is broader applicability to semi-supervised and reliability-critical vision tasks (Rappazzo et al., 2024).
The DeepCritic paradigm, whether in language or vision, demonstrates the advantages of explicit, trainable supervision signals provided by critic networks, especially when these are trained to reflect on model outputs via multi-perspective, adversarial, or uncertainty-driven mechanisms.