Stepwise Think-Critique: Structured LLM Evaluation
- STC is a framework that decomposes complex reasoning into alternating 'think' and 'critique' steps, systematically localizing errors and refining the intermediate reasoning.
- It leverages both dual-model critic-refiner architectures and unified interleaved models to provide clear, step-level feedback during inference and training.
- STC enhances data efficiency and model robustness by integrating step-level supervision, reinforcement learning, and process-level verification in LLMs.
Stepwise Think-Critique (STC) is a framework that unifies reasoning and evaluation by interleaving step-level generation (“think”) and fine-grained critique (“critique”) within or across LLMs. Originally developed to address the deficiencies of shallow, instance-level self-critique, STC has become a central paradigm for robust multi-step reasoning, interpretable verification, and process-level supervision in both mathematical reasoning and long-form generation. By emulating System-2 human analytic processes, STC systematically localizes errors and enables targeted refinement of intermediate reasoning, synthesizing training- and inference-time techniques for improved accuracy, interpretability, and data efficiency.
1. STC Formalism and Core Principles
Stepwise Think-Critique (STC) departs from holistic judgments over entire outputs by decomposing reasoning into alternating cycles of explicit “thinking” (generating a CoT or solution chunk) and “critiquing” (step-level evaluation) (Zheng et al., 29 Aug 2024, Yang et al., 1 May 2025, Xu et al., 17 Dec 2025). In its canonical instantiation, the process proceeds as follows (a minimal sketch of the loop appears after this list):
- Think: The model generates a step or chunk (e.g., s₁, s₂, ..., sₙ for math, or S₁, ..., S_T for text).
- Critique: For each generated step i, a critic—either a separate LLM module or the same policy under a critique prompt—delivers a fine-grained label (e.g., +1/–1, correct/incorrect, or binary score) and, in advanced variants, also provides justification or meta-critique (Zheng et al., 29 Aug 2024, Yang et al., 1 May 2025, Xu et al., 17 Dec 2025).
- Refine or Proceed: If all labels are positive, the solution proceeds; if a negative label is encountered at some step j, only the offending portion is revised (e.g., Att^{(k+1)} ← Refine(Q, Att^{(k)}, j)) (Zheng et al., 29 Aug 2024).
- Termination: The loop continues until the full reasoning trace is accepted or a maximum number of rounds/restarts is reached.
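To make the loop concrete, the following is a minimal Python sketch of the think–critique–refine cycle under a dual-model reading. The callables generate_step, critique_step, and refine_from are placeholders for whatever LLM interface a given system exposes; they are illustrative names, not APIs defined in the cited papers.

```python
from typing import Callable, List, Optional, Tuple

def stc_solve(
    question: str,
    generate_step: Callable[[str, List[str]], Optional[str]],  # "think": propose the next step, or None when done
    critique_step: Callable[[str, List[str], str], int],       # "critique": +1 (accept) or -1 (reject) for a step
    refine_from: Callable[[str, List[str], int], List[str]],   # revise the trace starting at the first flagged step
    max_rounds: int = 5,
) -> Tuple[List[str], bool]:
    """Alternate think/critique phases; on a negative label, revise only the offending portion of the trace."""
    steps: List[str] = []
    for _ in range(max_rounds):
        # Think: extend the partial solution step by step until the generator signals completion.
        while (step := generate_step(question, steps)) is not None:
            steps.append(step)
        # Critique: label each step in order and find the first rejected one, if any.
        first_bad = next(
            (j for j, s in enumerate(steps) if critique_step(question, steps[:j], s) < 0),
            None,
        )
        if first_bad is None:
            return steps, True                           # Termination: full trace accepted
        steps = refine_from(question, steps, first_bad)  # Refine: targeted revision from the flagged step onward
    return steps, False                                  # round/restart budget exhausted
```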
STC can be implemented as:
- Dual-model critic-refiner architectures: A dedicated critic labels and enforces stepwise revision (Zheng et al., 29 Aug 2024, Yang et al., 1 May 2025).
- Unified interleaved models: Alternating (reasoning, critique) token pairs are produced in a single autoregressive LLM (Xu et al., 17 Dec 2025).
The following table summarizes archetypal STC pipelines found in key contributions:
| System | Architecture | Critique Content | Refinement | RL/Process Supervision |
|---|---|---|---|---|
| Critic-CoT | Separate critic+refiner | +1/–1 label | Yes | Distant supervision |
| DeepCritic | Fine-tuned deliberate critic | CoT/meta-reflective | Yes | RL (GRPO), SFT |
| STC-Unified | Interleaved in one policy | NL + binary label | n/a | Hybrid RL (GRPO) |
| ThinkPRM | Generative CoT verifier | CoT, “\boxed{label}” | n/a | MLE (CoT labels) |
| StepWiser | Generative judge | “Analysis: ...” + box | n/a | RL (GRPO, MC signals) |
| PANEL | Self-prompted NL critiques | NL explanation | n/a | None (inference only) |
| LongDPO | MCTS + external critique | Structured NL critique | Yes | Step-level DPO |
2. STC Training Paradigms and Objectives
STC frameworks utilize a range of training objectives and data pipelines to induce deliberate, interpretable critique capacity:
- Supervised Fine-Tuning (SFT): Seed datasets of (problem, solution, step labels/critiques) pairs are generated, often with a strong LLM (e.g., Qwen2.5-72B, GPT-5) serving as a teacher. Fine-tuning learns to emit coherent, multi-perspective or meta-reflective critiques under the standard maximum-likelihood objective $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,c)\sim\mathcal{D}}\sum_{t}\log \pi_\theta(c_t \mid x, c_{<t})$, where x denotes the problem together with the solution prefix and c the target critique (Yang et al., 1 May 2025, Xu et al., 17 Dec 2025).
- Reinforcement Learning (GRPO and Variants): Hybrid objectives combine rewards for correct reasoning outcomes, critique accuracy (consistency between the model's critique and the true step correctness), and format compliance. Specifically, STC frameworks employ group relative policy optimization (GRPO) (Xu et al., 17 Dec 2025), with a reward that schematically takes the form $r = r_{\text{outcome}} + \lambda_{\text{crit}}\, r_{\text{critique}} + \lambda_{\text{fmt}}\, r_{\text{format}}$. Stepwise, dense critique signals (e.g., normalized per-step correctness scores) are used as token-level advantages to stabilize and accelerate credit assignment (Xu et al., 17 Dec 2025); a sketch of this reward composition appears after this list.
- Process Supervision and DPO: For long-form generation, stepwise preference pairs (chosen, rejected) are extracted using Monte Carlo Tree Search, augmented with external critiques, and the model is trained under a step-level DPO loss (Ping et al., 4 Feb 2025).
- Monte-Carlo and Self-Distillation: Datasets can be built without manual annotation via automatic error localization using a teacher model, self-distilled over process-trace sampling and re-critique (Zheng et al., 29 Aug 2024, Yang et al., 1 May 2025).
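The sketch below illustrates, under assumed weights and field names, how such a hybrid reward and the dense stepwise advantages could be composed: a sequence-level reward adds an outcome term, a critique-consistency term, and a format term, while per-step correctness scores are broadcast to the tokens of each step and centered against a group baseline, GRPO-style. None of the helpers or weights are taken from the cited systems; they are placeholders for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    tokens: List[int]      # token ids belonging to this reasoning step
    critique_label: int    # model-emitted label for the step: +1 or -1
    gold_label: int        # ground-truth step correctness: +1 or -1

def hybrid_reward(steps: List[Step], answer_correct: bool, format_ok: bool,
                  w_ans: float = 1.0, w_crit: float = 0.5, w_fmt: float = 0.1) -> float:
    """Sequence-level reward: outcome + critique accuracy + format compliance (illustrative weights)."""
    crit_acc = sum(s.critique_label == s.gold_label for s in steps) / max(len(steps), 1)
    return w_ans * float(answer_correct) + w_crit * crit_acc + w_fmt * float(format_ok)

def token_level_advantages(steps: List[Step], group_baseline: float) -> List[float]:
    """Dense credit assignment: every token inherits an advantage from its step's correctness score,
    centered by a group baseline as in GRPO-style training."""
    advantages: List[float] = []
    for s in steps:
        step_score = float(s.gold_label)           # +1 for correct steps, -1 for incorrect ones
        adv = step_score - group_baseline          # center against the group mean reward
        advantages.extend([adv] * len(s.tokens))   # broadcast the step-level signal to its tokens
    return advantages
```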
3. Inference-Time Algorithms and Deployment
STC methods provide both explicit inference-time enhancement strategies and integrated generation modes:
- Iterative refinement: Errors flagged at any step trigger targeted regeneration from the first mistake (Zheng et al., 29 Aug 2024).
- Critic as filter: Multiple candidate solutions are generated; those passing all critique steps are kept, and answers are aggregated by majority vote or best-of-K selection (see the sketch after this list) (Zheng et al., 29 Aug 2024, Khalifa et al., 23 Apr 2025).
- Process-level best-of-N and reward-guided search: Generative STC verifiers (e.g., ThinkPRM, StepWiser) assign scores based on the “think–critique” trace; candidates are ranked and selected accordingly (Khalifa et al., 23 Apr 2025, Xiong et al., 26 Aug 2025).
- PANEL-style search: At each reasoning step, candidate continuations are augmented with natural-language self-critiques, and selection is performed by the policy model fed both step and critique (Li et al., 21 Mar 2025).
- Unified compact/full inference: Interleaved reasoning–critique outputs enable error localization (full mode) or raw solution emission (compact mode) (Xu et al., 17 Dec 2025).
- Long-form MCTS with external critique: Stepwise candidate expansion, scoring, and selection is mediated by external natural language critique generation and suggestion incorporation (Ping et al., 4 Feb 2025).
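A minimal Python sketch of the critic-as-filter pattern: sample N candidate solutions, keep those whose every step passes the step-level critique, and aggregate the surviving answers by majority vote, falling back to an unfiltered vote if the filter rejects everything. The sampling and critique interfaces (sample_solution, critique_step, extract_answer) are assumed placeholders rather than APIs from the cited works.

```python
from collections import Counter
from typing import Callable, List, Optional

def critique_filtered_vote(
    question: str,
    sample_solution: Callable[[str], List[str]],           # returns one candidate solution as a list of steps
    critique_step: Callable[[str, List[str], str], bool],  # True if the step is judged correct given its prefix
    extract_answer: Callable[[List[str]], str],            # pulls the final answer out of a solution trace
    n_samples: int = 16,
) -> Optional[str]:
    """Best-of-N with critique filtering: majority vote over candidates that pass every step-level check."""
    candidates = [sample_solution(question) for _ in range(n_samples)]
    passing = [
        c for c in candidates
        if all(critique_step(question, c[:j], step) for j, step in enumerate(c))
    ]
    pool = passing if passing else candidates   # fall back to a plain majority vote if nothing survives
    votes = Counter(extract_answer(c) for c in pool)
    return votes.most_common(1)[0][0] if votes else None
```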
4. Empirical Results and Impact
STC consistently delivers empirical improvements in both reasoning and verification metrics across mathematical, general, and long-form generation benchmarks.
- Math Benchmarks (GSM8K, MATH, AIME):
- Critic-CoT: +3.7pp over Llama-3-70B-Instruct on GSM8K (93.3% top-1, 95.4% majority vote), +14pp on MATH (65.0%) (Zheng et al., 29 Aug 2024).
- ThinkPRM: Outperforms discriminative counterparts by +14–18 F1 (OlympiadBench/OmniMath) using only 1% of supervision (Khalifa et al., 23 Apr 2025).
- DeepCritic: DeepCritic-7B-RL-PRM800K achieves 67.1 F1 on process error localization, +10 over GPT-4o (Yang et al., 1 May 2025).
- StepWiser: Raises judgment F1 by +23 on ProcessBench (SFT: 38.9, StepWiser: 61.9), with +5.8 downstream solution accuracy (Xiong et al., 26 Aug 2025).
- Inference Time Scaling:
- Critique-guided best-of-K: Yields +10–26.7 pp over majority voting for K ∈ {2, …, 16} (Xu et al., 17 Dec 2025).
- Critique-based beam search: ThinkPRM-1.5B@4 = 75% vs. DiscPRM@4 = 67% on MATH-500 (Khalifa et al., 23 Apr 2025).
- PANEL: Outperforms scalar step-level evaluators by up to +5.6% on GPQA and AIME (Li et al., 21 Mar 2025).
- Long-form Generation:
- LongDPO: On LongBench-Write, length and quality scores improved by up to +8 on the longest bucket; human evaluators preferred output in 60%+ of pairwise trials (Ping et al., 4 Feb 2025).
Qualitative results indicate that STC models can localize the first incorrect step, provide actionable feedback for refinement, and produce transparent reasoning traces that support post-hoc error analysis and process auditing (Zheng et al., 29 Aug 2024, Yang et al., 1 May 2025, Xu et al., 17 Dec 2025, Khalifa et al., 23 Apr 2025, Xiong et al., 26 Aug 2025, Li et al., 21 Mar 2025).
5. Data Efficiency, Robustness, and Interpretability
A salient advantage of generative STC frameworks is marked data efficiency coupled with interpretability:
- Data Efficiency: ThinkPRM achieves superior verification accuracy using only ~1% of full process label budgets (8K labels vs. 712K) (Khalifa et al., 23 Apr 2025). Similarly, DeepCritic and Critic-CoT leverage automatically synthesized and self-distilled data (Zheng et al., 29 Aug 2024, Yang et al., 1 May 2025).
- Robustness: Out-of-domain generalization is consistently observed. ThinkPRM verifies GPQA-Physics at +8 F1 compared to discriminative baselines trained on full PRM800K (Khalifa et al., 23 Apr 2025).
- Interpretability: The explicit natural language or CoT critique at each step provides a transparent basis for model decisions, error localization, and diagnosis. Dense stepwise rewards further enhance reward shaping without loss of process visibility (Xu et al., 17 Dec 2025, Xiong et al., 26 Aug 2025, Yang et al., 1 May 2025).
- Test-time scaling: Critique-guided selection boosts sample efficiency, and dynamic allocation of inference compute allows for flexible deployment (Khalifa et al., 23 Apr 2025).
6. Limitations, Open Challenges, and Future Directions
Common challenges and research frontiers identified across STC studies include:
- Computational cost: Monte Carlo rollouts for dense stepwise RL are expensive (e.g., 14 days on 8×A100 for StepWiser) (Xiong et al., 26 Aug 2025).
- Domain adaptation: Most existing STC deployments focus on math reasoning; extending to code generation, commonsense, and planning tasks presents additional challenges (Yang et al., 1 May 2025, Xiong et al., 26 Aug 2025).
- Label granularity: Most approaches use coarse binary stepwise rewards; progress toward continuous or multi-class critique signals is an open direction (Xiong et al., 26 Aug 2025).
- Critique calibration: Noisy or miscalibrated critiques, especially from self-prompted models, can occasionally misguide search or refinement (Li et al., 21 Mar 2025).
- Scaling seed data synthesis: Current approaches still rely on extremely strong teacher LLMs for high-quality data generation; fully automatic, scalable data augmentation remains unsolved (Yang et al., 1 May 2025).
- Task-general process supervision: Hybrid schemes combining scalar and critique-based feedback, human-in-the-loop refinement, and curriculum learning are under exploration (Yang et al., 1 May 2025).
- Overfitting and model collapse: Imbalanced stepwise outcome distributions demand entropy regularization and prompt-level balancing during RL (Xiong et al., 26 Aug 2025).
- Long context and memory management: Maintaining coherence and high-quality critique across thousands of tokens (e.g., in LongDPO) remains challenging (Ping et al., 4 Feb 2025).
7. Relationship to Existing Paradigms and Broader Implications
STC architectures generalize and unify diverse verification and reward modeling strategies:
- Systematized Critique: Unlike classical post-hoc process reward models (PRMs), STC mechanisms embed critique into generation, providing a persistent stream of verifiable meta-reasoning (Zheng et al., 29 Aug 2024, Khalifa et al., 23 Apr 2025).
- Comparison to Scalar Verifiers: STC surpasses single-number evaluation on information richness, qualitative feedback, and nuanced error localization (Li et al., 21 Mar 2025, Khalifa et al., 23 Apr 2025).
- Test-time compute allocation: Generative STC critics enable flexible and granular allocation of inference resources (e.g., length of CoT verification or number of parallel self-checks) (Khalifa et al., 23 Apr 2025).
- RL Integration and Policy Learning: Hybrid RL objectives over dense stepwise rewards accelerate training and yield simultaneously robust process evaluators and improved generator behavior (Xu et al., 17 Dec 2025, Xiong et al., 26 Aug 2025).
- Process supervision in long-form generation: Stepwise preference modeling and critique-augmented selection (as in LongDPO) are critical for length and consistency control in ultra-long contexts (Ping et al., 4 Feb 2025).
A plausible implication is that STC provides a principled foundation for constructing LLMs with built-in critical thinking, transparent decision-making, and scalable process-level oversight, offering a structured alternative to black-box outcome-only evaluation and supervision.