System-2-like Critic Capability in Neural Agents
- System-2-like critic capability is a neural approach that interleaves reasoning steps and explicit self-critique to verify and improve model outputs.
- It employs intertwined reasoning–critique sequences and dense reinforcement learning signals to optimize both accuracy and introspection.
- By integrating chain-of-thought verification at every sub-step, it yields interpretable error correction and robust performance in complex problem-solving.
System-2-like critic capability refers to an LLM or neural agent’s capacity for slow, deliberative, reflective evaluation of reasoning processes, explicitly modeled on dual-process theories of human cognition. Unlike System-1, which is fast, intuitive, and often error-prone, a System-2-like critic interleaves step-wise reasoning and self-evaluation, yielding robust verification, introspective error-detection, and interpretable self-critique. In recent research, this capability is instantiated through intertwined reasoning–critique action sequences, dense or trajectory-level RL signals for both reasoning and critique quality, and architectural designs that enforce chain-of-thought verification at every sub-step. The following sections synthesize recent state-of-the-art methods, formal objectives, algorithms, and empirical findings as outlined in the Stepwise Think-Critique (STC) framework, as well as contextualizing work from contemporaneous RLHF and chain-of-thought–based critic models.
1. Defining System-2-like Critic: Architectural and Formal Principles
The System-2-like critic paradigm operationalizes a sequential interleaving of reasoning and evaluation within a single model. Given input $x$, a decoder alternates reasoning and critique actions:

$$x \;\to\; r_1 \to c_1 \to r_2 \to c_2 \to \cdots \to r_T \to c_T$$

where each $r_t$ is a reasoning step and $c_t$ is a natural language self-critique comprising a justification and a binary self-evaluation score $s_t$. The model’s internal state $h_t$ is a deterministic summary of the prompt, prior reasoning, and critiques. The policy $\pi_\theta$ generates a joint reasoning–critique super-action $a_t = (r_t, c_t)$ at each step. Value heads $V_{\text{reason}}$, $V_{\text{crit}}$ predict expected returns for reasoning and critique consistency, enabling dense advantage estimation throughout the trajectory (Xu et al., 17 Dec 2025).
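The interleaved trajectory structure can be sketched as a simple container (illustrative only: `StepCritiquePair`, `Trajectory`, and their fields are hypothetical names for exposition, not the STC implementation):

```python
from dataclasses import dataclass, field

@dataclass
class StepCritiquePair:
    """One super-action a_t = (r_t, c_t): a reasoning step plus its self-critique."""
    reasoning: str        # r_t: natural-language reasoning step
    justification: str    # critique text explaining the self-assessment
    self_score: int       # s_t: binary self-evaluation (1 = step judged correct)

@dataclass
class Trajectory:
    """Interleaved reasoning–critique rollout tau = (r_1, c_1, ..., r_T, c_T)."""
    prompt: str
    pairs: list = field(default_factory=list)

    def final_answer(self) -> str:
        return self.pairs[-1].reasoning   # r_T carries the final answer

    def final_self_score(self) -> int:
        return self.pairs[-1].self_score  # s_T: model's verdict on its own answer

# Example: a two-step rollout on a toy arithmetic problem
tau = Trajectory(prompt="Compute 2 + 3 * 4.")
tau.pairs.append(StepCritiquePair("3 * 4 = 12", "Multiplication before addition: correct.", 1))
tau.pairs.append(StepCritiquePair("2 + 12 = 14", "Sum checks out; final answer 14.", 1))
```

In "compact mode" the `justification` and `self_score` fields would simply be omitted from the emitted text; the underlying super-action structure is unchanged.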
This approach unifies reasoning and self-correction, providing immediate stepwise feedback and interpretability. Critiques may be enabled at inference (“full mode”) for detailed traces or omitted for brevity (“compact mode”). Comparison with architectures separating policy from verifier (as in conventional RLHF or after-the-fact reranking) demonstrates that such intertwining reduces system complexity and synchronizes learning.
2. Hybrid Reinforcement Learning Objectives for Stepwise Critique
STC introduces a composition of scalar trajectory-level and dense stepwise rewards to optimize both reasoning accuracy and self-critique quality. For a complete rollout $\tau$ with final answer $r_T$, gold label $y$, and final self-evaluation score $s_T$:
- Reasoning reward: $R_{\text{reason}}(\tau) = \mathbbm{1}[r_T = y]$
- Critique-consistency reward: $R_{\text{crit}}(\tau) = \mathbbm{1}[s_T = \mathbbm{1}[r_T = y]]$
- Format reward: $R_{\text{format}}(\tau)$, a binary indicator that the rollout adheres to the required interleaved reasoning–critique output template
Additionally, stepwise dense critique signals $\{s_n\}$ provide immediate feedback for each step.
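The trajectory-level indicator rewards can be written out directly (a minimal illustration; the function names are ours, and the format check is a stand-in for STC's actual template validator):

```python
def reason_reward(final_answer: str, gold: str) -> float:
    """R_reason = 1[r_T = y]: the final answer matches the gold label."""
    return float(final_answer == gold)

def critique_reward(final_self_score: int, final_answer: str, gold: str) -> float:
    """R_crit = 1[s_T = 1[r_T = y]]: the self-verdict agrees with actual correctness."""
    return float(final_self_score == int(final_answer == gold))

def format_reward(n_steps: int, n_critiques: int) -> float:
    """Stand-in format check: every reasoning step is paired with one critique."""
    return float(n_steps > 0 and n_steps == n_critiques)
```

Note that a wrong answer the model correctly flags as wrong still earns the critique-consistency reward: `critique_reward(0, "15", "14")` is `1.0`. This is what decouples critique quality from raw accuracy.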
The aggregate per-token advantage mixes normalized reasoning, critique, format, and dense stepwise advantages, each weighted by a scalar hyperparameter.
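A minimal sketch of the advantage mix, assuming group-normalized components and illustrative weight values (the weight tuple `w` is a placeholder, not the paper's reported setting):

```python
def normalize(xs):
    """Group-normalize a list of scalars to zero mean, unit variance."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    sd = var ** 0.5 or 1.0  # guard against zero variance
    return [(x - mu) / sd for x in xs]

def mixed_advantages(r_reason, r_crit, r_format, r_step, w=(1.0, 0.5, 0.1, 0.5)):
    """Per-rollout advantage: weighted sum of normalized reward components."""
    comps = [normalize(r) for r in (r_reason, r_crit, r_format, r_step)]
    return [sum(wi * c[k] for wi, c in zip(w, comps))
            for k in range(len(r_reason))]
```

Normalizing within the sampled group before mixing keeps any one reward component from dominating the gradient purely through its scale.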
The RL update employs a PPO-style clipped surrogate with KL penalty (Grouped Reinforcement Policy Optimization, GRPO):

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{k=1}^{G}\min\!\big(\rho_k A_k,\ \operatorname{clip}(\rho_k,\, 1-\varepsilon,\, 1+\varepsilon)\, A_k\big)\right] - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$

where $\rho_k$ is the ratio of current to previous policy probabilities per action, $A_k$ is the mixed advantage, and the KL term regularizes the policy to remain close to a fixed reference $\pi_{\text{ref}}$ (Xu et al., 17 Dec 2025).
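The clipped surrogate with KL penalty can be sketched in plain Python over per-action log-probabilities (a pedagogical scalar version; a real implementation operates on token-level log-probs under autograd, and the sampled KL term here is only an estimate):

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.01):
    """PPO-style clipped surrogate minus a KL penalty to a fixed reference policy."""
    surr, kl = 0.0, 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        rho = math.exp(lp_new - lp_old)            # importance ratio rho_k
        clipped = max(min(rho, 1 + eps), 1 - eps)  # clip(rho, 1-eps, 1+eps)
        surr += min(rho * adv, clipped * adv)      # pessimistic (clipped) bound
        kl += lp_new - lp_ref                      # per-sample KL estimate
    n = len(advantages)
    return surr / n - beta * kl / n
```

With `eps=0.2`, a ratio of 2 on a positive-advantage action is clipped to 1.2, capping how far a single lucky rollout can push the policy.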
3. Training Pipeline: Supervised and Policy Optimization Stages
The STC protocol deploys a two-stage training procedure:
- Supervised Fine-Tuning (SFT): performed over synthesized (problem, trajectory) pairs to initialize the model’s joint reasoning–critique output.
- Reinforcement Learning with GRPO:
  - Group size: $G$ rollouts sampled per prompt
  - Stage-specific reward weights; on the order of 1200 policy updates
  - Learning rates tuned within a small range
- Critique-consistency and stepwise signals computed on-the-fly
Algorithmic pseudocode (abbreviated for clarity):
```
Given: π_θ, π_ref, group size G, clip ε, KL weight β
for each minibatch x_1, ..., x_B:
    for k = 1 ... G:
        τ_k ~ π_{θ_old}(·|x_i)                # sample G rollouts per prompt
        compute r_reason, r_crit, r_format, {s_n}
    compute advantages, clipped surrogate losses, KL penalty
    θ ← θ − α ∇_θ (−J_GRPO(θ))                # gradient ascent on J_GRPO
    π_{θ_old} ← π_θ
```
4. Evaluation Methodology and Benchmarking Results
STC is validated on five mathematical reasoning suites: AIME24, AMC23, MATH-500, Minerva, and OlympiadBench, measuring both standard solution accuracy (Pass@1, Pass@8) and critique efficacy (F1 and specificity).
| Model Variant | Pass@1 | Pass@8 | Critique F1 (final answer) |
|---|---|---|---|
| Base DS-Qwen-1.5B | 41.2 | 60.7 | — |
| STC-SFT | 39.1 | 59.1 | — |
| STC-GRPO (compact) | 48.3 | 67.4 | — |
| STC-GRPO (full) | 49.6 | 67.1 | 60.8 |
STC-GRPO achieves process-step critique specificity ≈53% (vs. ≈14% for SFT) (Xu et al., 17 Dec 2025). The immediate insertion of binary critiques after each substep enables interpretable and densely-supervised traces for error localization.
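Pass@k is conventionally computed with the unbiased combinatorial estimator over n sampled solutions with c correct, and critique F1 follows the standard precision/recall definition (these are the usual metric formulas, not necessarily STC's exact evaluation code; here "judged correct" is treated as the positive class):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k draws (without
    replacement) from n samples, of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

def critique_f1(preds, labels):
    """F1 of binary self-verdicts against ground-truth correctness."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

For example, with 8 samples of which 4 are correct, `pass_at_k(8, 4, 1)` gives 0.5 and `pass_at_k(8, 4, 8)` gives 1.0, matching the intuition behind the Pass@1 vs. Pass@8 gap in the table.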
5. System-2 Capabilities: Mechanisms, Limits, and Interpretability
By emitting explicit self-critique tokens at every sub-step, the model embeds a reflective inner loop that mirrors the slow, analytic, stepwise checking characteristic of System 2 cognition. The output sequence contains both justifications and scored self-assessments, which:
- Provide immediate feedback for every sub-step of reasoning
- Supply dense learning signals during policy optimization
- Produce interpretable error traces, facilitating inspection and debugging
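Because each sub-step carries a binary self-verdict, the first flagged step localizes an error directly in the trace (a sketch; the `(step_text, self_score)` list format is an assumption for illustration, not STC's literal output syntax):

```python
def first_flagged_step(trace):
    """Return (index, text) of the first step the model judged incorrect,
    or None if every self-verdict is positive."""
    for i, (step, score) in enumerate(trace):
        if score == 0:
            return i, step
    return None

# A trace whose second step was self-flagged as wrong
trace = [("3 * 4 = 12", 1), ("2 + 12 = 15", 0), ("answer: 15", 1)]
```

Here `first_flagged_step(trace)` pinpoints step 1 as the error site, which is exactly the kind of interpretable localization the dense per-step supervision is meant to enable.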
Key limitations include:
- Empirical gains currently demonstrated on 1.5B-parameter models
- Critique accuracy remains suboptimal (F1 ≈60%)
- Hyperparameter sensitivity (reward weights, learning rates)
- Absence of multimodal or large-scale trials
Notably, empirical results confirm the correlation between stepwise critique and solution quality, indicating that joint reasoning–critique optimization can drive both upstream accuracy and downstream reliability.
6. Relation to Other Critic Frameworks and the Broader Literature
The STC approach is situated in a broader landscape of System-2 critic architectures:
- Chain-of-Thought Critic Models: Critic-CoT (Zheng et al., 2024) and RefCritic (Tang et al., 20 Jul 2025) model stepwise judgment for error localization and iterative refinement. Critic-CoT, for example, enforces strict alignment between the critic’s first flagged error and solution repair, boosting both reasoning and meta-judgment accuracy.
- Adversarial Self-Play: SPC (Chen et al., 27 Apr 2025) evolves a step-level critic via adversarial games against a deliberate “sneaky generator,” thus inducing robust error-detection skills in the absence of human step-level annotation.
- Closed-Loop Correction Benchmarks: RealCritic (Tang et al., 24 Jan 2025) and CriticBench (Lin et al., 2024) formalize critique quality in terms of the downstream improvement it enables after correction—emphasizing systemic, multi-round refinement as opposed to surface label accuracy.
- Hybrid System-2 Designs: Actor–critic and planning frameworks (e.g., Critic PI2 (Fan et al., 2020)) exploit a value function (“critic”) to select and refine action trajectories, thus explicitly implementing multi-step deliberation and outcome-driven verification.
The STC paradigm can be regarded as a direct neural operationalization of dual-process theory, where the system interleaves action and evaluation in a tightly coupled Markovian process, leveraging stepwise RL signals to shape both the forward reasoning trajectory and its concurrent self-assessment.
7. Open Questions, Limitations, and Future Directions
Areas identified for future research and improvement include:
- Scaling to larger, multimodal, or cross-domain settings
- Further improving critique accuracy and robustness, especially at the process (step) level
- Developing methods for hierarchical critique, memory management, and chunked review (cf. working memory constraints; Lowe, 2024)
- Extending immediate feedback mechanisms to tree- or graph-based planning, enabling more expressive forms of System-2 deliberation
- Enhancing alignment and safety properties by leveraging the built-in critic for real-time oversight and correction of policy behaviors
The explicit integration of reasoning and critique in models such as STC represents a substantive advance towards inherent, human-like critical thinking in LLMs. The unified, end-to-end joint policy approach reduces the architectural complexity seen in pipeline designs, supplies efficient and interpretable meta-cognitive routines, and establishes a rigorous foundation for the next generation of robust, intrinsically trustworthy AI systems (Xu et al., 17 Dec 2025).