Critic-CoT: Iterative Self-Critique for LLMs
- Critic-CoT is a framework that enables LLMs to iteratively self-critique and refine their chain-of-thought reasoning through explicit error annotation.
- It employs preference learning and majority-vote filtering to select the most reliable solution from multiple generated candidates.
- Empirical results show improved accuracy in symbolic and mathematical reasoning while highlighting challenges with noise and distribution shifts.
Critic-CoT denotes a family of frameworks and methodologies that transform the reasoning process of large language models (LLMs) through explicit step-wise self-critique, refinement, and preference learning. Unlike standard Chain-of-Thought (CoT) approaches—which elicit reasoning by prompting intermediate steps but typically lack meta-level validation—Critic-CoT equips models with the ability to systematically evaluate, annotate, select, and revise their own reasoning chains. This section formalizes Critic-CoT’s mechanisms, its data construction and training objectives, the observed gains in symbolic and mathematical reasoning, design implications, empirical controversies, and perspectives on scaling, robustness, and generalization.
1. Formalization and Critic-CoT Mechanisms
Critic-CoT (Zheng et al., 29 Aug 2024) centers on enhancing both self-critique and iterative refinement directly within LLMs. The process operates over explicit Chain-of-Thought solutions:
- For a question $q$ and an attempt $\hat{s} = (s_1, \ldots, s_n)$ with final prediction $\hat{a}$, a critic annotates each reasoning step $s_i$ with a label $c_i \in \{0, 1\}$ (where $c_i = 1$ indicates correctness).
- Critique can trigger either filtering or localized iterative refinement: upon detection of erroneous steps ($c_i = 0$ for some $i$), the model refines its reasoning starting at the earliest mistake.
- Preference-based aggregation strategies (e.g., majority-vote filtering) use
$$\hat{a}^{*} = \arg\max_{a} \sum_{k=1}^{K} \mathbb{1}\left[\hat{a}_k = a\right]$$
to select consensus answers from the $K$ retained candidates.
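The majority-vote filtering step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the optional `critic_pass` argument stands in for the critic's per-attempt accept/reject decision:

```python
from collections import Counter

def majority_vote(candidates, critic_pass=None):
    """Select the consensus answer among candidate final predictions.

    candidates:  list of final answers from K sampled CoT attempts.
    critic_pass: optional list of booleans; when given, only attempts
                 the critic judged correct participate in the vote
                 (majority-vote *filtering*).
    """
    if critic_pass is not None:
        kept = [a for a, ok in zip(candidates, critic_pass) if ok]
        # Fall back to the unfiltered pool if the critic rejects everything.
        candidates = kept or candidates
    counts = Counter(candidates)
    answer, _ = counts.most_common(1)[0]
    return answer
```

Filtering before voting is what distinguishes Critic-CoT aggregation from plain Self-Consistency, which votes over all sampled chains.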
Empirical calibration of the critic is measured by its accuracy and recall over solution attempts, treating an incorrect attempt that the critic flags as a true positive:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
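These calibration metrics are straightforward to compute from attempt-level labels; the following is an illustrative sketch, not code from the paper:

```python
def critic_metrics(attempt_correct, critic_flags_error):
    """Compute critic accuracy and recall over a batch of attempts.

    attempt_correct:    list of bools, ground-truth correctness of each attempt.
    critic_flags_error: list of bools, True when the critic flags an error.
    A "positive" is an incorrect attempt, which the critic should flag.
    """
    pairs = list(zip(attempt_correct, critic_flags_error))
    tp = sum(1 for ok, flag in pairs if not ok and flag)      # bad attempt caught
    tn = sum(1 for ok, flag in pairs if ok and not flag)      # good attempt passed
    fn = sum(1 for ok, flag in pairs if not ok and not flag)  # bad attempt missed
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall
```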
Training is performed with auto-generated distant supervision data: solution attempts are produced, critiques are labeled (by stronger LLMs or by teacher models), and only attempts refined to the correct answer are retained.
2. System-2 Critique vs System-1 Feedback
Standard CoT enhancements (Self-Consistency, Self-Refine, naive filter) broadly resemble "System-1" reasoning: fast, shallow, primarily instance-level feedback. Critic-CoT specifically advances "System-2" analytic capabilities:
- Step-wise evaluation replaces bulk output scoring, allowing targeted feedback (each chain segment or calculation is separately analyzed).
- Critique annotation enables the model to discriminate subtle logical errors, not merely filter overtly incorrect entire chains.
- Iterative refinement recursively corrects flaws, yielding increased robustness to local errors, especially in tasks where failure in one reasoning step can cascade.
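The critique-then-refine cycle described above can be sketched as a simple loop. The three callables (`generate`, `critique`, `refine`) are placeholders for LLM calls, not a fixed API from the paper:

```python
def critic_refine(question, generate, critique, refine, max_rounds=3):
    """Iteratively critique a CoT attempt and refine from the earliest flaw.

    generate(question)         -> list of reasoning steps (strings).
    critique(question, steps)  -> list of 0/1 labels, one per step.
    refine(question, steps, i) -> new steps, regenerated from index i onward.
    """
    steps = generate(question)
    for _ in range(max_rounds):
        labels = critique(question, steps)
        if all(labels):                   # critic finds no error: accept chain
            return steps
        first_bad = labels.index(0)       # earliest flagged step
        steps = refine(question, steps, first_bad)
    return steps                          # best effort after max_rounds
```

Restarting at the earliest flagged step is what keeps a single local error from cascading through the rest of the chain.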
In practice, Critic-CoT models consistently report higher top-1 and majority-vote accuracies on the GSM8K and MATH datasets, reaching up to 95.4% test accuracy after critic-refine training and majority-vote inference (Zheng et al., 29 Aug 2024).
3. Distant Supervision and Critique Data Construction
The Critic-CoT framework constructs supervision data without manual annotation:
- For each question, a base generator produces candidate solutions with intermediate reasoning steps.
- A teacher model (e.g., GPT-4-Turbo) or the model itself acts as the critic, labeling steps as correct/incorrect and producing refined solutions.
- Supervision pairs are accepted only when iterative refinement leads to the gold answer and critique labels properly localize failure points.
This automatic annotation strategy supports scalable training and distills strong critique behaviors without extensive human involvement.
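The acceptance criterion for supervision pairs can be made concrete with a short filter. This is a hedged sketch of the pipeline; the callables and record fields are illustrative stand-ins for (teacher-)model calls, not the paper's interface:

```python
def build_critique_data(questions, generate, critic_label, refine_to_answer, gold):
    """Distant-supervision construction: keep only attempts whose critique
    localizes a failure point and whose refinement reaches the gold answer.

    generate(q)                  -> list of reasoning steps.
    critic_label(q, steps)       -> per-step 0/1 labels from the critic.
    refine_to_answer(q, s, lab)  -> (refined steps, final answer string).
    gold                         -> dict mapping question -> gold answer.
    """
    dataset = []
    for q in questions:
        steps = generate(q)
        labels = critic_label(q, steps)
        refined, answer = refine_to_answer(q, steps, labels)
        # Accept only when refinement recovers the gold answer and the
        # critique flagged at least one step (i.e., it localized the error).
        if answer == gold[q] and 0 in labels:
            dataset.append({"question": q, "steps": steps,
                            "labels": labels, "refined": refined})
    return dataset
```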
4. Intrinsic Correlation Between Critique and Task-Solving Abilities
Experiments in Critic-CoT (Zheng et al., 29 Aug 2024) reveal that critique and reasoning abilities reinforce each other. Training models to identify and correct mistakes not only improves self-correction rates but additionally uplifts overall solution generation quality. Contrary to the notion that critique and reasoning ability compete for model capacity, Critic-CoT empirical results show mutual benefit: critique capability increases the likelihood that future chains are correct, leading to improved task-solving rates.
5. Symbolic Reasoning and Selective Application
The efficacy of Critic-CoT strongly aligns with symbolic and mathematical domains, as supported by meta-analyses (Sprague et al., 18 Sep 2024). On math and logic problems—where errors can be cleanly localized and corrected—iterative critique yields substantial gains. However, in commonsense or knowledge-based tasks, the marginal benefits are limited, and direct answering can rival or surpass chain-of-thought plus critique approaches.
A plausible implication is that Critic-CoT architectures should be selectively applied to symbolic domains (math, code synthesis, algorithmic reasoning) or in settings where intermediate step validation is tractable and valuable.
6. Preference Optimization, Efficiency, and Hybrid Reasoning Strategies
Recent extensions integrate Critic-CoT with preference learning and efficiency optimization (Luo et al., 30 Apr 2025). Hybrid models can produce both long and short CoT chains; bi-level training strategies drive the model to select reasoning pathways that balance brevity, correctness, and preference margins. Compression of chains-of-thought is encouraged within correct solution groups, and adaptive mode selection gates the inference flow between reflective (long CoT) and concise (short CoT) strategies.
This approach results in substantial reductions in inference cost (average reasoning length reduced by over 50% on mathematical datasets), with only marginal accuracy drops.
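The adaptive gate between reflective and concise reasoning can be illustrated with a simple selection rule. This is a schematic stand-in for the bi-level preference criterion, with a hypothetical `score` field for the model's estimated correctness and a `margin` threshold of my own choosing:

```python
def select_chain(candidates, margin=0.1):
    """Adaptive selection between long (reflective) and short (concise) CoT.

    candidates: list of dicts with keys
        'mode'  : 'long' or 'short'
        'score' : estimated correctness probability (hypothetical signal)
        'length': number of generated tokens
    Prefer the short chain unless the long chain's score exceeds it by
    more than `margin`, trading a small accuracy delta for inference cost.
    """
    short = max((c for c in candidates if c["mode"] == "short"),
                key=lambda c: c["score"], default=None)
    long_ = max((c for c in candidates if c["mode"] == "long"),
                key=lambda c: c["score"], default=None)
    if short is None:
        return long_
    if long_ is None:
        return short
    return long_ if long_["score"] - short["score"] > margin else short
```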
7. Scaling, Robustness, and Limitations
Critic-CoT’s main design strengths are scalability and robustness against local errors. Distant supervision avoids annotation bottlenecks; step-wise refinement localizes and mitigates intermediate step errors. Nevertheless, robustness to distribution shift and noisy intermediate steps remains a challenge. Studies indicate that CoT-based architectures—including Critic-CoT variants—are sensitive to data corruption in step annotations and to discrepancies between train/test distributions (Yin et al., 12 Jun 2025).
Similarly, in pattern-based in-context learning and implicit reasoning settings, explicit critique may not fully overcome the limitations of CoT prompting, especially when underlying patterns are hard to verbalize (Zheng et al., 7 Apr 2025).
8. Theoretical Underpinnings and Information Measures
The statistical theory for CoT supervision (Altabaa et al., 21 May 2025) provides rigorous sample complexity bounds for Critic-CoT: chains-of-thought carry discriminative information not present in end-to-end outputs. The information measure
$$\mathcal{I}_{\mathrm{CoT}}(\epsilon;\mathcal{H}) = \inf_{h \in \Delta^{\mathrm{e2e}}_{\mathcal{D}}(\epsilon; \mathcal{H}, h_*)} -\log \Pr_{x \sim \mathcal{D}} \left[ h^{\mathrm{CoT}}(x) = h_*^{\mathrm{CoT}}(x),\; h^{\mathrm{e2e}}(x) = h_*^{\mathrm{e2e}}(x) \right]$$
shows that high-informativeness chains-of-thought dramatically reduce required sample size. This implies that Critic-CoT systems can exploit critiques of reasoning traces to accelerate learning and improve generalization.
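A schematic version of the resulting sample-complexity statement, assuming a finite hypothesis class for simplicity (the precise bound in the cited work carries additional terms):

```latex
% Schematic sample-complexity implication (finite \mathcal{H} assumed):
m \;\gtrsim\; \frac{\log |\mathcal{H}|}{\mathcal{I}_{\mathrm{CoT}}(\epsilon;\mathcal{H})}
```

The denominator makes the qualitative point: the more discriminative the chain-of-thought (larger $\mathcal{I}_{\mathrm{CoT}}$), the fewer supervised samples are needed to reach end-to-end error $\epsilon$.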
9. Extensions: Collaborative Critique, Multimodal Domains, and Reinforcement Learning
Critic-CoT is extensible to collaborative settings (Co-CoT (Yoo, 23 Apr 2025)), where human users can edit, analyze, and adapt modular reasoning blocks, and models respond with preference-sensitive iterations. In multimodal vision-language domains, self-critic and rationale-augmented frameworks (Re-Critic (Yang et al., 12 May 2025), LLaVA-Critic-R1 (Wang et al., 31 Aug 2025)) combine critic data and RL training for both generation and evaluation, producing unified models with strong “think-then-evaluate” cycles and benchmark-leading performance on tasks such as MMMU.
In code synthesis and broader deductive reasoning, Critique Reinforcement Learning (CRL (Ruan et al., 26 Sep 2025)) requires models to produce explicit critiques as part of the training objective, resulting in improved reasoning and generalization across both code and symbolic tasks.
10. Controversies and Perspectives
There is consensus that Critic-CoT frameworks bring tangible gains in stepwise reasoning domains with clearly verifiable intermediate steps. However, their effectiveness is domain-dependent, sensitive to data conditions, and at times bounded by the quality of explicit rationales and robustness to information loss. The scaling of Critic-CoT to diverse tasks—especially those lacking a strong structural notion of intermediate correctness—remains an open research area. Some competing perspectives conceptualize CoT and Critic-CoT as constraints that guide imitation rather than true reasoning (Shao et al., 3 Jun 2025), highlighting ongoing debate around the nature of emergent reasoning in LLMs.
Summary Table: Critic-CoT Features Across Representative Papers
| Aspect | Symbolic Math (GSM8K/MATH) | Commonsense Reasoning | Multimodal VLMs |
|---|---|---|---|
| Critic-CoT Gains | Significant (>95% top-1) | Limited/Marginal | Strong with reasoned critic |
| Iterative Refinement | Effective error correction | Weak for pattern tasks | Used for self-evaluation |
| Distant Supervision | Scalable | Applicable | Applies to critic data |
| Preference Filtering | Majority-vote, DPO | Less impactful | RL/Policy/Format reward |
| Robustness to Noise/Shift | Sensitive | Sensitive | Sensitive to rationale |
| Collaborative Extension | Modular blocks (Co-CoT) | Possible | Modular via critic blocks |
In summary, Critic-CoT comprises a suite of architectures and strategies for step-wise self-critique, iterative refinement, and preference-based selection of reasoning chains within LLMs. These frameworks excel at enhancing symbolic and mathematical reasoning, especially when tasks admit explicit, verifiable steps; they underpin scalable methods for auto-supervised training and robust solution selection, but also raise open questions concerning domain coverage, sensitivity to noise and data shifts, and the intersection of genuine reasoning with pattern-guided imitation.