
Self-Critique in LLM Planning

Updated 6 April 2026
  • Self-Critique LLM Planning is a framework where an LLM generates and iteratively critiques its own plans to enhance validity and robustness.
  • The methodology employs a two-loop structure and modular architectures, integrating natural language feedback, Bayesian inference, and external verifiers.
  • Empirical evaluations reveal that while naive self-critique can suffer from false positives, modular actor–critic approaches significantly improve plan precision.

Self-critique LLM planning denotes planning workflows and algorithmic frameworks in which an LLM is tasked not only with generating candidate plans but also with critiquing (i.e., verifying, evaluating, or refining) its own plans, typically through one or more internal iterative loops. The underlying hypothesis is that by endowing LLMs with explicit self-verification mechanisms, these systems can iteratively improve plan correctness, robustness, and overall performance, with or without recourse to external symbolic verifiers or oracles. The empirical and theoretical literature presents conflicting findings on the reliability, limitations, and best practices for self-critique in LLM-driven planning pipelines, spanning symbolic planning, open-ended reasoning, multi-step mathematical and algorithmic inference, and agentic decision-making.

1. Foundational Frameworks and Mathematical Formulation

The standard formalization of self-critique LLM planning adopts a two-loop structure. Define a planning problem as a tuple $(D, s_0, G)$ over a domain $D$ (with actions, preconditions, and effects), an initial state $s_0$, and a goal condition $G$. The LLM is first used as a plan generator, $G_{\mathrm{LLM}}$, to produce a plan $P_0$. Then the same LLM, a separate instance of it, or a distinct model serves as the verifier/critic $V_{\mathrm{LLM}}$:

$$P_0 \leftarrow G_{\mathrm{LLM}}(D, s_0, G)$$

$$\text{for } i = 0{:}N-1 \text{ do } \left\{ \begin{array}{l} \text{if } V_{\mathrm{LLM}}(P_i) = \text{valid} \text{ then return } P_i \\ \text{else: } \text{feedback}_i \leftarrow \mathrm{Critique}_{\mathrm{LLM}}(P_i) \\ \quad P_{i+1} \leftarrow G_{\mathrm{LLM}}(D, s_0, G, \text{feedback}_i) \end{array} \right.$$

where $N$ is a fixed iteration cap.
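The loop above can be sketched in Python as follows. This is a minimal illustration, not any cited system's implementation: `generate`, `verify`, and `critique` are hypothetical callables standing in for $G_{\mathrm{LLM}}$, $V_{\mathrm{LLM}}$, and $\mathrm{Critique}_{\mathrm{LLM}}$, with the problem $(D, s_0, G)$ assumed to be bound inside `generate`. Returning `None` when the cap is exhausted is also an assumption, since the formulation does not say what happens to the last, unverified plan.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]  # a plan as an ordered list of action strings (assumption)


def self_critique_loop(
    generate: Callable[[Optional[str]], Plan],  # hypothetical G_LLM; argument is feedback or None
    verify: Callable[[Plan], bool],             # hypothetical V_LLM: True means "valid"
    critique: Callable[[Plan], str],            # hypothetical Critique_LLM
    max_iters: int,                             # the iteration cap N
) -> Tuple[Optional[Plan], int]:
    """Return (accepted plan or None, number of verifier calls made)."""
    plan = generate(None)  # P_0 <- G_LLM(D, s0, G)
    for i in range(max_iters):
        if verify(plan):  # V_LLM(P_i) = valid -> return P_i
            return plan, i + 1
        feedback = critique(plan)  # feedback_i <- Critique_LLM(P_i)
        plan = generate(feedback)  # P_{i+1} <- G_LLM(D, s0, G, feedback_i)
    # Cap reached: the last regenerated plan was never verified, so report failure.
    return None, max_iters


# Toy usage with stub components (hypothetical): the "verifier" accepts two-step plans.
candidates = iter([["pick-up A"], ["pick-up A", "stack A B"]])
plan, calls = self_critique_loop(
    generate=lambda feedback: next(candidates),
    verify=lambda p: len(p) == 2,
    critique=lambda p: f"plan has {len(p)} step(s); goal not reached",
    max_iters=5,
)
print(plan, calls)  # ['pick-up A', 'stack A B'] 2
```

Swapping a sound external checker in for `verify` gives the LLM+external-verifier configuration discussed in Section 2 without changing the loop itself.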

A core feature is the alternation between plan generation and plan critique, which may be instantiated by the same LLM (“intrinsic” self-critique (Bohnet et al., 30 Dec 2025)), by two LLMs with segregated contexts or parameterizations (actor–critic splits (Fan, 26 Nov 2025, Yang et al., 20 Mar 2025)), or via more modular architectures involving ensemble or merged weights with explicit critic heads (Gallego, 2024).

In some systems, the self-critique loop is formalized probabilistically as latent-variable Bayesian inference: the critique is treated as an auxiliary latent variable in a Gibbs-sampling scheme, and the acceptance step may be handled by a Metropolis–Hastings update under a reward model (Gallego, 2023).
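A single critique-and-revise step with a Metropolis–Hastings acceptance test can be sketched as below. This is a hedged illustration rather than the procedure of Gallego (2023): it assumes an unnormalized target proportional to $\exp(r(x)/T)$ for a reward model $r$ and temperature $T$, and it assumes the critique-then-revise proposal is symmetric so the proposal densities cancel (LLM proposals are generally not symmetric, so a faithful version would need those terms). `critique`, `revise`, and `reward` are hypothetical callables.

```python
import math
import random
from typing import Callable, Tuple


def mh_self_critique_step(
    current: str,
    current_reward: float,
    critique: Callable[[str], str],     # hypothetical: samples a critique c ~ p(c | x)
    revise: Callable[[str, str], str],  # hypothetical: samples a revision x' ~ p(x' | x, c)
    reward: Callable[[str], float],     # hypothetical reward model r(x)
    temperature: float,
    rng: random.Random,
) -> Tuple[str, float]:
    """One critique -> revise proposal followed by a Metropolis-Hastings accept/reject.

    Assumes a target proportional to exp(r(x) / T) and a symmetric proposal,
    so the log acceptance ratio reduces to (r(x') - r(x)) / T.
    """
    c = critique(current)          # auxiliary latent variable: the critique
    proposal = revise(current, c)  # candidate revision conditioned on the critique
    r_new = reward(proposal)
    log_alpha = (r_new - current_reward) / temperature
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        return proposal, r_new         # accept the revision
    return current, current_reward     # reject: keep the current draft
```

Repeating this step yields a Markov chain over drafts; improvements are always accepted, while worse revisions are accepted only with probability $\exp((r(x') - r(x))/T)$.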

2. Empirical Evaluation and Limitations

Multiple empirical studies show that naive self-critique, in which a single LLM is used for both plan generation and verification (LLM+LLM), often suffers from high false-positive rates, i.e., invalid plans incorrectly marked as valid by the LLM verifier. In classical PDDL and STRIPS-style planning domains such as Blocksworld and Mystery Blocksworld, Valmeekam et al. (2023) report that:

  • One-shot LLM plan generation yields 40% plan validity.
  • LLM+LLM iterative self-critique raises correctness modestly, to 55%, which remains well below LLM+external verifier (88% validity with a sound external checker).
  • Critique accuracy is undermined by pervasive false positives (e.g., 84% false-positive rate for plan verification), leading to premature acceptance of incorrect plans.
  • The granularity or content of feedback (binary vs. error explanations) shows minimal additional impact once the verifier itself is unsound.
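The effect of false positives on accepted plans can be made concrete with Bayes' rule. In the sketch below, the 40% one-shot validity and the 84.4% false-positive rate reported for LLM+LLM verification are taken from this section, while the true-positive rate of 1.0 (the verifier never rejects a valid plan) is a best-case assumption not reported here. Treating these rates as applying to one pass over the same plan distribution is also an assumption, so the output is illustrative only.

```python
def accepted_plan_validity(p_valid: float, tpr: float, fpr: float) -> float:
    """P(plan is valid | verifier accepts it), via Bayes' rule.

    p_valid: prior fraction of generated plans that are actually valid
    tpr:     P(verifier accepts | plan valid)
    fpr:     P(verifier accepts | plan invalid)
    """
    accepted_valid = p_valid * tpr
    accepted_invalid = (1.0 - p_valid) * fpr
    total = accepted_valid + accepted_invalid
    if total == 0.0:
        raise ValueError("verifier never accepts any plan")
    return accepted_valid / total


# Illustrative single pass: LLM verifier vs. a sound verifier (fpr = 0).
print(round(accepted_plan_validity(0.40, 1.0, 0.844), 3))  # 0.441
print(accepted_plan_validity(0.40, 1.0, 0.0))              # 1.0
```

Under these assumptions, fewer than half of the plans the LLM verifier accepts on a first pass would be valid, barely above the 40% base rate, whereas a verifier with no false positives only ever accepts valid plans.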

A parallel study by Stechly et al. (2024) confirms that iterative self-critique with LLMs can actually degrade overall solution rates in various formally verifiable domains, including algorithmic and symbolic tasks, compared both to naive generation and to systems with reliable external verification. They further show that the classical intuition that "verification is easier than generation" does not carry over to LLMs, which they attribute to the retrieval-like behavior and biases of these models.

In contrast, using an external, sound symbolic verifier (e.g., VAL) for the verification step, when feasible, closes most of the correctness gap and yields large performance gains for both planning (Valmeekam et al., 2023, Stechly et al., 2024) and reasoning workflows.
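To illustrate what makes such a verifier sound, the sketch below simulates a ground (parameter-free) STRIPS plan action by action, checking preconditions before applying delete and add effects and testing the goal at the end. It is a toy stand-in, not VAL, which validates full PDDL; the Blocksworld fact and action names are hypothetical. The returned error message is the kind of localized feedback an LLM generator can be re-prompted with.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Action:
    pre: FrozenSet[str]     # facts that must hold before the action
    add: FrozenSet[str]     # facts made true by the action
    delete: FrozenSet[str]  # facts made false by the action


def validate_plan(
    actions: Mapping[str, Action],
    init: Iterable[str],
    goal: Iterable[str],
    plan: Iterable[str],
) -> Tuple[bool, Optional[str]]:
    """Simulate a ground STRIPS plan; return (is_valid, first error or None)."""
    state = set(init)
    for step, name in enumerate(plan):
        act = actions.get(name)
        if act is None:
            return False, f"step {step}: unknown action {name!r}"
        missing = act.pre - state
        if missing:
            return False, f"step {step}: {name} lacks preconditions {sorted(missing)}"
        state = (state - act.delete) | act.add  # delete effects first, then add effects
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not satisfied: {sorted(unmet)}"
    return True, None


# Hypothetical two-block instance: B sits on A; the goal is B on the table.
ACTIONS = {
    "unstack B A": Action(
        pre=frozenset({"on B A", "clear B", "handempty"}),
        add=frozenset({"holding B", "clear A"}),
        delete=frozenset({"on B A", "clear B", "handempty"}),
    ),
    "putdown B": Action(
        pre=frozenset({"holding B"}),
        add=frozenset({"ontable B", "clear B", "handempty"}),
        delete=frozenset({"holding B"}),
    ),
}
INIT = {"on B A", "ontable A", "clear B", "handempty"}
GOAL = {"ontable B"}

print(validate_plan(ACTIONS, INIT, GOAL, ["unstack B A", "putdown B"]))  # (True, None)
print(validate_plan(ACTIONS, INIT, GOAL, ["putdown B"]))
# (False, "step 0: putdown B lacks preconditions ['holding B']")
```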

3. Advances in Self-Critique Architectures

Recent approaches address the limitations of basic self-critique through modularization, training, and architectural improvements:

Separation of Roles: The Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR) pipeline (Fan, 26 Nov 2025) decomposes the planning process across three distinct agents (actor, critic, refiner) and integrates environment-specific knowledge graphs to explicitly align plan verification with environmental feasibility, sharply reducing spurious self-justification. Dedicated critics, structured feedback with subgoal feasibility tracing, and selective refinement introduce modularity and error locality not achievable in single-LLM self-critique.
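The division of labor and selective refinement can be illustrated with a deliberately simplified, non-LLM sketch. It is not the SGA-ACR implementation: the subgoal graph (each subgoal mapped to its prerequisite subgoals, with Crafter-style names) is hypothetical, the actor is replaced by a fixed initial plan, and the critic and refiner are symbolic stand-ins for LLM agents. The critic localizes the first infeasible subgoal, and the refiner edits only that position, leaving the rest of the plan intact.

```python
from typing import Dict, List, Optional, Set, Tuple

SubgoalGraph = Dict[str, Set[str]]  # subgoal -> prerequisite subgoals (hypothetical)


def critic(plan: List[str], graph: SubgoalGraph, achieved: Set[str]) -> Optional[Tuple[int, Set[str]]]:
    """Return (index, missing prerequisites) for the first infeasible subgoal, or None if feasible."""
    done = set(achieved)
    for i, subgoal in enumerate(plan):
        missing = graph.get(subgoal, set()) - done
        if missing:
            return i, missing
        done.add(subgoal)
    return None


def refiner(plan: List[str], index: int, missing: Set[str]) -> List[str]:
    """Selective repair: insert missing prerequisites just before the failing subgoal."""
    return plan[:index] + sorted(missing) + plan[index:]


def actor_critic_refiner(
    initial_plan: List[str], graph: SubgoalGraph, achieved: Set[str], max_rounds: int = 10
) -> Optional[List[str]]:
    plan = list(initial_plan)  # stands in for the actor's proposal
    for _ in range(max_rounds):
        verdict = critic(plan, graph, achieved)
        if verdict is None:
            return plan
        plan = refiner(plan, *verdict)
    return None  # could not repair within the round budget


GRAPH = {
    "collect wood": set(),
    "place table": {"collect wood"},
    "make wood pickaxe": {"collect wood", "place table"},
    "collect stone": {"make wood pickaxe"},
}
print(actor_critic_refiner(["place table", "collect stone"], GRAPH, achieved=set()))
# ['collect wood', 'place table', 'make wood pickaxe', 'collect stone']
```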

Stepwise Natural Language Critique: The PANEL framework (Li et al., 21 Mar 2025) operationalizes step-level plan search with explicit natural language self-critiques of each candidate, shown to outperform scalar reward-based and pure self-evaluation strategies, especially on multi-step reasoning and planning tasks. The algorithm alternates candidate expansion, stepwise self-critique, and informed selection, systematically preserving high-dimensional feedback at every planning stage.
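The expand, critique, and select cycle can be sketched generically as below. This is a toy stand-in rather than PANEL itself: in PANEL the candidates, critiques, and selection come from an LLM, whereas here `expand`, `critique`, and `select` are hypothetical Python functions on a small puzzle (reach 10 from 1 by doubling, adding 3, or adding 1). What it illustrates is that selection reads the textual critique of each candidate rather than a scalar score, and that every step's critiques are retained.

```python
from typing import List, Optional, Tuple

TARGET = 10


def expand(state: int) -> List[int]:
    """Propose candidate next steps (hypothetical action set: double, +3, +1)."""
    return [state * 2, state + 3, state + 1]


def critique(state: int, candidate: int) -> str:
    """Natural-language critique of one candidate step (state is unused in this toy)."""
    if candidate == TARGET:
        return "OK: reaches the target exactly"
    if candidate > TARGET:
        return f"REJECT: overshoots the target by {candidate - TARGET}"
    return f"OK: {TARGET - candidate} still to go"


def select(candidates: List[int], critiques: List[str]) -> Optional[int]:
    """Informed selection from critique text: prefer exact hits, then the first acceptable step."""
    for cand, text in zip(candidates, critiques):
        if "exactly" in text:
            return cand
    for cand, text in zip(candidates, critiques):
        if text.startswith("OK"):
            return cand
    return None  # every candidate was rejected


def stepwise_critique_search(start: int, max_steps: int) -> Tuple[List[int], List[List[str]]]:
    trace, critique_log = [start], []
    state = start
    for _ in range(max_steps):
        if state == TARGET:
            break
        cands = expand(state)
        texts = [critique(state, c) for c in cands]  # keep full critiques at every step
        critique_log.append(texts)
        nxt = select(cands, texts)
        if nxt is None:
            break
        state = nxt
        trace.append(state)
    return trace, critique_log


trace, log = stepwise_critique_search(1, max_steps=10)
print(trace)  # [1, 2, 4, 8, 9, 10]
```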

Unified Self-Critique Heads: Stepwise Think-Critique (STC) (Xu et al., 17 Dec 2025) integrates self-critique as an interleaved mode within the same model, trained with reinforcement learning objectives that jointly reward stepwise reasoning correctness, consistency between reasoning and critique, and well-formed explanatory traces. Dense reward shaping propagates critique-consistency rewards across the full plan trace, improving both performance and interpretability.
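As a rough illustration of how critique consistency could enter a dense, per-step reward, the sketch below scores each reasoning step for correctness and for agreement between the step's self-critique verdict and its actual correctness, adds a format bonus on the final step, and converts the rewards to per-step returns. The weights, the binary signals, and the reward-to-go formulation are all assumptions for illustration; they are not the published STC objective.

```python
from typing import List, Sequence


def stc_style_step_rewards(
    step_correct: Sequence[bool],       # ground-truth correctness of each reasoning step
    critique_verdicts: Sequence[bool],  # the model's own critique: "this step is correct"?
    well_formed: bool,                  # whether the full trace follows the required format
    w_correct: float = 1.0,             # hypothetical weights
    w_consistent: float = 0.5,
    w_format: float = 0.1,
) -> List[float]:
    if len(step_correct) != len(critique_verdicts):
        raise ValueError("one critique verdict per step is required")
    rewards = [
        w_correct * float(ok) + w_consistent * float(ok == verdict)
        for ok, verdict in zip(step_correct, critique_verdicts)
    ]
    if rewards:
        rewards[-1] += w_format * float(well_formed)  # format bonus on the final step
    return rewards


def reward_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Discounted return from each step onward, so later rewards also credit earlier steps."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


r = stc_style_step_rewards([True, False], [True, True], well_formed=True)
print(r, reward_to_go(r))  # [1.5, 0.1] [1.6, 0.1]
```

In this toy, the second step is wrong and its critique wrongly calls it correct, so it earns neither the correctness nor the consistency term.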

Bayesian Self-Critique and Distillation: A Bayesian framework (Gallego, 2023) treats self-critique as latent-variable inference, alternating critique (diagnosis) and revision steps, and amortizes the resulting improved posterior via a separate distilled model (dSC) for efficient inference-time deployment. This yields LLMs that internalize the benefits of iterative self-critique without requiring repeated loops at inference.

Merged Actor-Critic Models: Merging the parameters of a base LLM with those of a pre-trained critic head (Gallego, 2024) strengthens self-critique for adversarial robustness, notably reducing the success rate of "jailbreak" prompts by aligning the base LLM's judgments with structured critical feedback at inference time.
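One common way to merge two models with identical architectures is linear interpolation of corresponding parameters, $\theta_{\text{merged}} = (1-\alpha)\,\theta_{\text{base}} + \alpha\,\theta_{\text{critic}}$. The sketch below assumes that scheme for illustration only; the merge method and mixing weight $\alpha$ used by Gallego (2024) are not specified here, and real merges operate on framework tensors rather than the plain Python lists used to keep the example self-contained.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (toy representation)


def linear_merge(base: Weights, critic: Weights, alpha: float) -> Weights:
    """Per-parameter interpolation: (1 - alpha) * base + alpha * critic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != critic.keys():
        raise ValueError("models must share parameter names (same architecture)")
    merged: Weights = {}
    for name, b in base.items():
        c = critic[name]
        if len(b) != len(c):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(b, c)]
    return merged


print(linear_merge({"w": [0.0, 2.0]}, {"w": [1.0, 0.0]}, alpha=0.5))  # {'w': [0.5, 1.0]}
```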

4. Benchmarking and Quantitative Results

A variety of benchmarks and evaluation measures have been used to quantify self-critique effectiveness:

Core Metrics: Plan-generation accuracy (fraction of valid plans), true/false positive and negative rates for verification, critique precision/recall/F1 (Lin et al., 2024), and downstream performance post-correction and refinement iterations.
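These verification metrics can be computed from paired verdicts and ground-truth labels as sketched below. The convention is an assumption: "positive" means the verifier accepts a plan as valid, so a false positive is an invalid plan marked valid, matching the usage in Section 2. Benchmarks that define the positive class as "the critique flags an error" would swap the labels before computing these.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VerifierMetrics:
    tpr: float        # valid plans accepted / all valid plans (= recall)
    fpr: float        # invalid plans accepted / all invalid plans
    precision: float  # valid plans accepted / all accepted plans
    f1: float


def _ratio(num: float, den: float) -> float:
    return num / den if den else 0.0  # define 0/0 as 0 for empty classes


def verifier_metrics(accepted: Sequence[bool], actually_valid: Sequence[bool]) -> VerifierMetrics:
    if len(accepted) != len(actually_valid):
        raise ValueError("need one ground-truth label per verdict")
    pairs = list(zip(accepted, actually_valid))
    tp = sum(1 for a, v in pairs if a and v)
    fp = sum(1 for a, v in pairs if a and not v)
    fn = sum(1 for a, v in pairs if not a and v)
    tn = sum(1 for a, v in pairs if not a and not v)
    tpr = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return VerifierMetrics(
        tpr=tpr,
        fpr=_ratio(fp, fp + tn),
        precision=precision,
        f1=_ratio(2 * precision * tpr, precision + tpr),
    )


# Hypothetical verdicts on four plans, only the first of which is valid.
print(verifier_metrics([True, True, True, False], [True, False, False, False]))
```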

CriticBench GQC Framework: CriticBench (Lin et al., 2024) introduces the GQC (Generation, Quality control/Critique, Correction) schema, and demonstrates nearly linear scaling between generation and critique scores, with correction accuracy heavily dependent on both critique accuracy and domain complexity.

System | Plan validity (Blocksworld) | Critique FPR | Best use case
LLM, no self-critique | 40% | n/a | Fast one-shot baselines
LLM+LLM self-critique | 55% | 84.4% | Marginal gain in absence of a verifier
LLM+sound verifier | 88% | n/a | Best correctness; critical tasks
Intrinsic self-critique | up to 85–89% | not reported | SoTA in selected domains (2024)

Empirically Verified SoTA: The intrinsic self-critique method of Bohnet et al. (30 Dec 2025) achieved 85–89% correctness on Blocksworld and Logistics (with an Oct 2024 model checkpoint), outperforming previous self-critique baselines (49–57%) but still falling short of idealized “oracle” perfect verification (91–95%).

Stepwise and Modular Critique Gains: Stepwise natural language self-critique (PANEL (Li et al., 21 Mar 2025)) and actor/critic/refiner splits (SGA-ACR (Fan, 26 Nov 2025)) consistently outperform scalar-verifier and one-shot approaches, especially where error localization or qualitative feedback is essential.

5. Practical Design Patterns and Best Practices

Best practices emerging from the literature for deploying self-critique in LLM planning include:

6. Open Problems, Controversies, and Future Directions

There remains substantial controversy regarding the effectiveness, reliability, and generalizability of self-critique strategies:

  • Multiple studies underscore the unreliability of self-generated critiques versus external, ground-truth verifiers, especially for domains requiring precise symbolic reasoning or detection of subtle plan infeasibilities (Valmeekam et al., 2023, Stechly et al., 2024).
  • Some research achieves significant gains using intrinsic self-critique, particularly via prompt engineering, iterative refinement, and ensemble self-consistency (Bohnet et al., 30 Dec 2025). Nonetheless, the observed upper bounds remain below what is achieved with perfect symbolic verification.
  • Merged and multi-agent architectures show promising robustness advances, particularly for adversarial and safety-critical planning, but rely on careful tuning of critic weights and alignment of actor-critic objectives (Gallego, 2024, Yang et al., 20 Mar 2025).
  • The field continues to explore hybrid regimes fusing natural language feedback, reward modeling, and symbolic or external validators, as well as joint training regimes in which critique-consistency is a first-class reward (Xu et al., 17 Dec 2025).
  • Diagnostic frameworks such as CriticBench (Lin et al., 2024) are vital for quantifying critique and correction capacities as a function of model size, domain, and architectural choices.

Self-critique remains an active research area in LLM planning, algorithmic reasoning, and agentic control: improvements in algorithmic structures, reward shaping, grounding, and critique-specific training are progressively narrowing the gap between intrinsic LLM self-verification and the reliability of classical, formal verifiers. However, the evidence to date indicates that when correctness is non-negotiable, self-critique alone cannot yet supplant external symbolic verification, though it offers valuable supplementary mechanisms, particularly when external ground truth is unavailable or infeasible to obtain.
