Critic Agents: Evaluators in AI

Updated 15 April 2026
  • Critic agents are explicit algorithmic modules that evaluate and assign credit to an actor’s outputs using methods like retrospective, step-wise, and discrimination-stage feedback.
  • They employ techniques such as value-function estimation, template matching, and advantage shaping to offer dense, actionable supervision and improve sample efficiency.
  • Their applications span multi-agent systems, multi-hop QA, GUI automation, and financial QA, consistently delivering measurable performance gains and improved error localization.

A critic agent is an explicit subsystem—typically an algorithmic module or an instantiation of a neural network, LLM, or prompt-driven policy—whose core function is to evaluate, diagnose, and assign credit or blame to the actions or outputs of another agent (the "actor" or "policy network"). Critic agents generalize classic value-function critics in reinforcement learning (RL) to a broader range of modalities, including language-based tool use, multi-agent systems, feature engineering, and code review. They are essential for high-signal credit assignment, stability in policy optimization, error localization, and robust decision making across autonomous and human-AI collaborative pipelines.
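
To fix terminology, the following minimal Python sketch shows the actor–critic contract this article assumes throughout. The names and signatures are illustrative, not taken from any cited system.

```python
from typing import Protocol

class Actor(Protocol):
    """Produces an output (action, answer, code, ...) for an input."""
    def act(self, observation: str) -> str: ...

class Critic(Protocol):
    """Evaluates the actor's output; may see privileged information
    (gold answers, system state) that the actor never observes."""
    def evaluate(self, observation: str, action: str) -> float: ...

def step(actor: Actor, critic: Critic, observation: str) -> tuple[str, float]:
    # The critic scores the actor's output; richer critics return
    # textual critiques instead of (or alongside) a scalar.
    action = actor.act(observation)
    return action, critic.evaluate(observation, action)
```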

1. Canonical Architectures and Formal Roles

Critic agents are best categorized by their architectural integration and the granularity of their diagnostic output.

  • Value-function critics (classic RL): Estimate (state, action)-value functions for actor-critic RL, determining the scalar value of a state or action to drive policy gradients. For example, in multi-agent RL, this is formalized as $Q^{\text{c}}(s, a)$ for centralized critics, or $Q^{\text{h}}(h, a)$ for decentralized, history-based critics (Lyu et al., 2024).
  • Retrospective LLM critics: Frozen, instruction-tuned LLMs that scan an actor-generated trajectory after the outcome is revealed (e.g., after receiving a gold answer). These assign fine-grained, turn-level feedback using privileged, hindsight information (CriticSearch; Zhang et al., 15 Nov 2025).
  • Discrimination-stage LLM critics: In multi-agent evaluation systems such as RevAgent, a discrimination-stage LLM critic compares multiple candidate outputs to select the most informative or relevant, operating as a meta-evaluator (Li et al., 1 Nov 2025).
  • Step-wise diagnostic critics: In GUI agents and code review, critics act at the level of individual steps, evaluating candidate operations or comments before execution to block, flag, or refine unsafe or incorrect actions (Wu et al., 18 Dec 2025, Li et al., 1 Nov 2025).
  • Co-evolving or on-policy critics: RL-based critic agents that update their parameters synchronously with the policy, ensuring adaptivity as the actor's capabilities and failure modes evolve over time (ECHO; Li et al., 11 Jan 2026).

A common feature is that critics are often asymmetric: they have access to information unavailable to the actor (e.g., the gold answer, system state, templates, or expert paths), or they operate retrospectively, with the benefit of hindsight denied to the online agent (Zhang et al., 15 Nov 2025).
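
As a concrete illustration, here is a minimal sketch of a retrospective critic loop in the spirit of CriticSearch. The prompt wording, the `llm.complete` client, and the label parsing are assumptions for exposition, not the paper's implementation.

```python
def retrospective_critique(llm, trajectory, gold_answer):
    """Label every turn of a finished trajectory as Good or Bad,
    using hindsight access to the gold answer, which the actor
    never saw while acting."""
    labels = []
    for t, turn in enumerate(trajectory):
        prompt = (
            f"Gold answer: {gold_answer}\n"
            f"Turn {t}: action={turn['action']!r}, "
            f"observation={turn['observation']!r}\n"
            "Did this turn move the agent toward the gold answer? "
            "Answer with one word: Good or Bad."
        )
        reply = llm.complete(prompt)  # frozen, instruction-tuned critic LLM
        labels.append("Good" if "good" in reply.lower() else "Bad")
    return labels
```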

2. Mathematical Mechanisms for Credit Assignment

Modern critic agents supply high-granularity supervision that can be scalar, categorical, or natural-language, depending on the domain and methodology:

  • Dense, turn-level supervision: CriticSearch augments standard sparse outcome rewards with dense, per-turn signals generated by the critic LLM, labeling each action as Good or Bad ($\ell_{i,t} \in \{\text{Good}, \text{Bad}\}$); these labels map to normalized turn-level rewards and advantages (Zhang et al., 15 Nov 2025).
  • Hybrid advantage formulation: A weighted combination of the global trajectory reward and the fine-grained critic output achieves faster, more stable convergence, with the trade-off modulated by a coefficient $\alpha$ (Zhang et al., 15 Nov 2025); see the sketch after this list.
  • Template-based error localization: Table-Critic's critic uses a template-matching strategy to match error patterns and localize the first erroneous reasoning step, outputting a textual critique precisely targeted at the error locus (Yu et al., 17 Feb 2025).
  • Supervised categorical selection: RevAgent's critic agent, instruction-tuned via LoRA, classifies candidate review comments by issue category and correctness, learning from contrastive examples across categories (Li et al., 1 Nov 2025).
  • Advantage and gradient signal shaping: Multi-agent RL critics provide the baseline for actor updates by estimating centralized ($Q^{\text{c}}$), decentralized ($Q^{\text{h}}$), or permutation-invariant value functions, each with distinct bias-variance properties under partial observability (Lyu et al., 2024, Liu et al., 2019).
  • Natural-language and structured feedback: Language-critic agents in CGI and ECHO frameworks output structured, multi-criteria critiques (contribution, feasibility, efficiency, and revision) or free-form advice, which the actor subsequently incorporates for refinement (Yang et al., 20 Mar 2025, Li et al., 11 Jan 2026).
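
The following sketch illustrates how turn-level Good/Bad labels and a global outcome reward might be blended into a hybrid advantage with trade-off coefficient $\alpha$. The normalization and blending here are illustrative; CriticSearch's exact formulation may differ.

```python
import numpy as np

def hybrid_advantages(labels, outcome_reward, alpha=0.5):
    """Blend a sparse trajectory-level reward with dense, critic-derived
    turn rewards; alpha is the trade-off coefficient from the text."""
    # Map Good/Bad labels to +1/-1 turn rewards, then normalize per trajectory.
    turn = np.array([1.0 if l == "Good" else -1.0 for l in labels])
    turn = (turn - turn.mean()) / (turn.std() + 1e-8)
    # Broadcast the trajectory-level outcome to every turn and blend.
    return alpha * outcome_reward + (1.0 - alpha) * turn

# Example: a 4-turn trajectory that ultimately succeeded (outcome reward 1.0).
adv = hybrid_advantages(["Good", "Bad", "Good", "Good"], outcome_reward=1.0)
```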

3. Applications and Empirical Impact

Critic agents are deployed in a diverse set of domains and architectures:

| Domain | Critic Function | Performance Impact |
| --- | --- | --- |
| Multi-hop QA | Retrospective LLM critic for turn-level reward | +16.7% EM / +14.2% F1 relative to baseline (Zhang et al., 15 Nov 2025) |
| Multi-agent RL | Permutation-invariant, centralized/decentralized critics | 15–50% reward gain; 30× scalability (Liu et al., 2019) |
| Table reasoning | Template-matching LLM critic with error localization | +8.9% accuracy; lower degradation (Yu et al., 17 Feb 2025) |
| Feature engineering | Diagnoser/critic LLM for unsupervised, robust transforms | 3–5% accuracy gain; 1–2 orders of magnitude faster than search (Gong et al., 30 Apr 2025) |
| GUI agents | Step-level VLM critic (OS-Oracle) for error rejection | +3–5% success; surpasses proprietary VLMs on Mobile (Wu et al., 18 Dec 2025) |
| Regulated automation | Adversarial critic for insurance HITL vetting | Hallucination rate 11.3% → 3.8%; accuracy 92% → 96% (Roy et al., 21 Jan 2026) |
| Financial QA | LLM critic with confidence gating and arithmetic agent | Outperforms PoT and CoT; max accuracy 82.7% (Tan et al., 10 Jun 2025) |

Empirical findings consistently demonstrate that critic agents—especially those providing dense, actionable, or event-localized feedback—yield large gains in sample efficiency, stability, error-correction, and final task metrics, provided the reward shaping and learning signals are properly aligned with the actor's optimization process (Zhang et al., 15 Nov 2025, Li et al., 11 Jan 2026, Yang et al., 20 Mar 2025, Yu et al., 17 Feb 2025).

4. Theoretical Analysis and Bias-Variance Trade-offs

In multi-agent and RL contexts, the architectural choices for critics determine the learning dynamics:

  • Centralized (state-based) critics: Training on full system state can reduce per-update variance but introduces irreducible bias under partial observability; the gradient estimator is only unbiased if the critic conditions on exactly the same information as the actor (Lyu et al., 2024).
  • Decentralized (history-based) critics: These yield unbiased gradients but with higher sample variance and greater function approximation difficulty as history length increases (Lyu et al., 2024).
  • Permutation-invariant critics: Graph-pooling (PIC) critics avoid redundant learning by ensuring consistent value outputs under input permutation, enabling scalability to large numbers of agents $N$ (Liu et al., 2019).
  • Hybrid strategies: Practical systems often interleave global and local information, attention over agent subgroups, or encode belief-states to balance bias and variance (Lin et al., 2023, Lyu et al., 2024).

A major finding is that critic centralization and informativeness must be carefully balanced: richer observations can lower variance, but at the cost of potential bias when the critic conditions on state the actor cannot observe (as in Dec-POMDP settings) (Lyu et al., 2024).
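
The contrast between these critic signatures can be made concrete with a short PyTorch sketch. Layer sizes are arbitrary, and mean pooling stands in for PIC's graph pooling; this is an illustration of the conditioning choices, not any cited architecture.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Q^c(s, a): conditions on the full state and joint action.
    Lower-variance targets, but biased when actors act on partial views."""
    def __init__(self, state_dim, joint_action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))

class PermutationInvariantCritic(nn.Module):
    """Pooling per-agent encodings yields the same value under any
    agent ordering (mean pooling as a simple stand-in for graph pooling)."""
    def __init__(self, per_agent_dim, hidden=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(per_agent_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, agent_feats):  # (batch, n_agents, per_agent_dim)
        return self.head(self.encode(agent_feats).mean(dim=1))
```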

5. Implementation Paradigms and Training Protocols

Critic agents are trained, deployed, and evaluated through a variety of mechanisms:

  • Frozen vs. adaptive critics: Retrospective critics for fine-grained reward shaping are often frozen to avoid critic–policy instability, whereas open-world RL scenarios benefit from adaptive, co-evolving critics (ECHO) that keep feedback tailored as the actor's behavior changes (Zhang et al., 15 Nov 2025, Li et al., 11 Jan 2026).
  • Prompt-driven and LLM-based critics: Direct LLM prompts, sometimes with few-shot in-context exemplars, underpin critic actions in table reasoning, feature selection, code review, and GUI interaction contexts (Yu et al., 17 Feb 2025, Gong et al., 30 Apr 2025, Wu et al., 18 Dec 2025, Li et al., 1 Nov 2025).
  • Contrastive and retrieval-based training: Instruction-tuning critics on contrastive examples, as in RevAgent, helps the critic develop sharper category boundaries and selectivity (Li et al., 1 Nov 2025).
  • Consistency and reasoning-alignment objectives: Cross-entropy, LoRA fine-tuning, consistency-preserving GRPO, and reward shaping (e.g., saturation-aware gains, rationale-judgment alignment) are critical for ensuring that critics provide coherent, actionable signals (Wu et al., 18 Dec 2025, Li et al., 11 Jan 2026).
  • Fine-tuning and RL surrogates: Actor-critic policy updates leverage advantages (hybrid, turn-level, group-relative) that integrate critic outputs, often through PPO-style clipped objectives, KL constraints, or actor–critic co-training (Zhang et al., 15 Nov 2025, Obando-Ceron et al., 15 Oct 2025, Li et al., 11 Jan 2026); a minimal sketch of such a clipped update follows this list.
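
As a reference point, here is a minimal sketch of the standard PPO clipped surrogate with critic-supplied advantages. This is textbook PPO, not the exact objective of any system cited above.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages,
                     clip_eps=0.2, kl_coef=0.01):
    """Standard PPO clipped surrogate; `advantages` can be the hybrid,
    turn-level values supplied by a critic agent."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Simple first-order KL estimate, penalizing drift from the old policy.
    approx_kl = (logp_old - logp_new).mean()
    return policy_loss + kl_coef * approx_kl
```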

Notably, the "critic agent" concept generalizes beyond neural network value estimators to encompass prompt-driven, modular, and even human-in-the-loop discrimination and feedback systems (Gong et al., 30 Apr 2025, Yu et al., 17 Feb 2025).

6. Limitations, Open Questions, and Future Extensions

While critic agents confer numerous advantages, several challenges and extension paths remain:

  • Dependence on privileged information: Retrospective and template-matching critics often require access to gold answers or curated templates during training, limiting applicability to settings without such supervision (Zhang et al., 15 Nov 2025).
  • Scalability and computational cost: Complex critic models, especially those involving LLMs or group-batch evaluation, can exceed actors in memory and inference time, restricting deployment in real-time or resource-constrained settings (Li et al., 1 Nov 2025, Wu et al., 18 Dec 2025).
  • Learning and reasoning bottlenecks: Critics relying on in-context exemplars or static templates may be brittle to out-of-distribution failure patterns, motivating co-evolving adaptive critics or self-supervised error modeling (Li et al., 11 Jan 2026, Yu et al., 17 Feb 2025).
  • Integration with function approximation constraints: For continuous-control RL, stability relies on embedding design (e.g., simplicial embeddings) to avoid dead units and ensure robust gradient flow, suggesting that critic architecture must be tuned holistically alongside actor representations (Obando-Ceron et al., 15 Oct 2025); a minimal sketch of such an embedding follows this list.
  • Direct actor–critic synergy: Recent methods leverage actor self-critique, critic-conditioned refinement, and iterative synergy to amplify the learning signal and encourage strategic exploration, but the best protocols for stability and performance are still under study (Yang et al., 20 Mar 2025, Li et al., 11 Jan 2026, Zhang et al., 15 Nov 2025).
  • Generalization to new modalities and tasks: Extensions include adaptation to code execution agents, hierarchical and multi-agent critics, and modularized human–AI collaborative discrimination phases (Zhang et al., 15 Nov 2025, Gong et al., 30 Apr 2025).
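
For concreteness, here is a minimal PyTorch sketch of a simplicial embedding layer as mentioned above: features are projected into groups and each group passes through a softmax so it lies on a probability simplex, which keeps activations nonzero and discourages dead units. Group counts and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplicialEmbedding(nn.Module):
    """Project features into `n_groups` groups of `group_dim` logits and
    apply a softmax within each group, so every group lies on a simplex."""
    def __init__(self, in_dim, n_groups=16, group_dim=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, n_groups * group_dim)
        self.n_groups, self.group_dim = n_groups, group_dim

    def forward(self, x):  # (batch, in_dim)
        z = self.proj(x).view(-1, self.n_groups, self.group_dim)
        # Per-group softmax keeps every coordinate strictly positive.
        return F.softmax(z, dim=-1).flatten(1)
```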

Overall, critic agents define a critical foundation in the evolution of robust, interpretable, and sample-efficient agentic systems, with continuing innovation at the interfaces of RL, language modeling, multi-agent coordination, and domain-specific tool use (Zhang et al., 15 Nov 2025, Li et al., 11 Jan 2026, Lyu et al., 2024, Yu et al., 17 Feb 2025, Wu et al., 18 Dec 2025).
