
CriticBench: Comprehensive GQC Benchmark

Updated 23 February 2026
  • CriticBench is a benchmark framework assessing LLMs' abilities to generate, critique, and correct outputs, enabling systematic self-improvement.
  • It utilizes 3,825 evaluation instances across five domains with automated multi-stage scoring metrics such as accuracy and critique F₁ score.
  • Empirical findings reveal strong correlations among generation, critique, and correction skills and highlight the benefits of explicit critique training and inter-model evaluations.

CriticBench is a comprehensive benchmark framework for the systematic evaluation of large language models' (LLMs') abilities to generate, critique, and correct their outputs, collectively referred to as “GQC” reasoning. It is designed to quantify nuanced self-improvement and oversight skills in LLMs across multiple domains, task types, and model classes. CriticBench combines a diverse suite of tasks with rigorous automated evaluation protocols, and reveals foundational insights into the relationships among generation, critique, and correction abilities within and across LLM architectures (Lin et al., 2024).

1. Scope and Motivation

The central motivation for CriticBench derives from the increasing demand for LLMs not merely to generate plausible outputs, but to reliably audit, critique, and revise their own reasoning. This is critical for their deployment in high-stakes roles such as automated evaluation, feedback provision, and self-improvement pipelines. Previous benchmarks have either focused on narrow data slices or reported inconsistent findings regarding LLMs' metacognitive capacities. CriticBench addresses open research questions:

  • Which factors (model size, training regime, prompt format, oracle feedback) most influence GQC performance?
  • Are generation, critique, and correction skills correlated or disjoint?
  • How does task type mediate critique and correction efficacy?
  • Do internal “knowledge states” remain consistent across GQC stages?
  • How does inter-model critique compare to self-critique capability?

These questions are operationalized via explicit GQC definitions:

  • Generation (G): Produce an initial answer to a query, typically under chain-of-thought (CoT) prompting.
  • Critique (Q): Judge an answer as correct/incorrect in context.
  • Correction (C): Attempt to improve an answer, conditional on its critique and possibly additional feedback (Lin et al., 2024).
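These three stages can be sketched as a single pipeline. The prompt templates and the `stub` model below are illustrative assumptions for exposition, not the exact prompts or models used by CriticBench.

```python
def gqc_pipeline(question, model, feedback=None):
    """One Generate -> Critique -> Correct pass.

    `model` is any callable mapping a prompt string to a completion;
    the templates here are illustrative, not CriticBench's own.
    """
    # G: initial chain-of-thought answer
    answer = model(f"Question: {question}\nLet's think step by step.")
    # Q: binary judgment of the answer in context
    verdict = model(
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer correct? Reply 'correct' or 'incorrect'."
    )
    # C: revise conditional on the critique (plus optional oracle feedback)
    hint = f"\nFeedback: {feedback}" if feedback else ""
    revised = model(
        f"Question: {question}\nAnswer: {answer}\n"
        f"Critique: {verdict}{hint}\nProvide an improved answer."
    )
    return answer, verdict, revised

def stub(prompt):
    # Deterministic toy model for demonstration only.
    if "Is this answer correct" in prompt:
        return "incorrect"
    if "improved answer" in prompt:
        return "42"
    return "41"

print(gqc_pipeline("What is 6 * 7?", stub))  # → ('41', 'incorrect', '42')
```

In the oracle mode described below, `feedback` would carry ground-truth hints; with `feedback=None` the correction is conditioned on the model's own critique alone.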

2. Domain and Dataset Composition

CriticBench is constructed to ensure broad coverage in both logical and detail-oriented tasks. It draws 3,825 evaluation instances from 15 established datasets, partitioned into five principal domains:

| Domain | Example Datasets | Representative Task Types |
|---|---|---|
| Mathematical Reasoning | GSM8K, MATH, AQuA, TabMWP | Word problems, algebra, arithmetic, tables |
| Commonsense Reasoning | CommonsenseQA, AmbigNQ | Multiple-choice, ambiguity, fact synthesis |
| Symbolic Reasoning | Penguins, Colored Object | Table lookup, spatial, calendar arithmetic |
| Code Generation | MBPP, HumanEval | Pythonic synthesis, unit-tested solutions |
| Algorithmic | Object Counting, Repeat Copy | Patterned counting, string transformation |

The datasets are selected to require distinct forms of inference and attention to detail. For example, symbolic reasoning (Colored Object, Date) contrasts with detail-centric tasks (Object Counting), and coding benchmarks assess functional correctness and error correction (Lin et al., 2024).

3. Evaluation Methodologies

CriticBench introduces automated, multi-stage scoring procedures to rigorously quantify each aspect of GQC reasoning:

  • Accuracy (S_a): Measures the fraction of exactly correct outputs for both initial generations and corrections:

    S_a = \frac{c}{N}

    where c is the count of correct outputs out of N queries.

  • Critique F₁ Score (S_f):
    • Measures the harmonic mean of precision and recall for judging incorrect answers, mitigating class imbalance in error detection.
    • Definitions:

    S_p = \frac{1}{m}\sum_{i=1}^{m} q_i, \quad S_r = \frac{1}{n}\sum_{i=1}^{n} q_i, \quad S_f = \frac{2 S_p S_r}{S_p + S_r}

    where m is the number of answers flagged as wrong, n is the number of truly wrong answers, and q_i = 1 if the i-th flag is correct, else 0.

  • Protocol:

  1. Generate responses with greedy CoT.
  2. Critique with zero/few-shot or CoT prompts.
  3. Correct using critique-conditional prompts (including an “oracle” mode).
  4. Automatically judge critique labels and corrected answers.

This protocol provides a controlled, reproducible comparison of GQC performance for each evaluated system (Lin et al., 2024).
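The two metrics translate directly into code. A plausible implementation, assuming per-instance boolean labels (this is a sketch, not the benchmark's actual scoring code):

```python
def accuracy(correct_flags):
    """S_a = c / N: fraction of exactly correct outputs."""
    return sum(correct_flags) / len(correct_flags)

def critique_f1(truly_wrong, flagged_wrong):
    """S_f: harmonic mean of critique precision and recall.

    truly_wrong[i]   -- gold label: answer i is actually incorrect
    flagged_wrong[i] -- the critic judged answer i incorrect
    """
    m = sum(flagged_wrong)                 # answers flagged as wrong
    n = sum(truly_wrong)                   # answers truly wrong
    hits = sum(t and f for t, f in zip(truly_wrong, flagged_wrong))
    if m == 0 or n == 0 or hits == 0:
        return 0.0
    s_p, s_r = hits / m, hits / n          # precision S_p, recall S_r
    return 2 * s_p * s_r / (s_p + s_r)

# Toy example: 2 truly wrong answers; the critic finds one and raises one false alarm.
print(critique_f1([True, True, False, False], [True, False, True, False]))  # → 0.5
```

Scoring only the "wrong" class in S_f is what mitigates class imbalance: on a mostly correct answer set, a critic that flags nothing would still score high plain accuracy but zero F₁.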

4. Models and Experimental Design

CriticBench benchmarks 17 LLMs, spanning both closed- and open-source as well as models specifically trained for critique:

  • Closed-source: GPT-3.5-turbo, GPT-4 (RLHF-aligned)

  • Open-source: Phi-2 (2.7B), LLaMA-2 (7B, 13B, 70B), Vicuna (7B–33B), Mistral (7B), Mixtral-8×7B (SIFT and BASE)

  • Critique-supervised: Auto-J-13B, UltraCM-13B

Evaluation includes diverse prompting formats (zero-shot, CoT, four-shot, answer-only vs. rationale), and incorporates inter-model critique, resulting in a 17 × 17 matrix in which each model critiques every other model's outputs, as well as its own (Lin et al., 2024).
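The 17 × 17 inter-model evaluation can be sketched as a nested comprehension; the model names and the `score` callable below are placeholders, with `score` standing in for the critique F₁ routine.

```python
def critique_matrix(models, answers, score):
    """Grid of critique scores: row = critic model, column = generator model.

    answers[gen]           -- outputs previously produced by model `gen`
    score(critic, outputs) -- placeholder scoring routine (e.g. critique F1)
    """
    return {critic: {gen: score(critic, answers[gen]) for gen in models}
            for critic in models}

# Toy run with two placeholder models and a dummy scorer.
models = ["model-a", "model-b"]
answers = {"model-a": ["out1", "out2"], "model-b": ["out3"]}
dummy = lambda critic, outs: len(outs)  # stands in for a real critique score
matrix = critique_matrix(models, answers, dummy)
print(matrix["model-a"]["model-b"])  # → 1
```

The diagonal of this grid is self-critique; off-diagonal cells give the inter-model comparisons analyzed in the findings below.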

5. Key Empirical Findings

CriticBench yields several fundamental and sometimes unexpected empirical results:

  • Linear GQC Correlation: Generation (G), Critique (Q), and Correction (C) scores are strongly and nearly linearly correlated across most models. Notably, a slope of approximately 1 for Q versus G indicates critique abilities are inherited alongside basic generation, even absent explicit critique supervision.

  • Effects of Critique Training: Models explicitly trained for critique (Auto-J, UltraCM-13B) significantly outperform standard models in error-flagging (Q), exceeding GPT-3.5 by more than 10 F₁ points.

  • Task Dependency: Correction effectiveness is highly task-dependent. Logic-dominated tasks (symbolic, code) are readily corrected—corrections often surpass the original generations. In contrast, detail-focused tasks (object counting, date arithmetic) show negligible correction gains, even with high G performance.

  • GQC Incoherence: Venn analysis reveals “Q only” regions—cases where the model knows an answer is wrong but cannot generate or fix it. The proportion of such incoherent cases diminishes as model size increases, suggesting partial convergence toward unified knowledge states.

  • Inter-Model Critique Dynamics: Stronger models most reliably flag errors in weaker models’ outputs. However, some weaker models (notably Vicuna-7B) can, counterintuitively, critique stronger models' responses better than those models can critique themselves. Critique-tuned models generalize their performance benefits to judging outputs from other models (Lin et al., 2024).
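The Venn-style incoherence analysis can be reproduced from per-instance outcomes. A minimal sketch, assuming each instance records whether the G, Q, and C stages succeeded (a hypothetical data layout, not the paper's):

```python
def q_only_fraction(records):
    """Fraction of 'Q only' instances: the critique correctly flags the
    answer as wrong, yet generation and correction both fail.

    Each record is a (g_ok, q_ok, c_ok) triple of booleans.
    """
    q_only = sum(1 for g, q, c in records if q and not g and not c)
    return q_only / len(records)

# Toy outcomes for four instances.
outcomes = [
    (False, True, False),   # knew it was wrong, couldn't fix it -> Q only
    (True,  True, True),    # everything succeeded
    (False, False, False),  # nothing succeeded
    (False, True, True),    # critique enabled a successful fix
]
print(q_only_fraction(outcomes))  # → 0.25
```

Tracking this fraction across model sizes is what reveals the convergence toward unified knowledge states reported above.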

6. Implications, Limitations, and Extensions

The design and findings of CriticBench have broad implications for LLM research and deployment:

  • Significance for LLM Ecosystems: The GQC framing and evaluation suite promote the development of models capable of not just production, but introspective self-auditing and refinement—necessary for robust real-world assistants (code assistants, fact-checkers, tutoring systems).

  • Benchmark Limitations:

    • The use of binary critique labels may obscure partial or severity-graded error detection.
    • Leveraging GPT-4 and human review for gold labels introduces potential bias.
    • Coverage is restricted to five main reasoning domains without open-ended language tasks or multi-modal inputs.
  • Directions for Future Work:
    • Develop fine-grained critique metrics (e.g., error localization, severity).
    • Expand to additional modalities (dialogue, summarization, multi-modal reasoning).
    • Investigate multi-step, iterative GQC loops for compounding self-improvement (Lin et al., 2024).

A plausible implication is that comprehensive critique and self-correction skills are emergent properties of increasing scale and explicit supervision, but that certain detailed or low-level procedural errors remain resistant to correction even in large models.

7. Relationship to Other Critique Benchmarks

CriticBench represents a significant advance beyond prior work by providing the first large-scale, multi-stage benchmark for GQC reasoning across diverse and challenging datasets. It is complementary to, and precedes, code-specific critique suites (e.g., CodeCriticBench (Zhang et al., 23 Feb 2025)) and multi-modal critique evaluations (MM-CRITIC (Zeng et al., 12 Nov 2025)). Whereas previous efforts typically focused on either critique accuracy for limited domains or a single stage of the reasoning process, CriticBench uniquely facilitates detailed analysis of the interplay and independence of LLM generation, critique, and correction behaviors.

Subsequent benchmarks and toolkits have built on CriticBench's methodological innovations—explicitly evaluating critique across dimensions (basic, correction, comparison, meta), employing reference-anchored human validation, and extending to new modalities and agent settings (Lan et al., 2024, Zheng et al., 2024).
