
CriticBench: Comprehensive GQC Benchmark

Updated 23 February 2026
  • CriticBench is a benchmark framework assessing LLMs' abilities to generate, critique, and correct outputs, enabling systematic self-improvement.
  • It utilizes 3,825 evaluation instances across five domains with automated multi-stage scoring metrics such as accuracy and critique F₁ score.
  • Empirical findings reveal strong correlations among generation, critique, and correction skills and highlight the benefits of explicit critique training and inter-model evaluations.

CriticBench is a comprehensive benchmark framework for the systematic evaluation of large language models' (LLMs') abilities to generate, critique, and correct their outputs, collectively referred to as “GQC” reasoning. It is designed to quantify nuanced self-improvement and oversight skills in LLMs across multiple domains, task types, and model classes. CriticBench combines a diverse suite of tasks with rigorous automated evaluation protocols, and reveals foundational insights into the relationships among generation, critique, and correction abilities within and across LLM architectures (Lin et al., 2024).

1. Scope and Motivation

The central motivation for CriticBench derives from the increasing demand for LLMs not merely to generate plausible outputs, but to reliably audit, critique, and revise their own reasoning. This is critical for their deployment in high-stakes roles such as automated evaluation, feedback provision, and self-improvement pipelines. Previous benchmarks have either focused on narrow data slices or reported inconsistent findings regarding LLMs' metacognitive capacities. CriticBench addresses open research questions:

  • Which factors (model size, training regime, prompt format, oracle feedback) most influence GQC performance?
  • Are generation, critique, and correction skills correlated or disjoint?
  • How does task type mediate critique and correction efficacy?
  • Do internal “knowledge states” remain consistent across GQC stages?
  • How does inter-model critique compare to self-critique capability?

These questions are operationalized via explicit GQC definitions:

  • Generation (G): Produce an initial answer to a query, typically under chain-of-thought (CoT) prompting.
  • Critique (Q): Judge an answer as correct/incorrect in context.
  • Correction (C): Attempt to improve an answer, conditional on its critique and possibly additional feedback (Lin et al., 2024).
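These three stages can be sketched as a single pipeline. The prompt templates and the `stub` model below are illustrative assumptions for exposition, not the exact prompts or models used by CriticBench.

```python
def gqc_pipeline(question, model, feedback=None):
    """One Generate -> Critique -> Correct pass.

    `model` is any callable mapping a prompt string to a completion;
    the templates here are illustrative, not CriticBench's own.
    """
    # G: initial chain-of-thought answer
    answer = model(f"Question: {question}\nLet's think step by step.")
    # Q: binary judgment of the answer in context
    verdict = model(
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer correct? Reply 'correct' or 'incorrect'."
    )
    # C: revise conditional on the critique (plus optional oracle feedback)
    hint = f"\nFeedback: {feedback}" if feedback else ""
    revised = model(
        f"Question: {question}\nAnswer: {answer}\n"
        f"Critique: {verdict}{hint}\nProvide an improved answer."
    )
    return answer, verdict, revised

def stub(prompt):
    # Deterministic toy model for demonstration only.
    if "Is this answer correct" in prompt:
        return "incorrect"
    if "improved answer" in prompt:
        return "42"
    return "41"

print(gqc_pipeline("What is 6 * 7?", stub))  # → ('41', 'incorrect', '42')
```

In the oracle mode described below, `feedback` would carry ground-truth hints; with `feedback=None` the correction is conditioned on the model's own critique alone.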

2. Domain and Dataset Composition

CriticBench is constructed to ensure broad coverage in both logical and detail-oriented tasks. It draws 3,825 evaluation instances from 15 established datasets, partitioned into five principal domains:

| Domain | Example Datasets | Representative Task Types |
|---|---|---|
| Mathematical Reasoning | GSM8K, MATH, AQuA, TabMWP | Word problems, algebra, arithmetic, tables |
| Commonsense Reasoning | CommonsenseQA, AmbigNQ | Multiple-choice, ambiguity, fact synthesis |
| Symbolic Reasoning | Penguins, Colored Object | Table lookup, spatial, calendar arithmetic |
| Code Generation | MBPP, HumanEval | Pythonic synthesis, unit-tested solutions |
| Algorithmic | Object Counting, Repeat Copy | Patterned counting, string transformation |

The datasets are selected to require distinct forms of inference and attention to detail. For example, symbolic reasoning (Colored Object, Date) contrasts with detail-centric tasks (Object Counting), and coding benchmarks assess functional correctness and error correction (Lin et al., 2024).

3. Evaluation Methodologies

CriticBench introduces automated, multi-stage scoring procedures to rigorously quantify each aspect of GQC reasoning:

  • Accuracy (S_a): Measures the fraction of exactly correct outputs for both initial generations and corrections:

    S_a = \frac{c}{N}

    where c is the count of correct outputs out of N queries.

  • Critique F₁ Score (S_f):
    • Measures the harmonic mean of precision and recall for judging incorrect answers, mitigating class imbalance in error detection.
    • Definitions:

    S_p = \frac{1}{m}\sum_{i=1}^{m} q_i, \quad S_r = \frac{1}{n}\sum_{i=1}^{n} q_i, \quad S_f = \frac{2 S_p S_r}{S_p + S_r}

    where m is the number of answers flagged as wrong, n is the number of truly wrong answers, and q_i = 1 if the i-th flag is correct, else 0.

  • Protocol:

  1. Generate responses with greedy CoT.
  2. Critique with zero/few-shot or CoT prompts.
  3. Correct using critique-conditional prompts (including an “oracle” mode).
  4. Automatically judge critique labels and corrected answers.

This protocol provides a controlled, reproducible comparison of GQC performance for each evaluated system (Lin et al., 2024).
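The two metrics translate directly into code. A plausible implementation, assuming per-instance boolean labels (this is a sketch, not the benchmark's actual scoring code):

```python
def accuracy(correct_flags):
    """S_a = c / N: fraction of exactly correct outputs."""
    return sum(correct_flags) / len(correct_flags)

def critique_f1(truly_wrong, flagged_wrong):
    """S_f: harmonic mean of critique precision and recall.

    truly_wrong[i]   -- gold label: answer i is actually incorrect
    flagged_wrong[i] -- the critic judged answer i incorrect
    """
    m = sum(flagged_wrong)                 # answers flagged as wrong
    n = sum(truly_wrong)                   # answers truly wrong
    hits = sum(t and f for t, f in zip(truly_wrong, flagged_wrong))
    if m == 0 or n == 0 or hits == 0:
        return 0.0
    s_p, s_r = hits / m, hits / n          # precision S_p, recall S_r
    return 2 * s_p * s_r / (s_p + s_r)

# Toy example: 2 truly wrong answers; the critic finds one and raises one false alarm.
print(critique_f1([True, True, False, False], [True, False, True, False]))  # → 0.5
```

Scoring only the "wrong" class in S_f is what mitigates class imbalance: on a mostly correct answer set, a critic that flags nothing would still score high plain accuracy but zero F₁.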

4. Models and Experimental Design

CriticBench benchmarks 17 LLMs, spanning both closed- and open-source as well as models specifically trained for critique:

  • Closed-source: GPT-3.5-turbo, GPT-4 (RLHF-aligned)

  • Open-source: Phi-2 (2.7B), LLaMA-2 (7B, 13B, 70B), Vicuna (7B–33B), Mistral (7B), Mixtral-8×7B (SIFT and BASE)

  • Critique-supervised: Auto-J-13B, UltraCM-13B

Evaluation includes diverse prompting formats (zero-shot, CoT, four-shot, answer-only vs. rationale), and incorporates inter-model critique, resulting in a 17 × 17 matrix in which each model critiques every other model's outputs, as well as its own (Lin et al., 2024).
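The 17 × 17 inter-model evaluation can be sketched as a nested comprehension; the model names and the `score` callable below are placeholders, with `score` standing in for the critique F₁ routine.

```python
def critique_matrix(models, answers, score):
    """Grid of critique scores: row = critic model, column = generator model.

    answers[gen]           -- outputs previously produced by model `gen`
    score(critic, outputs) -- placeholder scoring routine (e.g. critique F1)
    """
    return {critic: {gen: score(critic, answers[gen]) for gen in models}
            for critic in models}

# Toy run with two placeholder models and a dummy scorer.
models = ["model-a", "model-b"]
answers = {"model-a": ["out1", "out2"], "model-b": ["out3"]}
dummy = lambda critic, outs: len(outs)  # stands in for a real critique score
matrix = critique_matrix(models, answers, dummy)
print(matrix["model-a"]["model-b"])  # → 1
```

The diagonal of this grid is self-critique; off-diagonal cells give the inter-model comparisons analyzed in the findings below.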

5. Key Empirical Findings

CriticBench yields several fundamental and sometimes unexpected empirical results:

  • Linear GQC Correlation: Generation (G), Critique (Q), and Correction (C) scores are strongly and nearly linearly correlated across most models. Notably, a slope of approximately 1 for Q versus G indicates critique abilities are inherited alongside basic generation, even absent explicit critique supervision.

  • Effects of Critique Training: Models explicitly trained for critique (Auto-J, UltraCM-13B) significantly outperform standard models in error-flagging (Q), exceeding GPT-3.5 by more than 10 F₁ points.

  • Task Dependency: Correction effectiveness is highly task-dependent. Logic-dominated tasks (symbolic, code) are readily corrected—corrections often surpass the original generations. In contrast, detail-focused tasks (object counting, date arithmetic) show negligible correction gains, even with high G performance.

  • GQC Incoherence: Venn analysis reveals “Q only” regions—cases where the model knows an answer is wrong but cannot generate or fix it. The proportion of such incoherent cases diminishes as model size increases, suggesting partial convergence toward unified knowledge states.

  • Inter-Model Critique Dynamics: Stronger models most reliably flag errors in weaker models’ outputs. However, some weaker models (notably Vicuna-7B) can, counterintuitively, critique stronger models' responses better than those models can critique themselves. Critique-tuned models generalize their performance benefits to judging outputs from other models (Lin et al., 2024).
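The Venn-style incoherence analysis can be reproduced from per-instance outcomes. A minimal sketch, assuming each instance records whether the G, Q, and C stages succeeded (a hypothetical data layout, not the paper's):

```python
def q_only_fraction(records):
    """Fraction of 'Q only' instances: the critique correctly flags the
    answer as wrong, yet generation and correction both fail.

    Each record is a (g_ok, q_ok, c_ok) triple of booleans.
    """
    q_only = sum(1 for g, q, c in records if q and not g and not c)
    return q_only / len(records)

# Toy outcomes for four instances.
outcomes = [
    (False, True, False),   # knew it was wrong, couldn't fix it -> Q only
    (True,  True, True),    # everything succeeded
    (False, False, False),  # nothing succeeded
    (False, True, True),    # critique enabled a successful fix
]
print(q_only_fraction(outcomes))  # → 0.25
```

Tracking this fraction across model sizes is what reveals the convergence toward unified knowledge states reported above.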

6. Implications, Limitations, and Extensions

The design and findings of CriticBench have broad implications for LLM research and deployment:

  • Significance for LLM Ecosystems: The GQC framing and evaluation suite promote the development of models capable of not just production, but introspective self-auditing and refinement—necessary for robust real-world assistants (code assistants, fact-checkers, tutoring systems).

  • Benchmark Limitations:

    • The use of binary critique labels may obscure partial or severity-graded error detection.
    • Leveraging GPT-4 and human review for gold labels introduces potential bias.
    • Coverage is restricted to five main reasoning domains without open-ended language tasks or multi-modal inputs.
  • Directions for Future Work:
    • Develop fine-grained critique metrics (e.g., error localization, severity).
    • Expand to additional modalities (dialogue, summarization, multi-modal reasoning).
    • Investigate multi-step, iterative GQC loops for compounding self-improvement (Lin et al., 2024).

A plausible implication is that comprehensive critique and self-correction skills are emergent properties of increasing scale and explicit supervision, but that certain detailed or low-level procedural errors remain resistant to correction even in large models.

7. Relationship to Other Critique Benchmarks

CriticBench represents a significant advance beyond prior work by providing the first large-scale, multi-stage benchmark for GQC reasoning across diverse and challenging datasets. It is complementary to, and precedes, code-specific critique suites (e.g., CodeCriticBench (Zhang et al., 23 Feb 2025)) and multi-modal critique evaluations (MM-CRITIC (Zeng et al., 12 Nov 2025)). Whereas previous efforts typically focused on either critique accuracy for limited domains or a single stage of the reasoning process, CriticBench uniquely facilitates detailed analysis of the interplay and independence of LLM generation, critique, and correction behaviors.

Subsequent benchmarks and toolkits have built on CriticBench's methodological innovations—explicitly evaluating critique across dimensions (basic, correction, comparison, meta), employing reference-anchored human validation, and extending to new modalities and agent settings (Lan et al., 2024, Zheng et al., 2024).
