RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

Published 24 Jan 2025 in cs.CL, cs.AI, and cs.LG | (2501.14492v1)

Abstract: Critiques are important for enhancing the performance of LLMs, enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at https://github.com/tangzhy/RealCritic.


Authors (11)

Summary

The paper presents RealCritic, a benchmark that measures the quality of an LLM's critique by how much it improves the resulting solution, across self-critique, cross-critique, and iterative critique settings.
It employs eight challenging datasets, including GSM8K and Olympiad Bench, to rigorously assess models' introspective and correction capabilities.
Experimental findings reveal up to 25.85% improvement in self-critique and a 40% gain in cross-critique accuracy for advanced reasoning models.

Evaluating Critique Capabilities of LLMs: An Analysis of the RealCritic Framework

The paper introduces RealCritic, a benchmark designed to evaluate the critique capabilities of LLMs, with a focus on their effectiveness in reasoning tasks. As LLMs demonstrate notable performance across various domains, assessing their ability to self-evaluate and improve through critiques becomes essential. The RealCritic framework addresses the challenge of evaluating open-ended critique tasks by linking critique quality to its impact on solution refinement, employing a closed-loop evaluation methodology.
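To make the closed-loop idea concrete, here is a minimal sketch in which a critique is never graded as text; only the corrected answer it leads to is compared with the reference. Everything named here (`CritiqueResult`, `closed_loop_accuracy`, `toy_critic`) is hypothetical and not taken from the RealCritic codebase, and exact-match comparison is an assumed scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CritiqueResult:
    critique: str          # free-form critique text (not scored directly)
    corrected_answer: str  # final answer produced after applying the critique


# Hypothetical critic interface: (question, candidate_solution) -> CritiqueResult
Critic = Callable[[str, str], CritiqueResult]


def closed_loop_accuracy(
    critic: Critic,
    questions: Sequence[str],
    candidates: Sequence[str],
    references: Sequence[str],
) -> float:
    """Score critiques by the correctness of the corrections they produce."""
    if not questions:
        raise ValueError("need at least one item")
    correct = 0
    for question, candidate, reference in zip(questions, candidates, references):
        result = critic(question, candidate)
        # Assumed scoring rule: exact match on the corrected final answer.
        correct += result.corrected_answer.strip() == reference.strip()
    return correct / len(questions)


def toy_critic(question: str, candidate: str) -> CritiqueResult:
    # Stand-in for an LLM call: re-derives the answer to "a+b" questions.
    a, b = question.split("+")
    fixed = str(int(a) + int(b))
    verdict = "correct" if candidate.strip() == fixed else "incorrect"
    return CritiqueResult(f"The candidate is {verdict}.", fixed)


score = closed_loop_accuracy(toy_critic, ["1+1", "2+3"], ["3", "5"], ["2", "5"])
```

An open-loop evaluation would instead judge the `critique` string itself, for example against a human-written critique; the sketch ignores that field entirely, which is the distinction the paragraph above draws.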

The study highlights three evaluation paradigms: self-critique, cross-critique, and iterative critique. In self-critique, a model analyzes and corrects its own outputs, which gives insight into its introspective capabilities. In cross-critique, a model critiques outputs produced by other models, which tests its adaptability and versatility. Iterative critique probes long-horizon reasoning by allowing multiple rounds of critique and correction.
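The three paradigms can be read as variations on one loop, as the following sketch suggests. It assumes a hypothetical `Reviser` interface that bundles critique and correction into a single call; the function names and the toy task are illustrative and are not the benchmark's actual implementation.

```python
from typing import Callable, List

# Hypothetical interface: (question, previous_solution) -> revised solution.
# A real system would call an LLM to critique and then correct.
Reviser = Callable[[str, str], str]


def iterative_critique(
    reviser: Reviser, question: str, initial_solution: str, rounds: int
) -> List[str]:
    """Return the solution after each round; element 0 is the untouched input."""
    history = [initial_solution]
    for _ in range(rounds):
        history.append(reviser(question, history[-1]))
    return history


def toy_reviser(question: str, previous: str) -> str:
    # Toy stand-in: the "question" is a target integer, and each round
    # moves the answer one step closer to it.
    target, current = int(question), int(previous)
    if current == target:
        return previous
    return str(current + (1 if current < target else -1))


# Self-critique: initial_solution comes from the same model as the reviser.
# Cross-critique: initial_solution comes from a different model.
# Iterative critique: rounds > 1.
history = iterative_critique(toy_reviser, "5", "3", rounds=3)
```

Keeping the full history rather than only the last answer makes it possible to see whether later rounds help, stall, or undo earlier corrections.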

RealCritic is operationalized using eight challenging datasets covering open-ended mathematical reasoning and general-domain multiple-choice questions. These include GSM8K and Olympiad Bench, and together they span a range of difficulties and topics. The benchmark also supports multi-turn sessions, so critique quality can be tracked across successive iterations.

The experimental results indicate that most classical LLMs show limited improvement in self-critique and sometimes degrade in performance, whereas advanced reasoning-based models such as o1-mini demonstrate significant gains across critique scenarios. Self-critique performance improved by up to 25.85% on certain tasks, pointing to meaningful introspective capability. On cross-critique tasks, all models exhibit substantial improvements over the baseline, with o1-mini leading by achieving up to a 40% accuracy gain on certain tasks compared to initially random outputs.
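As a rough illustration of how such gains and degradations can be expressed, the sketch below computes the change in accuracy, in percentage points, between answers before and after critique. The data are invented for the example, the helper names are hypothetical, and the paper's exact metric definition is not reproduced here.

```python
from typing import Sequence


def accuracy(answers: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of answers that exactly match their reference."""
    if not references or len(answers) != len(references):
        raise ValueError("answers and references must be non-empty and aligned")
    return sum(a == r for a, r in zip(answers, references)) / len(references)


def critique_delta(
    before: Sequence[str], after: Sequence[str], references: Sequence[str]
) -> float:
    """Accuracy change in percentage points; negative means critique hurt."""
    return 100.0 * (accuracy(after, references) - accuracy(before, references))


# Invented toy data: four multiple-choice items.
references = ["A", "B", "C", "D"]
before = ["A", "B", "X", "X"]    # 2 of 4 correct before critique
helpful = ["A", "B", "C", "X"]   # critique fixes one error
harmful = ["A", "X", "X", "X"]   # critique breaks a correct answer

gain = critique_delta(before, helpful, references)
loss = critique_delta(before, harmful, references)
```

A negative delta corresponds to the case described above in which a model underperforms its own baseline after critiquing.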

Overall, the paper presents a robust framework for evaluating LLMs' critique abilities, providing valuable insights into models' error-detection and correction mechanisms. The study underscores the importance of critique-based evaluation as a means to identify pathways for further enhancing the capabilities of LLMs. Future advancements could leverage insights from such closed-loop evaluations to refine architectures and training methodologies, potentially leading to models with stronger introspective and error-correction capabilities. By focusing on critique effectiveness, RealCritic lays the foundation for developing more reliable and self-improving LLMs, crucial for pushing the boundaries of AI in complex reasoning domains.
