Evaluating Critique Capabilities of LLMs: An Analysis of RealCritic Framework
The paper introduces RealCritic, a benchmark designed to evaluate the critique capabilities of LLMs, with a focus on their effectiveness in reasoning tasks. As LLMs demonstrate strong performance across many domains, assessing their ability to evaluate and improve solutions through critiques becomes essential. RealCritic addresses the difficulty of judging open-ended critiques by tying critique quality to its measurable effect on solution refinement: a closed-loop evaluation in which a critique is scored by how much it improves the corrected solution.
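To make the closed-loop idea concrete, here is a minimal Python sketch of how a single critique could be scored by its downstream effect. The callables `critique_fn`, `extract_answer`, and `answers_match` are hypothetical placeholders for an LLM call, an answer parser, and an equivalence check; they are illustrative assumptions, not RealCritic's actual interface.

```python
from typing import Callable

def closed_loop_score(
    critique_fn: Callable[[str, str], str],   # (problem, candidate solution) -> critique + correction
    extract_answer: Callable[[str], str],     # pulls the final answer out of the critic's response
    answers_match: Callable[[str, str], bool],# checks answer equivalence (e.g., exact or symbolic match)
    problem: str,
    candidate_solution: str,
    reference_answer: str,
) -> bool:
    """Score one critique by whether the corrected solution reaches the reference answer."""
    # 1. The critic analyzes the candidate solution and proposes a correction.
    critique = critique_fn(problem, candidate_solution)
    # 2. Recover the corrected final answer from the critic's output.
    corrected = extract_answer(critique)
    # 3. Credit the critique only if the corrected answer is right, so that
    #    critique quality is tied to its effect on the solution.
    return answers_match(corrected, reference_answer)
```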
The paper defines three evaluation paradigms: self-critique, cross-critique, and iterative critique. Self-critique has a model analyze and correct its own outputs, probing its introspective capabilities. Cross-critique evaluates a model's ability to critique outputs produced by other models, emphasizing adaptability and versatility. Iterative critique probes long-horizon reasoning by running multiple rounds of critique and correction.
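The three paradigms can be viewed as variations on one loop that differ only in whose solution is critiqued and how many rounds are run. The sketch below is an illustrative reading under that assumption, not the paper's implementation; `solve`, `other_solve`, and `critique_and_fix` stand in for model calls.

```python
from typing import Callable

def run_paradigm(
    paradigm: str,                                 # "self", "cross", or "iterative"
    solve: Callable[[str], str],                   # the evaluated model's own solver
    other_solve: Callable[[str], str],             # a different model's solver (cross-critique)
    critique_and_fix: Callable[[str, str], str],   # the evaluated model's critique + correction step
    problem: str,
    rounds: int = 3,                               # number of rounds for the iterative paradigm
) -> str:
    if paradigm == "self":
        # Self-critique: the model revises its own output once.
        return critique_and_fix(problem, solve(problem))
    if paradigm == "cross":
        # Cross-critique: the model revises another model's output once.
        return critique_and_fix(problem, other_solve(problem))
    if paradigm == "iterative":
        # Iterative critique: repeated critique-correction rounds on the evolving solution.
        solution = solve(problem)
        for _ in range(rounds):
            solution = critique_and_fix(problem, solution)
        return solution
    raise ValueError(f"unknown paradigm: {paradigm}")
```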
RealCritic is operationalized over eight challenging datasets covering open-ended mathematical reasoning and general-domain multiple-choice questions, including GSM8K and Olympiad Bench, which span a range of difficulties and topics. Notably, the benchmark supports multi-turn sessions, enabling analysis of how solutions evolve across successive critique iterations.
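One way to analyze a multi-turn session is to record accuracy after every critique round. The following sketch assumes a dataset of (problem, candidate solution, reference answer) triples and placeholder callables `critique_and_fix` and `is_correct`; it is a hedged illustration of the idea, not the benchmark's actual harness.

```python
from typing import Callable, Iterable, List, Tuple

def per_round_accuracy(
    dataset: Iterable[Tuple[str, str, str]],      # (problem, candidate solution, reference answer)
    critique_and_fix: Callable[[str, str], str],  # one critique + correction pass
    is_correct: Callable[[str, str], bool],       # does the corrected output match the reference?
    rounds: int = 3,
) -> List[float]:
    examples = list(dataset)
    # Start every session from the provided candidate solutions.
    solutions = [candidate for _, candidate, _ in examples]
    accuracies = []
    for _ in range(rounds):
        # Each round applies one critique-correction pass to every current solution.
        solutions = [
            critique_and_fix(problem, solution)
            for (problem, _, _), solution in zip(examples, solutions)
        ]
        correct = sum(
            is_correct(solution, answer)
            for (_, _, answer), solution in zip(examples, solutions)
        )
        accuracies.append(correct / len(examples))
    return accuracies
```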
The experimental results indicate that while most classical LLMs show only limited improvements from self-critique, and sometimes even degrade in performance, advanced reasoning-based models such as o1-mini achieve significant gains in critique scenarios. Self-critique performance improved by up to 25.85% on certain tasks, pointing to strong introspective capabilities. On cross-critique tasks, all models improve substantially over the baseline, with o1-mini leading: it achieves accuracy gains of up to 40% on certain tasks relative to the initial baseline outputs.
Overall, the paper presents a robust framework for evaluating LLMs' critique abilities, offering insight into models' error-detection and correction mechanisms. It underscores critique-based evaluation as a way to identify pathways for further enhancing LLM capabilities. Future work could leverage such closed-loop evaluations to refine architectures and training methodologies, potentially yielding models with stronger introspective and error-correction abilities. By focusing on critique effectiveness, RealCritic lays a foundation for developing more reliable, self-improving LLMs, which is crucial for pushing the boundaries of AI in complex reasoning domains.