Enabling Scalable Oversight via Self-Evolving Critic
The rapid development of LLMs poses a significant challenge for scalable oversight: providing effective feedback on tasks that are difficult for humans to evaluate or on which LLMs already surpass human performance. The paper "Enabling Scalable Oversight via Self-Evolving Critic" introduces SCRIT (Self-evolving CRITic), a framework that addresses this challenge by improving critique capabilities through self-evolution, without relying on external supervision from humans or stronger models.
SCRIT builds its critique abilities from synthetic data produced by a contrastive self-critic: the model studies a reference solution and then critiques a step-by-step solution, and a self-validation mechanism checks the quality of the generated critiques. Implemented on Qwen2.5-72B-Instruct, SCRIT yields consistent gains on critique-correction and error-identification benchmarks, with improvements of up to 10.3% across scenarios. The analysis also confirms that SCRIT's performance scales positively with both data and model size, and that it outperforms alternative approaches.
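To see how the pieces fit together, here is a minimal, abstract sketch of one self-evolution pass; the callables `build_training_set` and `finetune` are hypothetical placeholders standing in for the data-generation and fine-tuning stages, not the paper's implementation.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]   # (critique prompt, validated critique)
LLM = Callable[[str], str]  # any text-in / text-out model wrapper


def self_evolution_pass(
    model: LLM,
    build_training_set: Callable[[LLM], List[Example]],
    finetune: Callable[[LLM, List[Example]], LLM],
) -> LLM:
    """One self-evolution pass: the current model critiques solutions,
    validated critiques become training data, and the model is fine-tuned
    on that data -- no human labels or stronger teacher model involved."""
    training_data = build_training_set(model)  # contrastive critique + self-validation
    return finetune(model, training_data)      # supervised fine-tuning on its own critiques
```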
The methodology involves two principal steps: contrastive critique and self-validation. In the contrastive critique step, the model is given a reference solution alongside the solution to be critiqued, grounding it in the mathematical reasoning needed to critique effectively. The self-validation step then checks that each critique leads to a mathematically valid correction, maintaining a high level of internal consistency.
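To make these two steps concrete, the sketch below assumes a generic text-in/text-out `llm` callable (for example, a thin wrapper around Qwen2.5-72B-Instruct); the prompt wording, the 'Final answer:' convention, and the string-match check are simplified assumptions for illustration, not the paper's exact templates.

```python
from typing import Callable, Optional

# Sketch of the two-step pipeline: (1) contrastive critique, (2) self-validation.

CONTRASTIVE_PROMPT = """You are given a math problem, a reference solution,
and a student's step-by-step solution. Study the reference solution first,
then critique the student's solution step by step, identify the first
incorrect step (if any), and end with a corrected solution whose last line
is 'Final answer: <answer>'.

Problem:
{problem}

Reference solution:
{reference}

Student solution:
{solution}
"""


def contrastive_critique(llm: Callable[[str], str],
                         problem: str, reference: str, solution: str) -> str:
    """Step 1: critique the solution while conditioning on the reference."""
    return llm(CONTRASTIVE_PROMPT.format(problem=problem,
                                         reference=reference,
                                         solution=solution))


def extract_final_answer(text: str) -> Optional[str]:
    """Take whatever follows the last 'Final answer:' marker, if present."""
    marker = "Final answer:"
    if marker not in text:
        return None
    return text.rsplit(marker, 1)[1].strip()


def self_validate(critique: str, reference_answer: str) -> bool:
    """Step 2: accept a critique only if its correction reaches the
    reference answer (exact string match as a stand-in for the real check)."""
    corrected = extract_final_answer(critique)
    return corrected is not None and corrected == reference_answer
```

Only critiques that pass `self_validate` would be kept as training examples; a real implementation would need more robust answer normalization than exact string matching.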
The evaluation results are particularly noteworthy on mathematical reasoning tasks across a wide array of benchmarks, including GSM8K, MATH, and ProcessBench. On deliberately incorrect solutions, SCRIT raises accuracy from 39.7% to 50.0%, with similar gains in the other test scenarios. When the critique task requires not only correction but also error identification, SCRIT lifts the average F1 score from 37.8% to 45.0%.
A significant insight from this work is SCRIT's scaling behavior: performance improves steadily as the amount of training data and the model size grow. This scaling capability is crucial, since it suggests the framework can adapt to increasingly complex data and tasks, a key requirement for achieving scalable oversight.
Furthermore, the paper investigates alternative critic mechanisms in detail. Controlled experiments show that the contrastive critic is the most effective, avoiding pitfalls such as the rubber-stamping behavior observed with direct critics and bug-injection critics. The choice of contrastive critique is further supported by its superior performance and its continued potential for improvement as more training data becomes available.
Self-validation is equally crucial for maintaining the quality of the training data. By filtering out ineffective critiques, it improves the efficacy of the training process, as evidenced by the clear performance degradation observed when self-validation is removed in ablation experiments.
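A minimal sketch of such a filter is shown below; the field names and the exact-match answer comparison are illustrative assumptions rather than the paper's actual filtering logic.

```python
from typing import Callable, Dict, List


def build_sft_dataset(candidates: List[Dict[str, str]],
                      final_answer_of: Callable[[str], str]) -> List[Dict[str, str]]:
    """Keep only critiques whose correction reproduces the reference answer;
    everything else is dropped as an ineffective critique."""
    kept = []
    for ex in candidates:
        corrected = final_answer_of(ex["critique"])        # answer implied by the critique's correction
        if corrected == ex["reference_answer"]:            # self-validation passes
            kept.append({"prompt": ex["critique_prompt"],  # model input during fine-tuning
                         "response": ex["critique"]})      # fine-tuning target
    return kept
```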
The implications of this work extend beyond mathematical reasoning. The SCRIT framework holds potential for domains such as coding or logical reasoning, where ground truth can be objectively verified. Moreover, its self-validation mechanism opens a path toward integration with reinforcement learning, using critique corrections as verifiable rewards to drive further optimization.
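As a rough illustration of that direction, a binary verifiable reward could be sketched as follows; this formulation is a hypothetical extension, not something specified in the paper.

```python
from typing import Callable, Optional


def critique_reward(critique: str,
                    reference_answer: str,
                    extract_answer: Callable[[str], Optional[str]]) -> float:
    """Binary verifiable reward: 1.0 if the correction embedded in the
    critique reaches the ground-truth answer, else 0.0."""
    corrected = extract_answer(critique)
    return 1.0 if corrected == reference_answer else 0.0
```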
This research enriches the toolkit for building LLMs with scalable oversight capabilities, with an emphasis on self-sufficiency. The insights and methods outlined in the paper not only advance LLM critique capabilities but also point to promising directions for future research in AI safety and reliability.