LLM Critics Help Catch LLM Bugs (2407.00215v1)

Published 28 Jun 2024 in cs.SE and cs.LG

Abstract: Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains "critic" models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.

An Examination of LLM Critics for Automating Bug Detection in Code

The paper explores using critic models, themselves LLMs, to help humans identify errors in model-written code more accurately, with a particular focus on the limitations of Reinforcement Learning from Human Feedback (RLHF). The core observation is that while RLHF is the predominant training method, it is inherently restricted by human evaluative capacity as model capabilities increase. To address this, the authors train LLMs that critique model outputs, enhancing humans' ability to perform precise evaluations.

Key Contributions and Findings

  1. Critique Model Training and Performance: The paper trains a model, termed CriticGPT, with RLHF to perform real-world code critique tasks. On code containing naturally occurring LLM errors, CriticGPT's critiques are preferred over human-written critiques in 63% of evaluated cases, and the LLM critics catch more bugs than human contractors paid for code review.
  2. Empirical Evaluation Against Human Critique: In a structured comparison in which contractors insert bugs into code and then rate model- and human-generated critiques, the LLM critiques surpass their human counterparts in both rater preference and bug identification.
  3. Trade-off Between Comprehensiveness and Hallucinations: The paper examines the trade-off between the breadth of critiques (comprehensiveness in identifying all real problems) and their precision (avoiding spurious, 'hallucinated' bugs). The authors propose Force Sampling Beam Search (FSBS) to navigate this trade-off: it manages critique length and selection to keep critiques relevant while maximizing the number of genuine problems identified (a sketch of the selection step follows this list).
  4. Generalizability Beyond Code: The critics also generalize beyond code-related tasks, identifying hundreds of errors in ChatGPT training data that had been rated as "flawless", even though most of those tasks are non-code and thus out-of-distribution for the critic model. This suggests the critique model is robust across various domains of LLM output.
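
To make the trade-off concrete, below is a minimal Python sketch of the FSBS selection step. The scoring rule, a reward-model score plus a length modifier times the number of highlights, follows the paper's description, but the stand-in functions `sample_critique` and `reward_score` are hypothetical placeholders for the critic LLM (with quoted-code highlights forced during decoding) and the reward model; everything else here is illustrative, not the paper's implementation.

```python
from dataclasses import dataclass
import random

@dataclass
class Critique:
    text: str
    num_highlights: int  # quoted-code sections that flag specific bugs

def sample_critique(task: str) -> Critique:
    # Stand-in for sampling from the critic LLM; FSBS constrains decoding
    # to emit highlighted code quotes, modeled here as a random count.
    n = random.randint(1, 4)
    return Critique(text=f"critique of {task!r} ({n} highlights)", num_highlights=n)

def reward_score(critique: Critique) -> float:
    # Stand-in for the reward model's scalar quality score.
    return random.random()

def fsbs(task: str, n_samples: int = 8, length_modifier: float = 0.5) -> Critique:
    """Return the sampled critique that maximizes
    reward_score + length_modifier * num_highlights."""
    samples = [sample_critique(task) for _ in range(n_samples)]
    return max(samples, key=lambda c: reward_score(c) + length_modifier * c.num_highlights)

if __name__ == "__main__":
    print(fsbs("def add(a, b): return a - b").text)
```

Sweeping `length_modifier` traces out the frontier the authors describe: larger values favor longer, more comprehensive critiques at the cost of more hallucinated bugs, while smaller values favor precision.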

Implications and Future Directions

The implications of this research are both practical and theoretical. Practically, deploying LLM critics in code-heavy development environments can increase productivity and decrease the risk of shipping flawed code. Theoretically, this work contributes to the ongoing discourse on scalable oversight of AI systems, underscoring the need for effective evaluation mechanisms as models become increasingly capable.

A direct next step could be integrating critique models into broader AI development processes, where critiques iteratively influence training regimes beyond RLHF. Advancing FSBS or similar techniques to better balance precision and coverage in critiques could further enhance model utility.

Finally, while the paper does not present these findings as revolutionary, the iterative improvement it demonstrates positions LLM critics as a substantial step toward addressing RLHF's limitations and advancing AI alignment methods. As models become more capable, ensuring scalable oversight remains a critical challenge, and this work highlights one potential solution pathway through advanced critique systems.

Authors (6)
  1. Nat McAleese (11 papers)
  2. Rai Michael Pokorny (1 paper)
  3. Evgenia Nitishinskaya (3 papers)
  4. Maja Trebacz (9 papers)
  5. Jan Leike (49 papers)
  6. Juan Felipe Ceron Uribe (1 paper)
Citations (34)