Essay: Evaluation and Self-Improvement in LLMs
The paper "Self-critiquing models for assisting human evaluators," authored by researchers at OpenAI, presents a paper focused on fine-tuning LLMs to produce natural language critiques. These critiques aim to help human evaluators identify flaws in topic-based summaries generated by other models. The research is motivated by the challenge of evaluating complex model outputs, which often require expertise and significant effort from human evaluators. It proposes a scalable oversight mechanism that leverages AI's capability to assist humans through critiques.
Key Findings and Contributions
The researchers outline several key contributions from their work:
- Model-Assisted Critiquing: Model-generated critiques helped human evaluators identify more flaws. In the experiments, evaluators assisted by critiques found roughly 50% more flaws in summaries than unassisted evaluators.
- Scaling and Critique Quality: Larger models generated more helpful critiques and were also better at critiquing their own outputs, indicating that the ability to identify and articulate flaws improves with model size.
- Improving Model Outputs via Self-Critique: Models that used their own critiques to refine their outputs produced noticeably better summaries, underscoring the potential for self-improvement in LLMs through internal feedback; a minimal sketch of such a refinement loop appears after this list.
- Generator-Discriminator-Critique (GDC) Gaps: The paper introduces a framework for measuring the gaps between a model's ability to generate good outputs, to discriminate good outputs from bad ones, and to critique outputs in natural language. Even large models turn out to hold knowledge they struggle to express as effective critiques, suggesting room for improvement in critique articulation; the second sketch below illustrates how these gaps might be scored.
- Public Dataset Release: The researchers contribute to the scientific community by releasing datasets used for training and experiments, fostering further advancements in AI critique generation and evaluation.
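To make the self-critique finding concrete, here is a minimal Python sketch of a critique-and-refine loop. The three callables (generate, critique, refine) are hypothetical stand-ins for conditioned calls to the same fine-tuned model; this illustrates the general idea rather than the authors' exact procedure.

```python
from typing import Callable


def refine_with_self_critique(
    prompt: str,
    generate: Callable[[str], str],            # model call: prompt -> summary
    critique: Callable[[str, str], str],       # model call: (prompt, summary) -> critique text
    refine: Callable[[str, str, str], str],    # model call: (prompt, summary, critique) -> revised summary
    rounds: int = 2,
) -> str:
    """Iteratively revise a summary using the model's own critiques.

    Purely schematic: the callables are placeholders for conditioned samples from
    one model, and the paper's actual procedure differs in its details.
    """
    summary = generate(prompt)
    for _ in range(rounds):
        feedback = critique(prompt, summary)
        if not feedback.strip():  # no flaw surfaced; stop early
            break
        summary = refine(prompt, summary, feedback)
    return summary
```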
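The GDC framework can likewise be illustrated schematically. In the sketch below, the Example records and the three scoring callables are hypothetical placeholders for the paper's model-plus-human evaluation pipeline; the aim is only to show how generator (G), discriminator (D), and critique (C) scores and their gaps could be tallied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One evaluation item: a task prompt plus an answer with a human flaw label."""
    prompt: str
    answer: str
    has_flaw: bool  # ground-truth label from human annotation


def gdc_gaps(
    examples: List[Example],
    generate_ok: Callable[[str], bool],           # does the model's own answer pass review?
    discriminate_ok: Callable[[str, str], bool],  # does the model judge answer quality correctly?
    critique_ok: Callable[[str, str], bool],      # does the model surface a valid flaw when one exists?
) -> dict:
    """Estimate G, D, and C scores and the gaps between them (schematic only)."""
    g = sum(generate_ok(ex.prompt) for ex in examples) / len(examples)
    d = sum(discriminate_ok(ex.prompt, ex.answer) for ex in examples) / len(examples)
    flawed = [ex for ex in examples if ex.has_flaw]
    c = sum(critique_ok(ex.prompt, ex.answer) for ex in flawed) / max(len(flawed), 1)
    return {"G": g, "D": d, "C": c, "GD_gap": d - g, "DC_gap": d - c}
```

A persistent DC gap in this kind of tally corresponds to the paper's observation that models can often tell an answer is poor without being able to articulate why.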
Implications and Future Directions
This research offers significant practical and theoretical implications. Practically, integrating critique generation into model training can improve oversight and increase the reliability of AI systems, particularly in tasks where human evaluation is arduous. Theoretically, the work offers insight into how models can recognize and correct errors in their own outputs, paving the way for more adaptable AI.
Further research will likely focus on refining critique models to close the identified GDC gaps. Advances here could yield AI systems capable of more nuanced self-assessment and correction, enhancing alignment and trustworthiness. Future studies may also explore analogous mechanisms in other domains, such as code generation and open-ended question answering, where evaluating and critiquing outputs is particularly challenging.
Conclusion
The paper's exploration of self-critiquing in LLMs marks a crucial step towards automating the evaluation of AI-generated content. By equipping models to critique effectively and assist human evaluators, the work contributes to tackling the scalable oversight problem. The findings demonstrate the potential for large models to improve not only through external feedback but also via internally generated critiques, heralding new possibilities for self-improving AI systems.