- The paper presents an iterative self-training mechanism that leverages LLM-generated synthetic data to train evaluators without human annotations.
- It constructs synthetic preference pairs with known labels and refines the model's judgments through iterative training.
- Experimental results show evaluation accuracy on RewardBench rising from 75.4 to 88.3, rivaling the performance of reward models trained on human-annotated data.
An Analysis of "Self-Taught Evaluators"
The paper "Self-Taught Evaluators" authored by Tianlu Wang, Ilia Kulikov, Olga Golovneva, and colleagues, explores a novel approach to enhancing Model-based evaluation using LLMs without the need for human annotated data. The principal focus of the research lies in creating robust evaluator models—essential for model development, alignment, and iterative self-improvement—by leveraging synthetic data generated by LLMs themselves.
Summary
Model-based evaluation is a critical component of LLM development, both as a reward model for training and as an alternative to human evaluation. The standard way to train these evaluators is to collect extensive human preference judgments over model responses, which is costly and quickly becomes outdated as models improve. The authors present an approach that improves evaluators using only synthetic training data, eliminating the need for human annotations.
The proposed methodology is built on an iterative self-training scheme, which involves the following steps:
- Generating Synthetic Data: Starting from unlabeled instructions, contrasting model outputs are generated so that one response is known by construction to be inferior to the other, yielding labeled preference pairs.
- Training LLM-as-a-Judge: The model produces reasoning traces and final judgments for these pairs; judgments that agree with the known labels are retained as training data, and the process is repeated iteratively, using the improved model as the judge at each step (a minimal sketch follows this list).
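The following Python sketch makes one iteration of this scheme concrete. The callables it takes (`pair_builder`, `judge_fn`, `finetune_fn`) and the verdict convention are assumptions introduced for illustration, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Sketch of one self-training iteration (illustrative only).
# pair_builder, judge_fn, and finetune_fn are hypothetical stand-ins for
# LLM calls and training code; they are not part of the paper's codebase.
def self_training_iteration(
    instructions: List[str],
    pair_builder: Callable[[str], Tuple[str, str]],        # instruction -> (chosen, rejected)
    judge_fn: Callable[[str, str, str], Tuple[str, str]],   # -> (reasoning trace, verdict)
    finetune_fn: Callable[[List[dict]], object],            # training data -> new judge model
    num_judgment_samples: int = 4,
):
    training_examples = []
    for instruction in instructions:
        # 1. Build a synthetic preference pair whose label is known by construction.
        chosen, rejected = pair_builder(instruction)

        # 2. Sample judgments (reasoning trace + verdict) from the current judge and
        #    keep one whose verdict agrees with the known synthetic label.
        for _ in range(num_judgment_samples):
            reasoning, verdict = judge_fn(instruction, chosen, rejected)
            if verdict == "A":  # "A" marks the known-preferred response in this sketch
                training_examples.append({
                    "instruction": instruction,
                    "chosen": chosen,
                    "rejected": rejected,
                    "reasoning": reasoning,
                    "verdict": verdict,
                })
                break

    # 3. Fine-tune the judge on its own verified judgments; the returned model
    #    becomes the judge for the next iteration.
    return finetune_fn(training_examples)
```

Running this function repeatedly, with each newly fine-tuned model serving as the judge for the next pass, is what produces the iterative improvement reported in the experiments.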
The effectiveness of this approach is demonstrated through rigorous experiments, starting from Llama3-70B-Instruct, showing significant improvements on RewardBench with an accuracy boost from 75.4 to 88.3. This performance surpasses commonly used LLM evaluators such as GPT-4 and matches top-performing reward models trained with labeled data.
Methodological Insights
The paper delineates a comprehensive methodological framework, consisting of:
- Instruction Selection: Curating a challenging and diverse set of user instructions by having an LLM classify each instruction into a category and keeping only the desired categories.
- Response Pair Construction: Generating a preferred and an inferior response for each instruction; the inferior response is obtained by answering a subtly modified version of the instruction, so the preference label is known by construction (see the sketch after this list).
- Iterative Training: Fine-tuning the LLM-as-a-Judge on its own verified synthetic judgments over successive iterations, so that the model progressively improves its evaluation capabilities.
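As a concrete illustration of the pair-construction idea, the sketch below builds a preference pair from a single instruction. The `complete` callable and the prompt wording are assumptions for illustration and do not reproduce the paper's exact prompts.

```python
from typing import Callable, Dict

# Illustrative sketch of synthetic response-pair construction.
# `complete` is an assumed text-completion callable (prompt -> model output).
def build_preference_pair(complete: Callable[[str], str], instruction: str) -> Dict[str, str]:
    # Preferred response: answer the original instruction directly.
    chosen = complete(f"Respond to the following user instruction.\n\n{instruction}")

    # Inferior response: ask the model for a subtly modified version of the
    # instruction, then answer that modified instruction. The result is a
    # plausible but likely off-target answer to the original request.
    noisy_instruction = complete(
        "Rewrite the following instruction so that it is similar but asks for "
        f"something slightly different:\n\n{instruction}"
    )
    rejected = complete(f"Respond to the following user instruction.\n\n{noisy_instruction}")

    # The preference label is known by construction: `chosen` should win.
    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}
```

Because the label comes from the construction itself rather than from a human annotator, such pairs can be generated at scale for any new instruction distribution.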
Experimental Setup and Results
- Training Data Sources: Initial training was conducted using instructions from the WildChat dataset. For evaluations, the RewardBench, MT-Bench, and HelpSteer2 datasets were employed.
- Iterative Improvement: The model showed consistent improvement across iterations, with each step refining its judgment accuracy.
- Numerical Results: The final model achieved an overall score of 88.3 on RewardBench, with majority-vote inference pushing performance to 88.7 (a minimal voting sketch follows this list). On MT-Bench, the model showed parity with GPT-4, achieving an agreement rate of 78.9%.
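A minimal sketch of majority-vote inference is shown below, assuming a hypothetical `judge_fn` that returns a verdict ("A" or "B") for a response pair; the default sample count is a parameter of the sketch, not the paper's exact setting.

```python
from collections import Counter
from typing import Callable, Tuple

# Majority-vote inference over sampled judgments (illustrative only).
# `judge_fn` is an assumed callable: (instruction, response_a, response_b) -> (reasoning, verdict).
def majority_vote_verdict(
    judge_fn: Callable[[str, str, str], Tuple[str, str]],
    instruction: str,
    response_a: str,
    response_b: str,
    num_samples: int = 32,
) -> str:
    verdicts = [judge_fn(instruction, response_a, response_b)[1] for _ in range(num_samples)]
    # Return the verdict that appears most often across the sampled judgments.
    return Counter(verdicts).most_common(1)[0][0]
```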
Implications and Future Developments
The implications of this research are manifold:
- Scalability: By eliminating human annotation, the approach significantly reduces the cost and time required to train robust evaluators.
- Adaptability: The synthetic data generation approach allows for rapid adaptation to new tasks or evaluation criteria without the need for fresh human annotations.
- Iterative Self-Improvement: The framework supports continuous improvement, ensuring the evaluators remain relevant and accurate as models evolve.
Future developments could focus on:
- Extending to Smaller Models: Investigating the applicability and effectiveness of this approach on smaller LLMs.
- Single Response Evaluation: Expanding the framework to evaluate single responses, beyond pairwise comparisons.
- Efficiency Enhancements: Exploring ways to reduce the computational requirements and inference costs associated with LLM-as-a-Judge models.
Conclusion
The research presents a pragmatic and efficient approach to training evaluator models using synthetic data. The iterative self-training mechanism allows evaluators to self-improve, maintaining their relevance and accuracy over time. The resulting models not only match but, in some instances, surpass the performance of evaluators trained on human-annotated data, paving the way for scalable and adaptable model-based evaluation.