The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation
This paper presents a methodology for generating synthetic Preference Optimization (PO) datasets with multi-agent workflows, focusing on automating and improving both response evaluation and response generation. The research aims to reduce the labor-intensive and often biased process of creating human-annotated PO datasets.
Overview and Methodology
The research introduces a dual-module approach to PO dataset generation: a response evaluation module and a response generation module. The response evaluation module examines the efficacy of LLMs as evaluators, comparing three prompting strategies: Direct Comparison, Independent Scoring, and Combined Scoring. These strategies are applied across three evaluation configurations: LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. The quality of each configuration is assessed through inter-rater agreement, measured by Cohen's Kappa on datasets such as Alpaca Eval, FairEval, PandaLM-Eval, and MT-Bench.
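As a concrete illustration of the agreement metric, the snippet below computes Cohen's Kappa with scikit-learn over hypothetical preference labels, assuming the LLM evaluator's verdicts are compared against human annotations; the labels are invented for the example and are not from the paper.

```python
# Hypothetical example: agreement between an LLM evaluator and human annotators,
# measured with Cohen's Kappa over pairwise preference labels.
from sklearn.metrics import cohen_kappa_score

# "A" = first response preferred, "B" = second response preferred
# (tie handling omitted for brevity).
human_labels = ["A", "B", "A", "A", "B", "A", "B", "B"]
llm_labels   = ["A", "B", "A", "B", "B", "A", "B", "A"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's Kappa: {kappa:.2f}")
```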
In the response generation module, the multi-agent framework operates through an LLM Feedback Loop involving a generator LLM and a feedback LLM to iteratively refine responses. Multiple configurations were tested, covering different roles for Llama and Gemma models. The primary metric for evaluating these configurations was the win rate, indicating the fraction of times a generation framework outperformed a single-agent baseline.
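A minimal sketch of such a loop is shown below, assuming `generator_llm` and `feedback_llm` are callables that wrap the two models; the prompts and the fixed iteration count are assumptions for illustration, not the paper's exact implementation.

```python
def feedback_loop(prompt, generator_llm, feedback_llm, n_iterations=3):
    """Iteratively refine a response: the generator drafts, the feedback LLM critiques."""
    response = generator_llm(f"Answer the following instruction:\n{prompt}")
    for _ in range(n_iterations):
        critique = feedback_llm(
            f"Instruction:\n{prompt}\n\nCandidate response:\n{response}\n\n"
            "Point out weaknesses and suggest concrete improvements."
        )
        response = generator_llm(
            f"Instruction:\n{prompt}\n\nPrevious response:\n{response}\n\n"
            f"Reviewer feedback:\n{critique}\n\n"
            "Rewrite the response, addressing the feedback."
        )
    return response
```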
Empirical Findings
LLM-as-a-Judge
The LLM-as-a-Judge configuration provided substantial insights:
- Direct Comparison and Combined Scoring consistently outperformed Independent Scoring, with the latter showing lower Cohen’s Kappa values.
- Between the Direct Comparison and Combined Scoring strategies, Combined Scoring on a 10-point scale produced the strongest results (hypothetical prompt templates for both strategies are sketched after this list).
- GPT-4o outperformed the other LLMs across the datasets, underscoring its reliability as an evaluator, but it also showed potential bias when candidate responses came from its own model family, as evidenced on the FairEval dataset.
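To make the strategies concrete, the snippet below sketches hypothetical prompt templates for Direct Comparison and for Combined Scoring on a 10-point scale; the wording is illustrative rather than the paper's actual prompts.

```python
# Hypothetical prompt templates illustrating two of the evaluation strategies.
DIRECT_COMPARISON = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
    "Which response answers the instruction better? Reply with 'A' or 'B'."
)

COMBINED_SCORING = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
    "Score each response on a scale of 1 to 10 for helpfulness and correctness.\n"
    "Reply in the form: 'A: <score>, B: <score>'."
)
```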
LLMs-as-a-Jury
The LLMs-as-a-Jury configuration presented mixed results:
- The jury approach surpassed the single-judge configuration on FairEval and MT-Bench, but its advantage was not consistent across the other datasets.
- Juries mitigated biases seen in single-agent evaluations, particularly those involving responses from the judge’s own model family.
- The top-performing juries combined Gemma and GPT models, suggesting that diversity among the LLMs contributes to more robust evaluation outcomes (a minimal majority-vote aggregation sketch follows this list).
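A minimal sketch of jury aggregation, assuming each juror is a callable that returns 'A' or 'B' and that verdicts are combined by simple majority vote (the paper's exact aggregation may differ):

```python
from collections import Counter

def jury_verdict(instruction, response_a, response_b, juror_llms):
    """Hypothetical majority-vote aggregation across several LLM judges."""
    votes = []
    for juror in juror_llms:
        verdict = juror(
            f"Instruction:\n{instruction}\n\nResponse A:\n{response_a}\n\n"
            f"Response B:\n{response_b}\n\nWhich is better? Answer 'A' or 'B'."
        ).strip()
        votes.append(verdict)
    # The preference label backed by the most jurors wins.
    return Counter(votes).most_common(1)[0][0]
```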
LLM Debate
The LLM Debate configuration demonstrated:
- An increased Cohen's Kappa score for GPT-4o on FairEval when using the Debate framework, showcasing improved impartiality.
- Variable performance across datasets, with no consistent superiority over the jury or single-judge frameworks.
- Higher resource demands and more complex orchestration than single-agent setups (a debate-then-adjudicate sketch follows this list).
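A minimal sketch of a debate-then-adjudicate flow, assuming two debater LLMs argue for opposite verdicts over a fixed number of rounds before a separate judge decides; the round count, prompts, and role assignment are assumptions for illustration:

```python
def llm_debate(instruction, response_a, response_b, debaters, judge, n_rounds=2):
    """Hypothetical debate: two LLMs argue opposite sides, a separate judge decides."""
    transcript = []
    for round_idx in range(n_rounds):
        for side, llm in zip(("A", "B"), debaters):
            context = "\n".join(transcript) if transcript else "(no arguments yet)"
            argument = llm(
                f"Instruction:\n{instruction}\n\n"
                f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
                f"Debate so far:\n{context}\n\n"
                f"Argue that Response {side} is the better answer."
            )
            transcript.append(f"Round {round_idx + 1}, advocate for {side}: {argument}")
    verdict = judge(
        "Given the debate transcript below, decide which response better answers "
        "the instruction. Answer 'A' or 'B'.\n\n" + "\n".join(transcript)
    )
    return verdict.strip()
```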
Response Generation and Multi-Agent Configuration
In the LLM Feedback Loop, configurations with Llama-3.1-8b as the generator and Gemma-2-9b as the reviewer performed best, achieving a clear win-rate advantage over single-agent baselines (a simple win-rate calculation is sketched after this list):
- Llama as the generator with Gemma as a reviewer showed a 49.0% win rate against GPT-4o, 71.8% against single-agent Llama, and 73.8% against single-agent Gemma.
- Results indicated that leveraging the complementary strengths of different models enhances the quality of generated responses, validating the efficacy of multi-agent frameworks.
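For reference, the win-rate metric reduces to a simple fraction over head-to-head outcomes; the sketch below assumes a list of per-prompt outcome labels and an illustrative tie-handling rule, since this summary does not specify how the paper treats ties.

```python
def win_rate(judgments):
    """Fraction of head-to-head comparisons won by the multi-agent framework.

    `judgments` is a hypothetical list of outcome labels, one per prompt:
    "win", "loss", or "tie". Ties count as half a win here (an assumption).
    """
    score = judgments.count("win") + 0.5 * judgments.count("tie")
    return score / len(judgments)

# Purely illustrative: 718 wins in 1000 comparisons with no ties -> 0.718 win rate.
print(win_rate(["win"] * 718 + ["loss"] * 282))
```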
Practical and Theoretical Implications
The paper's findings carry both practical and theoretical implications for AI development:
- Practical Implications: Automating PO dataset generation significantly reduces reliance on human annotators, streamlining the creation of large-scale, high-quality datasets critical for fine-tuning and improving LLMs.
- Theoretical Implications: The comparative analysis of evaluation frameworks deepens the understanding of model bias, consistency, and effectiveness, paving the way for more impartial, robust, and scalable AI systems.
Conclusion and Future Work
The research concludes with the successful generation of multiple PO datasets aimed at improving individual and multi-agent LLM capabilities. The use of GPT-4o as a judge and the Llama-Gemma feedback loop configuration emerged as effective strategies. Future work could explore:
- Model Performance Post-Fine-Tuning: Assessing the impact of the generated datasets on models fine-tuned with these datasets.
- Larger Scale Models: Employing larger models like Llama-3.1-70b and Gemma-2-27b for dataset generation.
- Iteration Experiments: Analyzing the number of iterations in the feedback loop to optimize the framework further.
In summary, this research lays a foundation for more advanced, automated frameworks in generating and evaluating PO datasets, enhancing both practical applications and theoretical understanding in AI.