The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation
This paper presents a methodology for generating synthetic Preference Optimization (PO) datasets with multi-agent workflows, focusing on automating and improving both response evaluation and response generation. The research aims to reduce the labor-intensive and often biased process of creating human-annotated PO datasets.
Overview and Methodology
The research introduces a dual-module approach to PO dataset generation: a response evaluation module and a response generation module. The response evaluation module examines the efficacy of LLMs as evaluators, comparing three prompting strategies: Direct Comparison, Independent Scoring, and Combined Scoring. These strategies are applied across three evaluation configurations: LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. The quality of each configuration is assessed through inter-rater agreement, measured by Cohen's Kappa on datasets such as Alpaca Eval, FairEval, PandaLM-Eval, and MT-Bench.
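As a concrete illustration of the agreement metric, the snippet below computes Cohen's Kappa with scikit-learn over hypothetical preference labels, assuming the LLM evaluator's verdicts are compared against human annotations; the labels are invented for the example and are not from the paper.

```python
# Hypothetical example: agreement between an LLM evaluator and human annotators,
# measured with Cohen's Kappa over pairwise preference labels.
from sklearn.metrics import cohen_kappa_score

# "A" = first response preferred, "B" = second response preferred
# (tie handling omitted for brevity).
human_labels = ["A", "B", "A", "A", "B", "A", "B", "B"]
llm_labels   = ["A", "B", "A", "B", "B", "A", "B", "A"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's Kappa: {kappa:.2f}")
```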
In the response generation module, the multi-agent framework operates through an LLM Feedback Loop involving a generator LLM and a feedback LLM to iteratively refine responses. Multiple configurations were tested, covering different roles for Llama and Gemma models. The primary metric for evaluating these configurations was the win rate, indicating the fraction of times a generation framework outperformed a single-agent baseline.
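A minimal sketch of such a loop is shown below, assuming `generator_llm` and `feedback_llm` are callables that wrap the two models; the prompts and the fixed iteration count are assumptions for illustration, not the paper's exact implementation.

```python
def feedback_loop(prompt, generator_llm, feedback_llm, n_iterations=3):
    """Iteratively refine a response: the generator drafts, the feedback LLM critiques."""
    response = generator_llm(f"Answer the following instruction:\n{prompt}")
    for _ in range(n_iterations):
        critique = feedback_llm(
            f"Instruction:\n{prompt}\n\nCandidate response:\n{response}\n\n"
            "Point out weaknesses and suggest concrete improvements."
        )
        response = generator_llm(
            f"Instruction:\n{prompt}\n\nPrevious response:\n{response}\n\n"
            f"Reviewer feedback:\n{critique}\n\n"
            "Rewrite the response, addressing the feedback."
        )
    return response
```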
Empirical Findings
LLM-as-a-Judge
The LLM-as-a-Judge configuration provided substantial insights:
- Direct Comparison and Combined Scoring consistently outperformed Independent Scoring, with the latter showing lower Cohen’s Kappa values.
- Between the Direct Comparison and Combined Scoring strategies, Combined Scoring on a 10-point scale produced the strongest results (hypothetical prompt templates for both strategies are sketched after this list).
- GPT-4o outperformed the other LLMs across the datasets, underscoring its reliability as an evaluator, but it also showed potential bias when candidate responses came from its own model family, as evidenced on the FairEval dataset.
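To make the strategies concrete, the snippet below sketches hypothetical prompt templates for Direct Comparison and for Combined Scoring on a 10-point scale; the wording is illustrative rather than the paper's actual prompts.

```python
# Hypothetical prompt templates illustrating two of the evaluation strategies.
DIRECT_COMPARISON = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
    "Which response answers the instruction better? Reply with 'A' or 'B'."
)

COMBINED_SCORING = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
    "Score each response on a scale of 1 to 10 for helpfulness and correctness.\n"
    "Reply in the form: 'A: <score>, B: <score>'."
)
```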
LLMs-as-a-Jury
The LLMs-as-a-Jury configuration presented mixed results:
- The jury approach surpassed the single-judge configuration on FairEval and MT-Bench, but its advantage was not consistent across the other datasets.
- Juries mitigated biases seen in single-agent evaluations, particularly those involving responses from the judge’s own model family.
- The top-performing juries combined Gemma and GPT models, suggesting that diversity among the LLMs contributes to more robust evaluation outcomes (a minimal majority-vote aggregation sketch follows this list).
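A minimal sketch of jury aggregation, assuming each juror is a callable that returns 'A' or 'B' and that verdicts are combined by simple majority vote (the paper's exact aggregation may differ):

```python
from collections import Counter

def jury_verdict(instruction, response_a, response_b, juror_llms):
    """Hypothetical majority-vote aggregation across several LLM judges."""
    votes = []
    for juror in juror_llms:
        verdict = juror(
            f"Instruction:\n{instruction}\n\nResponse A:\n{response_a}\n\n"
            f"Response B:\n{response_b}\n\nWhich is better? Answer 'A' or 'B'."
        ).strip()
        votes.append(verdict)
    # The preference label backed by the most jurors wins.
    return Counter(votes).most_common(1)[0][0]
```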
LLM Debate
The LLM Debate configuration demonstrated:
- An increased Cohen's Kappa score for GPT-4o on FairEval when using the Debate framework, showcasing improved impartiality.
- Variable performance across datasets, with no consistent superiority over the jury or single-judge frameworks.
- Higher resource demands and more complex orchestration than single-agent setups (a debate-then-adjudicate sketch follows this list).
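A minimal sketch of a debate-then-adjudicate flow, assuming two debater LLMs argue for opposite verdicts over a fixed number of rounds before a separate judge decides; the round count, prompts, and role assignment are assumptions for illustration:

```python
def llm_debate(instruction, response_a, response_b, debaters, judge, n_rounds=2):
    """Hypothetical debate: two LLMs argue opposite sides, a separate judge decides."""
    transcript = []
    for round_idx in range(n_rounds):
        for side, llm in zip(("A", "B"), debaters):
            context = "\n".join(transcript) if transcript else "(no arguments yet)"
            argument = llm(
                f"Instruction:\n{instruction}\n\n"
                f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
                f"Debate so far:\n{context}\n\n"
                f"Argue that Response {side} is the better answer."
            )
            transcript.append(f"Round {round_idx + 1}, advocate for {side}: {argument}")
    verdict = judge(
        "Given the debate transcript below, decide which response better answers "
        "the instruction. Answer 'A' or 'B'.\n\n" + "\n".join(transcript)
    )
    return verdict.strip()
```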
Response Generation and Multi-Agent Configuration
In the LLM Feedback Loop, configurations with Llama-3.1-8b as the generator and Gemma-2-9b as the reviewer performed best, achieving a clear win-rate advantage over single-agent baselines (a simple win-rate calculation is sketched after this list):
- Llama as the generator with Gemma as a reviewer showed a 49.0% win rate against GPT-4o, 71.8% against single-agent Llama, and 73.8% against single-agent Gemma.
- Results indicated that leveraging the complementary strengths of different models enhances the quality of generated responses, validating the efficacy of multi-agent frameworks.
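For reference, the win-rate metric reduces to a simple fraction over head-to-head outcomes; the sketch below assumes a list of per-prompt outcome labels and an illustrative tie-handling rule, since this summary does not specify how the paper treats ties.

```python
def win_rate(judgments):
    """Fraction of head-to-head comparisons won by the multi-agent framework.

    `judgments` is a hypothetical list of outcome labels, one per prompt:
    "win", "loss", or "tie". Ties count as half a win here (an assumption).
    """
    score = judgments.count("win") + 0.5 * judgments.count("tie")
    return score / len(judgments)

# Purely illustrative: 718 wins in 1000 comparisons with no ties -> 0.718 win rate.
print(win_rate(["win"] * 718 + ["loss"] * 282))
```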
Practical and Theoretical Implications
The paper's findings carry both practical and theoretical implications for AI development:
- Practical Implications: Automating PO dataset generation significantly reduces reliance on human annotators, streamlining the creation of large-scale, high-quality datasets critical for fine-tuning and improving LLMs.
- Theoretical Implications: The comparative analysis of evaluation frameworks deepens the understanding of model bias, consistency, and effectiveness, paving the way for more impartial, robust, and scalable AI systems.
Conclusion and Future Work
The research concludes with the successful generation of multiple PO datasets aimed at improving individual and multi-agent LLM capabilities. The use of GPT-4o as a judge and the Llama-Gemma feedback loop configuration emerged as effective strategies. Future work could explore:
- Model Performance Post-Fine-Tuning: Assessing the impact of the generated datasets on models fine-tuned with these datasets.
- Larger Scale Models: Employing larger models like Llama-3.1-70b and Gemma-2-27b for dataset generation.
- Iteration Experiments: Analyzing the number of iterations in the feedback loop to optimize the framework further.
In summary, this research lays a foundation for more advanced, automated frameworks in generating and evaluating PO datasets, enhancing both practical applications and theoretical understanding in AI.