- The paper identifies preference leakage as a bias arising when LLMs used for data generation also serve as evaluators, leading to preferential treatment of related models.
- It demonstrates through experiments on benchmarks such as Arena-Hard and AlpacaEval 2.0 that closer model relatedness intensifies the bias, and that supervised fine-tuning and larger student models amplify it further.
- The study advocates for contamination-resistant evaluation methods and diversified data sources to address the ethical and practical challenges of bias in AI assessments.
The paper "Preference Leakage: A Contamination Problem in LLM-as-a-judge" addresses a subtle yet critical bias issue, referred to as preference leakage, which surfaces in scenarios where LLMs are employed both as data generators and evaluators. The paper identifies the potential for LLM-based synthetic data generation and evaluation to result in systematic bias due to the relatedness between data-generator LLMs and judge LLMs. This bias is coined as preference leakage and remains challenging to detect.
Key Contributions and Findings
- Problem Definition: Preference leakage arises when the LLM used for data generation and the LLM used for evaluation are related, causing the judge to favor the student models trained on the related generator's data. The paper specifies three kinds of relatedness:
- Identical models: When the generating and judging LLMs are the same.
- Inheritance relationship: One model is derived from the other, for example through fine-tuning or distillation.
- Same model family: Both models belong to the same series, such as the GPT or Gemini families.
- Research Questions: The paper articulates three core research questions:
- RQ1: Does preference leakage introduce systematic bias into LLM-as-a-judge evaluations?
- RQ2: What is the severity of preference leakage across different scenarios?
- RQ3: What mechanisms underlie preference leakage?
- Experimental Analysis: Experiments on widely used LLM-as-a-judge benchmarks, Arena-Hard and AlpacaEval 2.0, reveal significant bias in favor of related student models; the bias grows with the closeness of the generator-judge relationship and with the proportion of synthetic data in the student's training mix (a minimal measurement sketch follows this list).
- Findings:
- Bias Prevalence: Judges exhibit a clear preference for their related student models, indicating widespread preference leakage across various LLMs.
- Impact of Relatedness: The degree of relatedness strongly influences bias severity; the closer the generator-judge relationship, the stronger the bias.
- Influence of Tuning and Model Size: Supervised fine-tuning (SFT) exacerbates preference leakage more than other training approaches. Larger student models also amplify the bias, likely because their greater capacity lets them absorb more of the generator's patterns.
- Recognition and Challenges: Investigating whether the bias stems from judges recognizing their own model's outputs, the authors find that LLM judges do not reliably identify outputs from their related student models. This suggests preference leakage is more insidious than previously documented egocentric biases.
- Categorical Bias Analysis: Questions with more subjective answers, such as writing and coding tasks, and more subjective judgment dimensions are the most susceptible to the bias, which further complicates detection and mitigation.
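The findings above rest on comparing how a student fares under a related judge versus unrelated judges. Below is a hedged sketch of one way to quantify that gap from pairwise verdicts; the score used here is an illustrative simplification, not the paper's exact preference leakage metric, and the verdict lists are toy values.

```python
# Quantifying preference leakage (simplified): compare a student's win rate under its
# related judge with its win rate under an unrelated judge.

def win_rate(verdicts: list[str], student: str) -> float:
    """Fraction of pairwise comparisons won by the given student."""
    return sum(v == student for v in verdicts) / len(verdicts)

def leakage_score(related_verdicts: list[str], unrelated_verdicts: list[str], student: str) -> float:
    """Positive values mean the related judge favors its own student more than the unrelated judge does."""
    return win_rate(related_verdicts, student) - win_rate(unrelated_verdicts, student)

# Toy verdicts: each entry names the winner of one pairwise comparison.
related_judge   = ["student_A", "student_A", "student_A", "student_B"]  # judge related to student_A
unrelated_judge = ["student_A", "student_B", "student_B", "student_B"]  # independent judge

print(leakage_score(related_judge, unrelated_judge, "student_A"))  # 0.5 -> strong apparent leakage
```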
Implications and Future Directions
The paper calls attention to preference leakage as an underappreciated issue in LLM-based automatic evaluation and stresses the need for more robust evaluation methodologies that counteract this bias. It proposes exploring diversified data sources, contamination-resistant benchmarks, and alternative evaluation strategies. The research also underscores the ethical ramifications of biased evaluations: hidden preference patterns can adversely affect downstream applications such as AI alignment and decision-making in critical domains.
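As one concrete illustration of an evaluation strategy in this spirit, the sketch below aggregates pairwise verdicts from judges drawn from different model families so that no single judge's relatedness to a student dominates the outcome. The judge names and the majority-vote rule are assumptions for illustration, not a method prescribed by the paper.

```python
# Cross-family judge ensemble: majority vote over verdicts from unrelated judge families.
from collections import Counter

def ensemble_verdict(verdicts_by_judge: dict[str, str]) -> str:
    """Majority vote across judges; ambiguous splits fall back to 'tie'."""
    counts = Counter(verdicts_by_judge.values())
    winner, top = counts.most_common(1)[0]
    return winner if list(counts.values()).count(top) == 1 else "tie"

verdicts = {
    "judge_family_X": "student_A",  # possibly related to student_A
    "judge_family_Y": "student_B",
    "judge_family_Z": "student_B",
}
print(ensemble_verdict(verdicts))  # 'student_B' -- the related judge alone cannot decide
```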
Overall, the research highlights the subtlety of preference leakage and lays the groundwork for future work on understanding and mitigating the biases inherent in the LLM-as-a-judge paradigm.