Analyzing Human Alignment in LLM Judgments Through Fair Preference Optimization
This paper investigates the sensitivity of LLMs to prompt design and the preference biases that result in pairwise evaluations. While LLMs have shown promise as autonomous evaluators across multiple language generation tasks, they exhibit preference biases that can lead to misalignment with human judgments. The authors address this challenge by proposing a framework called ZEPO (Zero-shot Preference Optimization), which aims to make LLM preference distributions fairer and, in doing so, bring LLM judgments closer to human preferences.
The central observation of the research is that LLMs exhibit inconsistent preferences when given paraphrased yet semantically equivalent instructions. The authors document that LLMs can produce skewed preference distributions, with judgments varying significantly under minimal prompt alterations. This instability calls into question their reliability as objective evaluators and motivates a method for steering their outputs toward judgments humans would consider fair and coherent.
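To make the phenomenon concrete, the following Python sketch (not the authors' code; `toy_judge` is a hypothetical stand-in for a real LLM call) shows how one might measure a judge's first-position preference rate under two paraphrased instructions and observe it swing away from the balanced 0.5 point.

```python
from typing import Callable, List, Tuple

def preference_rate(
    judge: Callable[[str, str, str], float],
    instruction: str,
    pairs: List[Tuple[str, str]],
) -> float:
    """Average probability mass the judge assigns to the first candidate."""
    return sum(judge(instruction, a, b) for a, b in pairs) / len(pairs)

def toy_judge(instruction: str, a: str, b: str) -> float:
    # Placeholder: a real judge would prompt the LLM with the instruction and
    # both candidates, then read off normalized token probabilities for "A" vs "B".
    return 0.8 if "better" in instruction else 0.55

pairs = [("summary one", "summary two"), ("draft x", "draft y")]
for instr in ("Which summary is better, A or B?",
              "Compare summaries A and B and choose one."):
    # A fair judge would sit near 0.5 when candidate order carries no signal;
    # large swings between paraphrases are the instability documented above.
    print(f"{instr!r} -> preference rate {preference_rate(toy_judge, instr, pairs):.2f}")
```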
To tackle these inconsistencies, ZEPO employs a zero-shot learning objective focused on optimizing preference fairness. The framework requires no labeled data; instead, it leverages the LLM's intrinsic output distributions. The primary learning signal compares the model's decision distribution to a uniform distribution, which intuitively reflects a fair judgmental outcome.
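A minimal sketch of this objective follows, under two assumptions that go beyond the summary above: that the fairness signal is the gap between the judge's decision distribution and the uniform 0.5 target, and that optimization amounts to selecting among candidate instructions. The paper's exact formulation may differ; the snippet reuses the hypothetical `preference_rate` and `toy_judge` helpers from the sketch above.

```python
def fairness_gap(judge, instruction, pairs) -> float:
    """Distance of the judge's first-position preference rate from the uniform
    0.5 target; 0.0 corresponds to a perfectly balanced (fair) judge."""
    return abs(preference_rate(judge, instruction, pairs) - 0.5)

candidate_instructions = [
    "Which summary is better, A or B?",
    "Compare summaries A and B and choose one.",
    "Decide which of the two summaries you prefer.",
]

# No human labels are consulted: the objective only inspects the model's own
# output distribution, which is what makes the optimization zero-shot.
best = min(candidate_instructions,
           key=lambda instr: fairness_gap(toy_judge, instr, pairs))
print("fairest instruction:", best)
```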
Experimentally, ZEPO demonstrated consistent improvements over existing state-of-the-art LLM evaluators. It was evaluated on several benchmark datasets spanning domains such as summarization and dialogue, and evaluators using ZEPO consistently yielded decisions that matched human preferences more closely. The results underscore the connection between fairer preference distributions and agreement with human judgments, with ZEPO emerging as an efficient way to bridge the gap.
The theoretical implications of this paper center on the notion of fairness in machine judgment. By exposing preference bias in LLM responses, the paper motivates further research into aligned AI systems that more accurately represent human viewpoints. Practically, its methodology can inform the development of more robust LLM evaluators that are decisive yet fair, improving applications ranging from automated content generation to sophisticated human-machine interactive systems.
Looking ahead, the ZEPO framework may prove useful not only for aligning current models but also for guiding the design of future LLMs. By integrating fairness principles at the core of LLM evaluative processes, it may reduce the need for extensive prompt crafting, streamlining the use of LLMs across AI disciplines. Furthermore, the synergy explored between ZEPO and existing debiasing techniques points to a multidisciplinary approach to developing AI systems with inherent fairness and precision.
Overall, this research takes an insightful step toward understanding and improving preference fairness in AI evaluation, substantially increasing the alignment of LLM-derived judgments with human perspectives without recourse to extensive annotated datasets. The work highlights the pivotal role of fairness in human-aligned AI and may influence a broader range of applications beyond conventional text evaluation.