
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (2403.16950v3)

Published 25 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. In this work, we first conduct a systematic study of the misalignment between LLM evaluators and human judgement, revealing that existing calibration methods aimed at mitigating biases are insufficient for effectively aligning LLM evaluators. Inspired by the use of preference data in RLHF, we formulate the evaluation as a ranking problem and introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts. PairS achieves state-of-the-art performance on representative evaluation tasks and demonstrates significant improvements over direct scoring. Furthermore, we provide insights into the role of pairwise preference in quantifying the transitivity of LLMs and demonstrate how PairS benefits from calibration.

Aligning with Human Judgment: The Role of Pairwise Preference in LLM Evaluators

The paper "Aligning with Human Judgment: The Role of Pairwise Preference in LLM Evaluators" by Liu et al. addresses the critical challenge of improving the alignment between LLM evaluators and human judgment. Despite their capabilities, LLMs have been found to exhibit biases and inconsistencies when used as evaluators for generated text. This paper systematically analyzes these inconsistencies, revealing the limitations of existing calibration methods. It introduces the Pairwise Preference Search (PairS) as a strategy to align LLM evaluators more closely with human perspectives.

Motivation and Methodology

The research begins by examining the misalignment between LLM evaluations and human judgments, focusing on systematic biases that undermine reliability. Conventional calibration techniques, although intended to mitigate bias, fall short of aligning the evaluations effectively, primarily because the misalignment stems from divergent evaluation standards between LLMs and humans rather than from biased priors alone.

Inspired by reinforcement learning from human feedback (RLHF), which uses ranked comparisons to align LLMs with human preferences, the authors reformulate evaluation as a ranking problem. Eliciting pairwise preferences, rather than direct scores, offers a more promising route to alignment: the paper introduces PairS, an uncertainty-guided search method that uses pairwise comparisons made by the LLM to efficiently rank candidate texts.
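To make the ranking formulation concrete, the sketch below ranks candidates with a standard merge sort driven only by pairwise preference calls. It is a minimal illustration rather than the paper's algorithm: PairS performs an uncertainty-guided search, and the `prefers` callable here is a hypothetical stand-in for an LLM comparison prompt.

```python
from typing import Callable, List

def merge_sort_rank(candidates: List[str],
                    prefers: Callable[[str, str], bool]) -> List[str]:
    """Rank candidates from best to worst using only pairwise preferences.

    `prefers(a, b)` should return True if `a` is judged better than `b`.
    In a PairS-style evaluator this judgement would come from prompting an
    LLM; here it is a hypothetical callable supplied by the user.
    """
    if len(candidates) <= 1:
        return candidates
    mid = len(candidates) // 2
    left = merge_sort_rank(candidates[:mid], prefers)
    right = merge_sort_rank(candidates[mid:], prefers)

    # Merge the two ranked halves, asking for a preference at each step.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if prefers(left[i], right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

if __name__ == "__main__":
    # Toy preference (longer summary preferred), purely for illustration.
    summaries = ["short", "a medium summary", "a rather detailed summary"]
    print(merge_sort_rank(summaries, prefers=lambda a, b: len(a) > len(b)))
```

A merge-sort-style comparator needs roughly O(N log N) pairwise calls instead of the O(N^2) required to compare every pair, which is the efficiency argument behind ranking with pairwise preferences.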

Numerical Results and Findings

PairS demonstrates state-of-the-art performance across representative evaluation tasks, significantly outperforming direct scoring methods, as evidenced by stronger Spearman correlations with human judgments on tasks such as summarization and open-ended generation. PairS efficiently approximates the maximum-likelihood estimate (MLE) of the preference ranking, yielding robust and largely transitive evaluations. Its uncertainty-guided use of pairwise comparisons also reduces the number of comparisons required, lowering computational cost while improving alignment with human evaluative standards.
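Spearman correlation, the alignment metric referenced above, is computed directly from paired evaluator and human scores. The snippet below is a toy illustration with made-up numbers (not the paper's data), using SciPy's `spearmanr`.

```python
from scipy.stats import spearmanr

# Hypothetical scores an LLM evaluator assigns to five candidate summaries,
# alongside the corresponding human ratings (illustrative values only).
llm_scores   = [4.5, 3.0, 4.0, 2.0, 5.0]
human_scores = [4.0, 3.5, 4.5, 2.0, 5.0]

# Spearman's rho measures rank agreement: 1.0 means identical orderings.
rho, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```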

Implications and Future Directions

The implications of employing pairwise preference in LLM evaluators are both practical and theoretical. Practically, it offers a more reliable and efficient way to evaluate natural language generation that aligns closely with human assessments, improving trust and applicability across domains. Theoretically, it provides insight into the transitivity of LLM evaluations and sets a benchmark for future work on aligning evaluators with human feedback.

The paper opens several avenues for future exploration. Calibration using pairwise preferences could be further optimized and tested across a wider range of LLMs and contextual applications. Additionally, investigating the structural and architectural components of LLMs that account for transitivity in evaluations could provide deeper insights into their design for improved human alignment.

Conclusion

The work by Liu et al. underscores the importance of aligning AI-generated evaluations with human judgment. By systematically analyzing the limitations of existing calibration methods and proposing a novel pairwise preference strategy, it raises the standard for LLM-based evaluation. The paper is likely to serve as a foundation for future research on improving LLM evaluator reliability and building more refined, human-aligned AI systems.

Authors (7)
  1. Yinhong Liu (16 papers)
  2. Han Zhou (72 papers)
  3. Zhijiang Guo (55 papers)
  4. Ehsan Shareghi (54 papers)
  5. Ivan Vulić (130 papers)
  6. Anna Korhonen (90 papers)
  7. Nigel Collier (83 papers)
Citations (32)