Evaluating the Alignment and Vulnerabilities of LLMs as Judges
In the recent work titled Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, researchers Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes present a comprehensive evaluation of using LLMs as judges to assess responses from other LLMs. As models become more capable, human evaluation becomes increasingly difficult to scale, particularly in domains that require nuanced judgment. The paper aims to determine how well different LLM judges align with human evaluators and to uncover the strengths and vulnerabilities of the LLM-as-a-judge paradigm.
The authors examined thirteen judge models of varying sizes and families, evaluating nine diverse "exam-taker" models that include both base and instruction-tuned versions. Their findings reveal that only the largest models, such as GPT-4 and the larger Llama variants, achieve reasonable alignment with human judgments, and even these fall short of inter-human agreement levels. Notably, differences of up to 5 points persist between human-assigned and LLM-assigned scores, suggesting that judge and human assessments can diverge substantially even when overall alignment appears strong.
A key insight from this paper is the identification of inherent vulnerabilities within judge LLMs. Judges exhibited sensitivity to the complexity and length of prompts, along with a bias towards lenient evaluations. Moreover, alignment metrics like percent agreement, commonly used in evaluation, do not adequately reflect true alignment: because they do not correct for agreement that would occur by chance, they fail to distinguish judges of very different quality. Scott's Pi, a chance-corrected agreement coefficient, proved to be a more reliable metric, offering greater insight into the judges' performance against human standards; the contrast is sketched below.
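To make the metric contrast concrete, here is a minimal sketch (not the authors' code) using hypothetical correct/incorrect labels: when most answers are correct, a lenient judge agrees with humans on most items by default, so percent agreement looks high while the chance-corrected Scott's Pi is considerably lower.

```python
# Sketch with hypothetical labels: percent agreement vs Scott's Pi.
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def scotts_pi(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e), where the expected
    agreement p_e comes from the pooled label distribution of both raters."""
    p_o = percent_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    total = sum(pooled.values())
    p_e = sum((n / total) ** 2 for n in pooled.values())
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: 90% of answers are actually correct, and the judge is
# lenient, marking nearly everything correct.
human = ["correct"] * 90 + ["incorrect"] * 10
judge = ["correct"] * 97 + ["incorrect"] * 3

print(percent_agreement(human, judge))  # ~0.93: looks excellent
print(scotts_pi(human, judge))          # ~0.42: far less impressive once chance is removed
```

The gap between the two numbers illustrates why a skewed label distribution can make percent agreement misleading as a measure of judge quality.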
The implications of this research are both practical and theoretical. Practically, LLM judges can serve as scalable, cost-effective evaluators, but caution against over-reliance is warranted given their observed divergence from human judgment even on relatively structured tasks. Theoretically, the findings indicate that despite advanced training, even state-of-the-art LLMs retain biases and limitations that call for closer examination and refinement.
The paper speculates on future developments in which refined alignment techniques and a better understanding of LLM limitations could improve judge effectiveness. In particular, separating a judge's ability to rank exam-taker models (discrimination) from its ability to reproduce absolute human scores may inform future enhancements and applications of LLM judges, as the sketch below illustrates.
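The following minimal sketch uses hypothetical aggregate scores (not data from the paper) to show how the two notions can come apart: a judge that is uniformly lenient can still rank exam-taker models perfectly while its absolute scores sit several points above the human ones.

```python
# Sketch with hypothetical scores: ranking agreement vs absolute score gap.
from itertools import combinations

def kendall_tau(x, y):
    """Rank agreement between two score lists over the same models (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        concordant += s > 0
        discordant += s < 0
    return (concordant - discordant) / (concordant + discordant)

def mean_absolute_gap(x, y):
    """Average absolute difference between judge and human scores per model."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical aggregate scores (0-100) for five exam-taker models.
human_scores = [62, 55, 71, 48, 80]
judge_scores = [67, 60, 75, 54, 85]   # consistently ~5 points more lenient

print(kendall_tau(human_scores, judge_scores))       # 1.0: ranking fully preserved
print(mean_absolute_gap(human_scores, judge_scores)) # 5.0: absolute scores drift
```

Under this reading, such a judge would be useful for choosing between models but unreliable as a source of absolute quality scores.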
In conclusion, the comprehensive assessment conducted in this paper informs the broader scientific discourse on evaluating LLMs and emphasizes the utility and limitations of employing LLMs as evaluators in both academic and industry settings. Future work will likely further probe these aspects across more complex scenarios, aiming to enhance alignment and address intricacies in automated evaluation tasks.