
JuStRank: Benchmarking LLM Judges for System Ranking (2412.09569v2)

Published 12 Dec 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.

Summary

  • The paper introduces JuStRank as a novel benchmark that evaluates LLM judges’ system-level ranking capabilities, overcoming the limits of instance-level assessments.
  • It quantitatively assesses 48 diverse judges, showing that models with moderate instance-level performance can achieve high ranking correlations, up to a Kendall's Tau of 0.827 with human judgments.
  • The study underscores the importance of identifying judge biases and decisiveness, guiding more reliable AI system evaluations and informed model selection.

Assessment and Benchmarking of LLM Judges for System Ranking

The paper "JuStRank: Benchmarking LLM Judges for System Ranking" presents a comprehensive evaluation approach for benchmarking LLM-based judges. This work primarily focuses on addressing a critical gap in the evaluation ecosystem where traditional instance-level evaluations fail to appropriately capture the capabilities of judges when ranking methodologies demand system-level insights.

Core Contributions and Methodology

The authors propose JuStRank, a benchmark specifically designed to evaluate LLM judges used to rank the systems whose responses they score. The urgency of such evaluations arises from the increasing reliance on LLMs to judge the quality of outputs generated by other AI systems, since manual assessment is both time-consuming and prone to human biases.

JuStRank evaluates a suite of 48 state-of-the-art judges, spanning general-purpose LLMs and dedicated reward models. Each judge is assessed by how closely the system ranking it produces aligns with a ranking derived from human judgments. A key strength of the paper is its systematic judge characterization, quantifying each judge's decisiveness and its bias towards specific systems.
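Concretely, the system-level protocol amounts to aggregating a judge's per-response scores into a single score per system and measuring how well the resulting ordering agrees with a human-derived ranking. The Python sketch below illustrates this idea under simple assumptions (mean aggregation, Kendall's Tau as the agreement measure); the function and data names are illustrative and not taken from the paper.

```python
# Minimal sketch (not the paper's code): aggregate a judge's per-response
# scores into system scores and measure ranking agreement with humans.
from statistics import mean
from scipy.stats import kendalltau

def system_ranking_agreement(judge_scores, human_scores):
    """judge_scores: {system: [per-response judge scores]}
    human_scores: {system: human-derived quality score}
    Returns Kendall's Tau between the judge-induced and human system scores."""
    systems = sorted(judge_scores)                         # fixed system order
    judge_sys = [mean(judge_scores[s]) for s in systems]   # mean aggregation
    human_sys = [human_scores[s] for s in systems]
    tau, _ = kendalltau(judge_sys, human_sys)
    return tau

# Hypothetical toy example with made-up numbers:
judge = {"model_a": [0.9, 0.8, 0.85], "model_b": [0.6, 0.7, 0.65],
         "model_c": [0.4, 0.5, 0.45]}
human = {"model_a": 0.88, "model_b": 0.70, "model_c": 0.55}
print(system_ranking_agreement(judge, human))  # 1.0 when the rankings agree
```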

Key Findings and Results

The paper reveals several crucial findings:

  1. System-Level Evaluations: The research underscores the inadequacy of relying solely on instance-level evaluations to determine the proficiency of LLM judges in system rankings. Even models with mediocre instance-level performance can effectively rank systems when evaluated using system-level methodologies.
  2. Emergent Qualities and Bias: Empirical analysis indicates that some judges exhibit strong biases toward particular systems, undermining their reliability in ranking tasks. In particular, an emergent decisiveness factor, modeled with a cumulative Beta distribution, shows that certain judges amplify quality differences more than humans do, skewing their scores toward extreme evaluations (see the sketch after this list).
  3. Judge Performance: The paper reports high ranking agreement, with the best judges reaching a Kendall's Tau of 0.827 against the human-grounded ranking, showing that some judges approach human-level judgment reliability.
  4. Insights into Realizations and Aggregations: The choice of judge realizations (e.g., Numeric, Likert) affects the ranking outcome significantly, revealing that verbalized scores (like Likert) often lead to better-calibrated evaluations compared to comparative anchors or token probability methods.
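As noted in point 2 above, decisiveness is tied to a cumulative Beta distribution. The sketch below is a loose, assumption-laden illustration of that idea rather than the paper's exact formulation: it models a judge as applying a Beta-CDF transformation to an underlying quality value and treats the steepness of the fitted curve as a decisiveness proxy; the function names and fitting procedure are assumptions.

```python
# Loose illustration (assumptions, not the paper's exact procedure):
# model a judge as applying a Beta-CDF transformation to an underlying
# quality value and read decisiveness off the steepness of the fit.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import beta

def beta_cdf(q, a, b):
    return beta.cdf(q, a, b)

def fit_decisiveness(quality, judge_score):
    """Fit judge_score ~ BetaCDF(quality; a, b); a steeper fitted curve
    (larger peak density) means the judge amplifies quality differences."""
    (a, b), _ = curve_fit(beta_cdf, quality, judge_score,
                          p0=[1.0, 1.0], bounds=(1e-3, 100.0))
    grid = np.linspace(0.01, 0.99, 99)
    return float(beta.pdf(grid, a, b).max())   # max slope of the fitted CDF

# Hypothetical judge that exaggerates mid-range quality differences:
rng = np.random.default_rng(0)
q = rng.uniform(0.0, 1.0, 300)
scores = np.clip(beta.cdf(q, 5, 5) + rng.normal(0, 0.03, q.size), 0, 1)
print(fit_decisiveness(q, scores))   # noticeably above 1 => decisive judge
```

A value near 1 corresponds to a judge whose scores track quality roughly linearly, while larger values indicate sharper separation of weaker from stronger systems than humans would assign.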

Implications and Future Directions

The implications of this research extend into both practical and theoretical realms. From an applied standpoint, selecting an effective LLM judge enables more accurate model comparison, crucial for applications in systems development where model evaluations directly inform strategic decisions. Theoretically, these findings necessitate a reevaluation of current benchmarking strategies, emphasizing system-level evaluation metrics over traditional instance-level assessments.

Future work could explore judges tailored to system-level evaluation, judge ensembles that mitigate individual biases, and alternative aggregation heuristics that may yield more representative system scores. Additionally, extending the examination beyond English to judge behavior across different linguistic and cultural contexts could offer significant insights into universal judge design principles.

In summary, this paper makes a significant contribution by providing an exhaustive benchmark that ties LLM judge scores to a holistic system-ranking paradigm, encouraging the community to revisit and refine AI evaluation frameworks in a systematically structured manner.