Evaluating LLM-Based Judges with JudgeBench
The paper introduces JudgeBench, a benchmark designed to assess how well LLM-based judges can distinguish factually and logically correct responses from incorrect ones. The motivation arises from the increasing adoption of LLM-based judges as scalable alternatives to human evaluation, which is costly and labor-intensive. However, little is known about the reliability of these judges themselves, particularly on complex tasks that demand advanced reasoning.
Key Contributions
The authors present a hierarchical evaluation framework for assessing LLM-based judges that prioritizes factual and logical correctness over stylistic alignment with human preferences. The framework also offers a template for structuring future evaluation datasets around objective assessment of model outputs rather than subjective human biases.
JudgeBench is built on a novel pipeline that converts existing datasets with ground-truth labels into pairs of responses, one objectively correct and one containing subtle errors, on which LLM-based judges can be evaluated. Drawing on difficult tasks from datasets spanning four categories (Knowledge, Reasoning, Mathematics, and Coding) keeps the benchmark challenging.
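A minimal sketch of how such a pair-construction step might look is shown below. The helper names (`generate`, `extract_answer`) and data layout are illustrative assumptions, not the authors' actual implementation:

```python
import random
from dataclasses import dataclass

@dataclass
class ResponsePair:
    question: str
    chosen: str     # response whose final answer matches the ground truth
    rejected: str   # response whose final answer does not

def build_pairs(dataset, generate, extract_answer, n_samples=8):
    """Turn (question, ground_truth) examples into correct/incorrect response pairs.

    `generate` samples a candidate solution from a strong model, and
    `extract_answer` pulls the final answer out of a solution so it can be
    checked against the ground-truth label. Both are assumed helpers here.
    """
    pairs = []
    for question, ground_truth in dataset:
        candidates = [generate(question) for _ in range(n_samples)]
        correct = [c for c in candidates if extract_answer(c) == ground_truth]
        incorrect = [c for c in candidates if extract_answer(c) != ground_truth]
        # Keep only questions where the model produces both kinds of responses;
        # these ambiguous cases are what make the resulting pairs challenging.
        if correct and incorrect:
            pairs.append(ResponsePair(question,
                                      random.choice(correct),
                                      random.choice(incorrect)))
    return pairs
```

The key design choice in this sketch is filtering to questions where the same model produces both correct and incorrect answers, so that judging the pair requires genuine verification rather than surface cues.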
Evaluation Results
The evaluation on JudgeBench yields several insights:
- Performance Gaps: Many state-of-the-art LLM-based judges perform no better than random guessing on JudgeBench tasks. Despite recent advancements, even strong models like GPT-4o barely surpass the random baseline, pointing to an area ripe for methodological improvement (a sketch of the pairwise scoring setup follows this list).
- Fine-Tuned vs. Prompted Judges: Fine-tuned judges, despite being trained specifically for evaluation, often underperform prompted judges. This discrepancy may stem from limitations in the fine-tuning data or from the inherent difficulty of JudgeBench.
- Model Size and Complexity: Larger models tend to perform better, suggesting that scaling model capacity can strengthen a judge's reasoning ability.
- Reward Models' Capability: Reward models trained on preference data demonstrate competitive performance, indicating a potential path to building specialized evaluators from less powerful base models.
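As a rough illustration of the pairwise setup behind these accuracy numbers, the sketch below scores a judge against the 50% random-guess baseline. It reuses the `ResponsePair` fields from the earlier sketch, and the `judge` interface (returning "A" or "B") is an assumption rather than the benchmark's exact protocol:

```python
def evaluate_judge(pairs, judge, swap=True):
    """Score a judge's accuracy at picking the objectively correct response.

    `judge(question, response_a, response_b)` is assumed to return "A" or "B".
    Each pair can be presented in both orders to control for position bias.
    Random guessing scores about 0.5 on this metric.
    """
    correct = total = 0
    for pair in pairs:
        orders = [(pair.chosen, pair.rejected, "A")]
        if swap:
            orders.append((pair.rejected, pair.chosen, "B"))
        for resp_a, resp_b, gold in orders:
            verdict = judge(pair.question, resp_a, resp_b)
            correct += (verdict == gold)
            total += 1
    return correct / total
```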
Implications for Future Research
The difficulty of JudgeBench suggests that improving the reasoning abilities of LLM-based judges will be crucial as AI systems grow more complex. The work also underscores the need for benchmarks that test objective correctness rather than stylistic preferences, so that AI evaluation mechanisms can develop sustainably.
Future research could explore better training datasets and fine-tuning methods that strengthen the logical reasoning of LLM-based judges. Integrating explicit reasoning at inference time, as in models like o1-preview, is another promising direction for advancing LLM-based evaluation.
Conclusion
JudgeBench stands out as a robust platform for objectively evaluating the performance of LLM-based judges under challenging conditions that mirror real-world complexities. The benchmark's emphasis on logical correctness over subjective preferences offers a clear path forward in the development and assessment of AI evaluation models. Researchers are encouraged to leverage JudgeBench to foster advancements in the reasoning capabilities of automated judges, paving the way for more reliable and effective AI systems.