Evaluating the Validity of LLMs as Judges in NLP: A Detailed Analysis
The paper "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks" by Anna Bavaresco, Raffaella Bernardi, and their co-authors provides a comprehensive empirical paper on the feasibility of using LLMs for evaluating NLP tasks traditionally judged by humans. This paper is critical given the growing trend of employing LLMs in place of human judges, which raises questions about the validity and reproducibility of such assessments.
Overview of the Research and Methodology
The authors introduce Judge-Bench, a new benchmarking collection encompassing 20 diverse NLP datasets with human annotations. The paper evaluates 11 contemporary LLMs—including both open-weight and proprietary models—across these datasets to measure how well LLM-generated judgments align with human annotations.
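To make this kind of alignment measurement concrete, the sketch below shows how agreement between a model's judgments and human annotations might be quantified on a single dataset. The metric choices (Spearman correlation for graded ratings, Cohen's kappa for categorical labels), the function names, and the toy data are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch of judgment-alignment measurement.
# Metric choices and data format are illustrative assumptions,
# not the Judge-Bench implementation.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score


def alignment_score(human_judgments, llm_judgments, graded=True):
    """Compare one model's judgments with human annotations on one dataset."""
    if graded:
        # Graded judgments (e.g., 1-5 quality ratings): rank correlation.
        corr, _ = spearmanr(human_judgments, llm_judgments)
        return corr
    # Categorical judgments (e.g., "toxic" / "not toxic"): chance-corrected agreement.
    return cohen_kappa_score(human_judgments, llm_judgments)


# Hypothetical usage: one such comparison per (model, dataset) pair.
human = [4, 2, 5, 3, 1]
model = [5, 2, 4, 3, 2]
print(f"Spearman correlation with human ratings: {alignment_score(human, model):.2f}")
```

Repeating this comparison for every model-dataset pair yields the per-dataset scores whose spread underlies the variability findings discussed next.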
Key Findings
One of the paper's central findings is the significant variance in LLM performance across datasets, which indicates that these models are not yet ready to systematically replace human judges in NLP evaluation. Specific observations can be summarized as follows:
- Variability Across Models and Tasks: Every LLM showed inconsistent performance across datasets. For instance, proprietary models such as GPT-4o correlated highly with human judgments on some tasks but performed poorly on others.
- Open versus Proprietary Models: The paper reveals a narrowing gap between open-weight and proprietary models, with Llama3-70B emerging as a close second to GPT-4o. This is a promising sign for reproducible evaluation with open models.
- Performance on Different Annotation Types: The LLMs aligned better with human judgments when assessing human-generated language than when assessing machine-generated text.
Implications and Recommendations
The results have significant implications for both practical applications and the theoretical understanding of NLP model evaluation. The authors recommend caution when using LLMs to replace human judges, given the variability in performance and the potential for misleading conclusions. They also highlight issues of data leakage and reproducibility, especially with proprietary models, and call for greater transparency and standardization in future evaluations.
Future Directions
Future research in this domain could explore:
- Refining Prompts: Further studies could investigate how different prompt engineering strategies affect LLM judging performance (a minimal illustration follows this list).
- Multi-Lingual Evaluation: While this paper focused on English, extending Judge-Bench to include other languages could provide more comprehensive insights.
- Mitigating Biases: Additional work is needed to understand and mitigate the biases LLMs might introduce in fine-grained tasks such as toxicity evaluation.
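As a small illustration of the prompt-refinement direction above, the sketch below instantiates alternative judge prompts for the same item, so that each variant's judgments could then be compared against human annotations. The template wording and scoring scale are hypothetical and are not taken from the paper or the Judge-Bench codebase.

```python
# Illustrative sketch of prompt-variation experiments for an LLM judge.
# Templates and the 1-5 scale are assumed for illustration only.
PROMPT_TEMPLATES = {
    "bare": (
        "Rate the fluency of the following text on a scale from 1 to 5.\n"
        "Text: {text}\nScore:"
    ),
    "with_guidelines": (
        "You are an expert annotator. Rate the fluency of the text below on a "
        "scale from 1 (incomprehensible) to 5 (perfectly fluent), following the "
        "same guidelines given to human annotators.\nText: {text}\nScore:"
    ),
}


def build_judge_prompts(text: str) -> dict[str, str]:
    """Instantiate every template for one item, so each prompt variant can be
    sent to the same model and its judgments compared with human annotations."""
    return {name: template.format(text=text) for name, template in PROMPT_TEMPLATES.items()}


for name, prompt in build_judge_prompts("The cat sat on the mat.").items():
    print(f"--- {name} ---\n{prompt}\n")
```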
Conclusion
The paper rigorously examines the current capabilities of LLMs to serve as judges in various NLP tasks. While LLMs demonstrate potential, their inconsistent performance underscores the necessity of continued reliance on human judgment in many cases. The authors contribute valuable tools and methodologies, positioning Judge-Bench as a living benchmark for future research.
The release of Judge-Bench, along with its accompanying codebase, promises to facilitate ongoing and future research in this space, encouraging transparency and reproducibility in NLP model evaluations.