
LLMEval: A Preliminary Study on How to Evaluate Large Language Models (2312.07398v2)

Published 12 Dec 2023 in cs.AI and cs.CL

Abstract: Recently, the evaluation of LLMs has emerged as a popular area of research. The three crucial questions for LLM evaluation are ``what, where, and how to evaluate''. However, the existing research mainly focuses on the first two questions, which are basically what tasks to give the LLM during testing and what kind of knowledge it should deal with. As for the third question, which is about what standards to use, the types of evaluators, how to score, and how to rank, there hasn't been much discussion. In this paper, we analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourced, and public annotators as well as GPT-4, with different scoring methods and ranking systems. We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results. We perform comparisons and analyses of different settings and draw 10 conclusions that can provide some insights for evaluating LLMs in the future. The dataset and the results are publicly available at https://github.com/LLMeval .

Introduction

The assessment of LLMs is a crucial endeavor, as it shapes further development and understanding of AI capabilities. Current research in this domain has devoted considerable attention to the tasks LLMs should undertake and the domains they should encompass, yet the methodologies for evaluation have been comparatively underexplored. This paper aims to address this gap by thoroughly examining evaluation methods, encompassing criteria selection, annotation methods, and ranking systems. With an expansive study involving 20 LLMs, over 2,000 annotators, and a combination of manual and automatic evaluations generating substantial data, this research provides a framework and actionable insights into how best to evaluate LLMs.

Studies and Methodology

Diverse methodologies across previous works have established varied approaches for LLM evaluation. In contrast to the automated evaluations championed by benchmarks such as HELM and MMLU, this research underscores the significance of manual evaluation, drawing from a comprehensive pool of annotators. The paper spotlights both the strengths and limitations of manual evaluation compared to automatic approaches, ranging from similarity metrics such as BERTScore to GPT-4-based judging, and discusses the nuances introduced by public versus onsite annotators.

The aptly named LLMEval dataset is the cornerstone of this comparison, delivering a novel resource for evaluating a multitude of LLMs across a broad spectrum of tasks, with the promise of unbiased results thanks to a double-blind testing approach. Furthermore, by comparatively analyzing different scoring criteria, from informativeness to fluency, this paper carves out a nuanced picture of what truly distinguishes LLMs in performance assessments.
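
To make the multi-criteria setup concrete, the minimal Python sketch below shows one way annotations from different evaluator types could be collected and averaged per model. The criterion names, score scale, and data layout here are illustrative assumptions, not the paper's actual annotation pipeline.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical criterion names and a fixed score scale, for illustration only;
# the paper's exact rubric and scale may differ.
CRITERIA = ["accuracy", "informativeness", "fluency", "logical_coherence", "harmlessness"]

@dataclass
class Annotation:
    model: str            # which LLM produced the response
    annotator_type: str   # e.g. "onsite", "crowd", "public", or "gpt-4"
    scores: dict          # criterion name -> score on the fixed scale

def aggregate_scores(annotations, criteria=tuple(CRITERIA)):
    """Average each model's score per criterion over all of its annotations."""
    per_model = {}
    for ann in annotations:
        per_model.setdefault(ann.model, []).append(ann.scores)
    return {
        model: {c: mean(s[c] for s in score_dicts) for c in criteria}
        for model, score_dicts in per_model.items()
    }

# Example: two annotations for one model, averaged per criterion.
anns = [
    Annotation("model_x", "onsite", {c: 3 for c in CRITERIA}),
    Annotation("model_x", "gpt-4",  {c: 2 for c in CRITERIA}),
]
print(aggregate_scores(anns))  # {'model_x': {'accuracy': 2.5, ...}}
```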

Results and Discussion

The paper reports that onsite evaluations outperform other manual formats in accuracy, suggesting this as the preferable route for future studies. Intriguingly, the findings also indicate that automated evaluations have considerable merit, particularly for handling large volumes of tasks, yet the divide between automated and manual methods is most evident on subjective criteria. As expected, public annotations are less reliable due to their lower accuracy and consistency, reinforcing the value of expert involvement in the evaluative process.
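
As a rough illustration of how such a reliability comparison can be computed, the sketch below scores each annotator group by its agreement with a reference label set (for instance, an expert or onsite consensus). The data layout and labels are hypothetical and stand in for the paper's own protocol.

```python
from collections import defaultdict

def group_accuracy(judgements, reference):
    """judgements: iterable of (annotator_group, item_id, label) triples.
    reference: item_id -> reference label (e.g. expert/onsite consensus).
    Returns, per annotator group, the fraction of labels matching the reference."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, item_id, label in judgements:
        if item_id in reference:
            totals[group] += 1
            hits[group] += int(label == reference[item_id])
    return {g: hits[g] / totals[g] for g in totals}

# Example with made-up labels: the public group agrees with the reference less often.
ref = {1: "A", 2: "B", 3: "A"}
judgements = [
    ("onsite", 1, "A"), ("onsite", 2, "B"), ("onsite", 3, "A"),
    ("public", 1, "A"), ("public", 2, "A"), ("public", 3, "B"),
]
print(group_accuracy(judgements, ref))  # {'onsite': 1.0, 'public': 0.333...}
```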

Pertaining to ranking systems, the paper critically analyzes the Elo rating system versus a points-based scoring system, with important implications for using these systems in LLM evaluations. The Elo system, despite its repute in chess, exhibits fluctuations that undermine its stability for LLM evaluation, a key insight for researchers deploying competitive pairwise comparisons.
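
The sketch below contrasts the two ranking schemes on the same set of pairwise outcomes. The K-factor, initial rating, and match data are illustrative assumptions rather than the paper's settings, but it shows why an Elo ranking can fluctuate with the order in which comparisons arrive while a points tally does not.

```python
import random

def elo_update(r_a, r_b, score_a, k=32):
    """Update Elo ratings for models A and B after one pairwise comparison.
    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def rank_by_elo(comparisons, models, k=32):
    """comparisons: list of (model_a, model_b, score_a), processed in arrival order."""
    ratings = {m: 1000.0 for m in models}
    for a, b, score_a in comparisons:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

def rank_by_points(comparisons, models):
    """Points system: 1 point per win, 0.5 per tie; match order is irrelevant."""
    points = {m: 0.0 for m in models}
    for a, b, score_a in comparisons:
        points[a] += score_a
        points[b] += 1.0 - score_a
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# A cyclic set of outcomes (A beats B, B beats C, C beats A), repeated:
# reshuffling the same results can change the Elo ordering, while the
# points totals stay identical.
models = ["model_a", "model_b", "model_c"]
matches = [("model_a", "model_b", 1.0), ("model_b", "model_c", 1.0),
           ("model_c", "model_a", 1.0)] * 10
print(rank_by_elo(matches, models))
random.shuffle(matches)
print(rank_by_elo(matches, models))     # may differ from the first Elo ranking
print(rank_by_points(matches, models))  # unchanged by the shuffle
```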

Concluding Remarks

The paper's contributions are manifold. Importantly, it catalyzes the ongoing discourse on the 'how' of LLM evaluation, a conversation far from mature in the AI community. Additionally, LLMEval offers a repository that's invaluable for future research, allowing for variation in evaluation approaches. Coupled with ten succinctly presented conclusions stemming from the elaborate comparison between different evaluation formats, this paper advances the scholarly community’s tools for dissecting and understanding the performance of LLMs.

It also charts a path forward for subsequent evaluations, emphasizing informativeness and accuracy as the criteria that best differentiate models. Recognizing the complementary nature of automated and manual evaluations, the paper argues for their concurrent use, especially since automated methods align reasonably well with human judgment. As AI systems edge closer to human-like cognition, these insights are imperative, ensuring that evaluative processes evolve in tandem with technological developments.

References (17)
  1. A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861.
  2. A Survey on Evaluation of Large Language Models.
  3. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387.
  4. GPTScore: Evaluate as You Desire. arXiv:2302.04166.
  5. Measuring Massive Multitask Language Understanding. arXiv:2009.03300.
  6. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv:2305.08322.
  7. From Word Embeddings To Document Distances. In International Conference on Machine Learning.
  8. Holistic Evaluation of Language Models. arXiv:2211.09110.
  9. ROUGE: A Package for Automatic Evaluation of Summaries. In Annual Meeting of the Association for Computational Linguistics.
  10. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634.
  11. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
  12. Is ChatGPT a Good NLG Evaluator? A Preliminary Study. arXiv:2303.04048.
  13. Large Language Models are not Fair Evaluators. arXiv:2305.17926.
  14. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675.
  15. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. arXiv:1909.02622.
  16. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
  17. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364.
Authors (8)
  1. Yue Zhang (618 papers)
  2. Ming Zhang (313 papers)
  3. Haipeng Yuan (1 paper)
  4. Shichun Liu (8 papers)
  5. Yongyao Shi (1 paper)
  6. Tao Gui (127 papers)
  7. Qi Zhang (784 papers)
  8. Xuanjing Huang (287 papers)