
LLMEval: A Preliminary Study on How to Evaluate Large Language Models (2312.07398v2)

Published 12 Dec 2023 in cs.AI and cs.CL

Abstract: Recently, the evaluation of LLMs has emerged as a popular area of research. The three crucial questions for LLM evaluation are ``what, where, and how to evaluate''. However, the existing research mainly focuses on the first two questions, which are basically what tasks to give the LLM during testing and what kind of knowledge it should deal with. As for the third question, which is about what standards to use, the types of evaluators, how to score, and how to rank, there hasn't been much discussion. In this paper, we analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourced, and public annotators as well as GPT-4, with different scoring methods and ranking systems. We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results. We perform comparisons and analyses of different settings and draw 10 conclusions that can provide some insights for evaluating LLMs in the future. The dataset and the results are publicly available at https://github.com/LLMeval .

Introduction

The assessment of LLMs is a crucial endeavor, as it shapes further development and understanding of AI capabilities. Current research in this domain has devoted considerable attention to the tasks LLMs should undertake and the domains they should encompass, yet the methodologies for evaluation have been comparatively underexplored. This paper aims to address this gap by thoroughly examining evaluation methods, encompassing criteria selection, annotation methods, and ranking systems. With an expansive study involving 20 LLMs, over 2,000 annotators, and a combination of manual and automatic evaluations generating substantial data, this research provides a framework and actionable insights into how best to evaluate LLMs.

Studies and Methodology

Diverse methodologies across previous works have established varied approaches for LLM evaluation. In contrast to the automated evaluations championed by benchmarks such as HELM and MMLU, this research underscores the significance of manual evaluation, drawing from a comprehensive pool of annotators. The paper spotlights both the strengths and limitations of manual evaluation compared to automatic approaches, ranging from similarity metrics such as BERTScore to GPT-4-based judging, and discusses the nuances introduced by public versus onsite annotators.

The aptly named LLMEval dataset is the cornerstone of this comparison, delivering a novel resource for evaluating a multitude of LLMs across a broad spectrum of tasks, with the promise of unbiased results thanks to a double-blind testing approach. Furthermore, by comparatively analyzing different scoring criteria, from informativeness to fluency, this paper carves out a nuanced picture of what truly distinguishes LLMs in performance assessments.
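
To make the multi-criteria setup concrete, the minimal Python sketch below shows one way annotations from different evaluator types could be collected and averaged per model. The criterion names, score scale, and data layout here are illustrative assumptions, not the paper's actual annotation pipeline.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical criterion names and a fixed score scale, for illustration only;
# the paper's exact rubric and scale may differ.
CRITERIA = ["accuracy", "informativeness", "fluency", "logical_coherence", "harmlessness"]

@dataclass
class Annotation:
    model: str            # which LLM produced the response
    annotator_type: str   # e.g. "onsite", "crowd", "public", or "gpt-4"
    scores: dict          # criterion name -> score on the fixed scale

def aggregate_scores(annotations, criteria=tuple(CRITERIA)):
    """Average each model's score per criterion over all of its annotations."""
    per_model = {}
    for ann in annotations:
        per_model.setdefault(ann.model, []).append(ann.scores)
    return {
        model: {c: mean(s[c] for s in score_dicts) for c in criteria}
        for model, score_dicts in per_model.items()
    }

# Example: two annotations for one model, averaged per criterion.
anns = [
    Annotation("model_x", "onsite", {c: 3 for c in CRITERIA}),
    Annotation("model_x", "gpt-4",  {c: 2 for c in CRITERIA}),
]
print(aggregate_scores(anns))  # {'model_x': {'accuracy': 2.5, ...}}
```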

Results and Discussion

The paper reports that onsite evaluations outperform other manual formats in accuracy, suggesting this as the preferable route for future studies. Intriguingly, the findings also indicate that automated evaluations have considerable merit, particularly for handling large volumes of tasks, yet the divide between automated and manual methods is most evident on subjective criteria. As expected, public annotations are less reliable due to their lower accuracy and consistency, reinforcing the value of expert involvement in the evaluative process.
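
As a rough illustration of how such a reliability comparison can be computed, the sketch below scores each annotator group by its agreement with a reference label set (for instance, an expert or onsite consensus). The data layout and labels are hypothetical and stand in for the paper's own protocol.

```python
from collections import defaultdict

def group_accuracy(judgements, reference):
    """judgements: iterable of (annotator_group, item_id, label) triples.
    reference: item_id -> reference label (e.g. expert/onsite consensus).
    Returns, per annotator group, the fraction of labels matching the reference."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, item_id, label in judgements:
        if item_id in reference:
            totals[group] += 1
            hits[group] += int(label == reference[item_id])
    return {g: hits[g] / totals[g] for g in totals}

# Example with made-up labels: the public group agrees with the reference less often.
ref = {1: "A", 2: "B", 3: "A"}
judgements = [
    ("onsite", 1, "A"), ("onsite", 2, "B"), ("onsite", 3, "A"),
    ("public", 1, "A"), ("public", 2, "A"), ("public", 3, "B"),
]
print(group_accuracy(judgements, ref))  # {'onsite': 1.0, 'public': 0.333...}
```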

Pertaining to ranking systems, the paper critically analyzes the Elo rating system versus a points-based scoring system, with important implications for using these systems in LLM evaluations. The Elo system, despite its repute in chess, exhibits fluctuations that undermine its stability for LLM evaluation, a key insight for researchers deploying competitive pairwise comparisons.
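
The sketch below contrasts the two ranking schemes on the same set of pairwise outcomes. The K-factor, initial rating, and match data are illustrative assumptions rather than the paper's settings, but it shows why an Elo ranking can fluctuate with the order in which comparisons arrive while a points tally does not.

```python
import random

def elo_update(r_a, r_b, score_a, k=32):
    """Update Elo ratings for models A and B after one pairwise comparison.
    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def rank_by_elo(comparisons, models, k=32):
    """comparisons: list of (model_a, model_b, score_a), processed in arrival order."""
    ratings = {m: 1000.0 for m in models}
    for a, b, score_a in comparisons:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

def rank_by_points(comparisons, models):
    """Points system: 1 point per win, 0.5 per tie; match order is irrelevant."""
    points = {m: 0.0 for m in models}
    for a, b, score_a in comparisons:
        points[a] += score_a
        points[b] += 1.0 - score_a
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# A cyclic set of outcomes (A beats B, B beats C, C beats A), repeated:
# reshuffling the same results can change the Elo ordering, while the
# points totals stay identical.
models = ["model_a", "model_b", "model_c"]
matches = [("model_a", "model_b", 1.0), ("model_b", "model_c", 1.0),
           ("model_c", "model_a", 1.0)] * 10
print(rank_by_elo(matches, models))
random.shuffle(matches)
print(rank_by_elo(matches, models))     # may differ from the first Elo ranking
print(rank_by_points(matches, models))  # unchanged by the shuffle
```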

Concluding Remarks

The paper's contributions are manifold. Importantly, it catalyzes the ongoing discourse on the 'how' of LLM evaluation, a conversation far from mature in the AI community. Additionally, LLMEval offers a repository that's invaluable for future research, allowing for variation in evaluation approaches. Coupled with ten succinctly presented conclusions stemming from the elaborate comparison between different evaluation formats, this paper advances the scholarly community’s tools for dissecting and understanding the performance of LLMs.

It also charts a path forward for subsequent evaluations, emphasizing informativeness and accuracy as the criteria that best differentiate models. Recognizing the complementary nature of automated and manual evaluations, the paper argues for their concurrent use, especially since automated methods align reasonably well with human judgment. As AI systems edge closer to human-like cognition, these insights are imperative, ensuring that evaluative processes evolve in tandem with technological developments.

References (17)
  1. A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861.
  2. A Survey on Evaluation of Large Language Models.
  3. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387.
  4. GPTScore: Evaluate as You Desire. arXiv:2302.04166.
  5. Measuring Massive Multitask Language Understanding. arXiv:2009.03300.
  6. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv:2305.08322.
  7. From Word Embeddings To Document Distances. In International Conference on Machine Learning.
  8. Holistic Evaluation of Language Models. arXiv:2211.09110.
  9. ROUGE: A Package for Automatic Evaluation of Summaries. In Annual Meeting of the Association for Computational Linguistics.
  10. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634.
  11. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
  12. Is ChatGPT a Good NLG Evaluator? A Preliminary Study. arXiv:2303.04048.
  13. Large Language Models are not Fair Evaluators. arXiv:2305.17926.
  14. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675.
  15. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. arXiv:1909.02622.
  16. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
  17. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364.
Authors (8)
  1. Yue Zhang (618 papers)
  2. Ming Zhang (313 papers)
  3. Haipeng Yuan (1 paper)
  4. Shichun Liu (8 papers)
  5. Yongyao Shi (1 paper)
  6. Tao Gui (127 papers)
  7. Qi Zhang (784 papers)
  8. Xuanjing Huang (287 papers)