
LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Published 12 Dec 2023 in cs.AI and cs.CL | arXiv:2312.07398v2

Abstract: Recently, the evaluation of LLMs has emerged as a popular area of research. The three crucial questions for LLM evaluation are "what, where, and how to evaluate". However, existing research focuses mainly on the first two questions, namely what tasks to give the LLM during testing and what kind of knowledge it should deal with. The third question, which concerns what standards to use, the types of evaluators, how to score, and how to rank, has received far less discussion. In this paper, we analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourcing, and public annotators as well as GPT-4, with different scoring methods and ranking systems. We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. A total of 2,186 individuals participated, producing 243,337 manual annotations and 57,511 automatic evaluation results. We compare and analyze the different settings and draw 10 conclusions that can provide insights for evaluating LLMs in the future. The dataset and the results are publicly available at https://github.com/llmeval.


Summary

  • The paper introduces LLMEval, a novel dataset and evaluation methodology that goes beyond traditional metrics to assess large language models.
  • It employs both star scoring and pairwise comparison methods to evaluate criteria including accuracy, fluency, informativeness, logical coherence, and harmlessness.
  • Results indicate that onsite star scoring and conversation tasks best differentiate model capabilities, offering a pathway for future LLM assessments.

Evaluating LLMs: Insights from LLMEval

This essay provides a detailed exploration of the paper "LLMEval: A Preliminary Study on How to Evaluate Large Language Models", focusing on the evaluation methodologies deployed for LLMs. It addresses the fundamental question of how to assess LLM performance beyond traditional metrics, proposing novel datasets and evaluation frameworks.

Introduction to LLM Evaluation

The paper acknowledges the limitations of standard metrics like BLEU and ROUGE for evaluating LLMs, highlighting the need for evaluations that consider broader aspects of LLM capabilities such as accuracy, fluency, informativeness, logical coherence, and harmlessness. It introduces LLMEval, a dataset developed to test these criteria, and discusses the role of manual and automated evaluations through various annotators and systems, including GPT-4.

Evaluation Criteria and Tasks

Criteria Selection

The study refines the evaluation criteria into five key areas:

  • Accuracy: Focused on the factual correctness of responses.
  • Fluency: Evaluating adherence to natural language norms.
  • Informativeness: Assessing the usefulness of the information.
  • Logical Coherence: Ensuring logical consistency in responses.
  • Harmlessness: Avoiding unethical or harmful content.

Figure 1: Scoring of Different Criteria in LLMEval-1. Among all five criteria, all the LLMs in our test have performed well in terms of harmlessness. The most distinguishing criteria are accuracy and informativeness.
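Aggregating such per-criterion star scores into the kind of summary shown in Figure 1 amounts to averaging annotations per (model, criterion) pair. The sketch below illustrates this; the model names and scores are hypothetical, not the paper's data:

```python
from statistics import mean

# The paper's five criteria; annotators assign 1-3 stars per criterion.
CRITERIA = ["accuracy", "fluency", "informativeness",
            "logical_coherence", "harmlessness"]

def average_scores(annotations):
    """Average star scores per (model, criterion) pair.

    annotations: iterable of (model, criterion, stars) with stars in {1, 2, 3}.
    """
    buckets = {}
    for model, criterion, stars in annotations:
        buckets.setdefault((model, criterion), []).append(stars)
    return {key: mean(vals) for key, vals in buckets.items()}

# Hypothetical annotations for two models.
annotations = [
    ("model_a", "accuracy", 3), ("model_a", "accuracy", 2),
    ("model_a", "harmlessness", 3),
    ("model_b", "accuracy", 1), ("model_b", "harmlessness", 3),
]
scores = average_scores(annotations)
print(scores[("model_a", "accuracy")])  # 2.5
```

One design point this makes concrete: because each criterion is scored separately, a model can score perfectly on harmlessness (as nearly all models do in Figure 1) while still being separated from its peers on accuracy and informativeness.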

Task Distribution

The tasks for evaluation span 17 types, including math solving, code generation, and conversation. LLMEval-1 primarily addresses general tasks, while LLMEval-2 explores specialized domains like computer science and medicine.

Figure 2: Scoring of Different Tasks in LLMEval-1. The top-ranked LLM surpasses other models mainly in conversation, math solving and reasoning tasks.

Annotation Methods

The paper employs both automatic (GPT-4) and manual evaluations with different annotator types, including onsite and crowd-sourced evaluations.

Star Scoring vs. Pairwise Comparison

Two main scoring methods are utilized:

  • Star Scoring: Annotators provide scores from one to three stars for each criterion.
  • Pairwise Comparison: Annotators compare pairs of LLM responses and judge which one is better.

Figure 3: Onsite annotators exhibit the best quality in terms of accuracy and consistency, higher than crowd-sourcing and public pairwise comparison evaluation.
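One common way to turn pairwise judgments into a per-model summary is a simple win rate, with ties counted as half a win for each side. The sketch below is a minimal illustration; the model names and outcomes are hypothetical:

```python
from collections import defaultdict

def win_rates(comparisons):
    """Aggregate pairwise judgments into per-model win rates.

    comparisons: iterable of (model_a, model_b, outcome), where
    outcome is "a", "b", or "tie" (ties count as half a win each).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b, outcome in comparisons:
        games[a] += 1
        games[b] += 1
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / games[m] for m in games}

# Hypothetical annotations comparing two models on three prompts.
annotations = [("model_a", "model_b", "a"),
               ("model_a", "model_b", "tie"),
               ("model_b", "model_a", "b")]
rates = win_rates(annotations)
print(rates)  # model_a wins 2.5 of its 3 comparisons
```

Unlike star scoring, this only yields a relative ordering: a win rate says which model annotators preferred, not how good either response was on any individual criterion.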

Manual vs. Automated Evaluations

The study finds onsite star scoring to exhibit the highest accuracy and consistency, while automated evaluations align closely with manual results in certain settings, particularly with star scoring.

Figure 4: Annotators tend to give higher scores when answer hints are not provided.
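One simple way to quantify how closely automatic scores track manual ones is a correlation over per-model mean scores. The sketch below uses a hand-rolled Pearson correlation with hypothetical numbers, not the paper's actual data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model mean star scores (not the paper's data):
human_scores = [2.8, 2.1, 1.5, 2.9, 2.4]   # onsite annotators
gpt4_scores  = [2.7, 2.0, 1.8, 3.0, 2.3]   # automatic evaluation
r = pearson(human_scores, gpt4_scores)
print(round(r, 3))  # values near 1.0 indicate strong agreement
```

A rank correlation such as Spearman's would be a natural alternative when only the induced leaderboard order matters rather than the score values themselves.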

Ranking Systems and Stability

The paper investigates two ranking systems for pairwise comparisons: the Elo rating system and the Points scoring system.

Challenges with the Elo Rating System

The Elo system, widely used in competitive scoring, showed instability in large-scale annotations due to sequence dependence and sensitivity to noise.

Figure 5: The fluctuation of the Elo ratings remains immense even after 100,000 rounds of pairwise comparison.


Figure 6: In the Elo rating system, the same annotations can lead to changes in rank and score due to different orders.
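The order sensitivity shown in Figure 6 can be reproduced with the standard Elo update rule: feeding the same two judgments in a different order leaves a simple points tally unchanged but shifts the Elo ratings, because each update depends on the ratings at that moment. A minimal sketch (model names are hypothetical):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update; score_a is 1.0 (A wins), 0.5 (tie), 0.0 (A loses)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta  # zero-sum update

def rate(comparisons, models, k=32.0):
    """Run Elo and a simple points tally over a sequence of comparisons."""
    elo = {m: 1000.0 for m in models}
    points = {m: 0.0 for m in models}
    for a, b, score_a in comparisons:
        elo[a], elo[b] = elo_update(elo[a], elo[b], score_a, k)
        points[a] += score_a
        points[b] += 1.0 - score_a
    return elo, points

models = ["A", "B", "C"]
seq1 = [("A", "B", 1.0), ("B", "C", 1.0)]   # A beats B first, then B beats C
seq2 = [("B", "C", 1.0), ("A", "B", 1.0)]   # the same judgments, reversed
elo1, pts1 = rate(seq1, models)
elo2, pts2 = rate(seq2, models)
print(pts1 == pts2)          # True: points totals are order-independent
print(elo1["A"], elo2["A"])  # differ: Elo depends on processing order
```

In the second ordering, B's rating has already risen by the time A beats it, so A is credited for beating a "stronger" opponent and ends up with a higher rating from identical annotations. This is the instability the paper observes at scale, and the reason a points-style tally is more stable, at the cost of ignoring opponent strength.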

Results and Implications

Key Findings

  1. Informativeness and accuracy are the most differentiating criteria.
  2. Onsite star scoring is the preferred evaluation method.
  3. Conversation tasks best distinguish model capabilities.
  4. Automated evaluations hold potential but need refinement, particularly in handling subjective questions.

Practical Implications

The insights from LLMEval could inform the development of more robust LLM assessment frameworks, especially in distinguishing capabilities across different tasks and domains.

Conclusion

The paper's comprehensive evaluation approach suggests several improvements in how LLMs are assessed, emphasizing the importance of varied criteria and robust scoring methodologies. Future research may build upon LLMEval's findings to develop even more nuanced evaluation techniques, enhancing our understanding of LLM capabilities and limitations.
