Overview of LLM Evaluation Frameworks
The paper, "Evaluating LLMs: A Comprehensive Survey," offers a detailed exploration into the evaluation systems and methodologies for LLMs. This survey categorizes evaluations into three primary groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. It aims to provide researchers with structured insights into the challenges and performance of LLMs across various specialized domains.
Knowledge and Capability Evaluation
The paper discusses the significance of evaluating LLMs' knowledge and reasoning capabilities, covering methods for assessing question answering, knowledge completion, and reasoning skills. Datasets such as SQuAD and benchmarks such as MMLU are used to probe these capabilities, and the survey highlights the importance of dynamic, comprehensive evaluations. It also stresses the need to evaluate tool learning and manipulation, illustrating this with benchmarks like API-Bank that assess how effectively models invoke external tools to complete tasks.
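To make the mechanics concrete, here is a minimal sketch of an MMLU-style multiple-choice evaluation loop. The `query_model` hook is a hypothetical stand-in for whatever inference API is under test; none of these names come from the survey itself.

```python
# Minimal sketch of an MMLU-style multiple-choice evaluation loop.
# `query_model` is a hypothetical stand-in for the inference API under
# test: it takes a prompt string and returns the model's text reply.

def format_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its answer options as one prompt."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter:"

def evaluate(dataset, query_model) -> float:
    """Accuracy over records of question, choices, and answer letter."""
    correct = 0
    for item in dataset:
        reply = query_model(format_prompt(item["question"], item["choices"]))
        # Score only the first letter; real harnesses parse more carefully.
        if reply.strip().upper()[:1] == item["answer"]:
            correct += 1
    return correct / len(dataset)

if __name__ == "__main__":
    toy = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"],
            "answer": "B"}]
    print(evaluate(toy, lambda prompt: "B"))  # 1.0 with this stub model
```

Real harnesses layer answer-extraction heuristics and few-shot prompting on top of this loop; the sketch shows only the core accuracy computation.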
Alignment Evaluation
The discussion of alignment evaluation focuses on ensuring that LLMs produce outputs consistent with ethical and moral standards, including evaluations of bias, toxicity, and truthfulness. The survey describes the datasets and metrics used to assess these factors, such as Social Chemistry 101 and RealToxicityPrompts. It emphasizes the necessity of refining LLMs to minimize societal biases and misinformation, reinforcing the models' alignment with human values through rigorous testing.
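As a rough illustration of how such an assessment can be automated, the sketch below scores model continuations in the spirit of RealToxicityPrompts. The `generate` and `score_toxicity` hooks are hypothetical placeholders, not part of the benchmark; in practice the scorer might be Perspective API or a local classifier such as Detoxify.

```python
# Sketch of a RealToxicityPrompts-style evaluation: sample several
# continuations per prompt, score each for toxicity, and report the mean
# score and the average per-prompt maximum ("expected maximum toxicity").
# `generate` and `score_toxicity` are hypothetical hooks supplied by the
# caller, not functions from any particular library.

def evaluate_toxicity(prompts, generate, score_toxicity, samples=5):
    """Return (mean toxicity, expected max toxicity) across prompts."""
    means, maxes = [], []
    for prompt in prompts:
        scores = [score_toxicity(generate(prompt)) for _ in range(samples)]
        means.append(sum(scores) / len(scores))
        maxes.append(max(scores))
    return sum(means) / len(means), sum(maxes) / len(maxes)

if __name__ == "__main__":
    # Stub model and scorer so the sketch runs end to end.
    stats = evaluate_toxicity(
        prompts=["The weather today is"],
        generate=lambda p: p + " pleasant and mild.",
        score_toxicity=lambda text: 0.02,
    )
    print(stats)  # (0.02, 0.02)
```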
Safety Evaluation
The survey divides safety evaluation into robustness evaluation and risk evaluation. Robustness evaluation examines how LLMs handle adversarial inputs and unexpected scenarios, using tools such as PromptBench. Risk evaluation focuses on the potential for harmful behaviors, such as power-seeking, drawing on agent-oriented benchmarks like AgentBench. The goal is to develop systems that resist adversarial manipulation and avoid unintended harmful outputs.
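A simple way to picture robustness evaluation is to perturb a task instruction slightly and measure how far accuracy falls. The sketch below uses random character swaps as a stand-in for the far more systematic character-, word-, and sentence-level attacks in tools like PromptBench; `run_task` is a hypothetical hook returning task accuracy for a given instruction.

```python
import random

# Sketch of a robustness probe: lightly perturb a task instruction and
# measure the accuracy drop against the clean instruction. `run_task` is
# a hypothetical hook that returns task accuracy for a given instruction.

def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap a small fraction of adjacent characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(instruction: str, run_task) -> float:
    """Accuracy lost between the clean and worst perturbed instruction."""
    clean = run_task(instruction)
    attacked = min(run_task(perturb(instruction, seed=s)) for s in range(5))
    return clean - attacked

if __name__ == "__main__":
    gap = robustness_gap("Classify the sentiment of the sentence.",
                         run_task=lambda instr: 0.9)  # stub task runner
    print(gap)  # 0.0 with this stub
```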
Specialized Domain Applications
The paper extends the evaluation discourse to specialized domains, including medicine, finance, and education, emphasizing application-specific challenges and benchmarks. In the medical field, for instance, LLMs are tested against standardized exams such as the USMLE to gauge their reliability for clinical decision support.
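Mechanically, such exam-based evaluation reduces to scoring multiple-choice accuracy (as in the earlier sketch) and comparing it to a pass mark. The threshold below is a hypothetical stand-in; real licensing exams such as the USMLE define their own scaled scoring.

```python
# Sketch of a domain-specific gate: combine a generic accuracy metric
# (like the multiple-choice loop above) with a pass mark. The 0.60
# threshold is a hypothetical placeholder, not the USMLE's actual
# scoring rules.

def passes_exam(accuracy: float, pass_mark: float = 0.60) -> bool:
    """Report whether exam accuracy clears the assumed pass mark."""
    return accuracy >= pass_mark

if __name__ == "__main__":
    print(passes_exam(0.72))  # True under the assumed threshold
```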
Holistic Evaluation Approaches
The survey also presents holistic evaluation frameworks such as HELM and OpenAI Evals, which integrate multiple dimensions of assessment, including comprehensiveness, robustness, and alignment. These frameworks aim to capture the full spectrum of LLM capabilities and to build an understanding of performance across diverse and complex scenarios.
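The shape of such a framework can be sketched as a scenario-by-metric report. The scenario names, metric names, and `run_scenario` hook below are hypothetical simplifications; HELM's actual taxonomy spans accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency across dozens of scenarios.

```python
from statistics import mean

# Sketch of a holistic, HELM-style report: evaluate one model on several
# scenarios, record multiple metrics per scenario, then average each
# metric across scenarios. All names here are illustrative assumptions.

SCENARIOS = ["question_answering", "summarization", "sentiment"]
METRICS = ["accuracy", "robustness", "toxicity"]

def holistic_report(run_scenario):
    """Collect every metric for every scenario, then average per metric."""
    table = {s: run_scenario(s) for s in SCENARIOS}  # scenario -> scores
    summary = {m: mean(table[s][m] for s in SCENARIOS) for m in METRICS}
    return table, summary

if __name__ == "__main__":
    def stub(scenario):  # stand-in evaluator returning fixed scores
        return {"accuracy": 0.80, "robustness": 0.70, "toxicity": 0.05}
    _, summary = holistic_report(stub)
    print(summary)  # averaged score per metric
```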
Future Directions
Looking ahead, the survey advocates for evaluations that are dynamic, comprehensive, and centered on real-world applications. It emphasizes enhancement-oriented evaluation, which does not merely benchmark capabilities but also identifies weaknesses and points toward concrete improvements. This forward-looking approach aims to align the evolution of LLMs with societal needs and ethical standards, promoting safer and more effective AI deployment.
The paper serves as a valuable resource for understanding the complexity and breadth of LLM evaluation. By offering a structured taxonomy and highlighting specific benchmarks and methodologies, it provides a foundation for advancing AI research and development while keeping that progress anchored to societal benefit.