
Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

Published 15 Aug 2023 in cs.CL and cs.AI (arXiv:2308.07902v1)

Abstract: From pre-trained language models (PLMs) to LLMs, the field of NLP has witnessed steep performance gains and wide practical use. The evaluation of a research field guides its direction of improvement. However, LLMs are extremely hard to evaluate thoroughly, for two reasons. First, traditional NLP tasks have become inadequate due to the excellent performance of LLMs. Second, existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios. To tackle these problems, existing works have proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we investigate multiple papers concerning LLM evaluation. We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety. For each competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system. Finally, we give suggestions on future directions for LLM evaluation.


Summary

  • The paper presents a core competency framework to address key evaluation challenges for large language models.
  • It benchmarks reasoning, knowledge, reliability, and safety using tasks like the Winograd Schema Challenge and factual consistency tests.
  • The study recommends dynamic, real-world evaluation tasks and transparent metrics to drive future improvements in LLM performance.

The paper "Through the Lens of Core Competency: Survey on Evaluation of LLMs," published in August 2023, provides a comprehensive survey of the evaluation of LLMs. It tackles the underlying challenges of evaluation amid significant advances in NLP as the field transitions from pre-trained language models (PLMs) to LLMs. The authors argue that traditional NLP evaluation tasks and benchmarks are no longer sufficient, given the exceptional performance of LLMs and the expansive range of their real-world applications.

To address these evaluation challenges, the paper introduces a framework centered around four core competencies for assessing LLMs:

  1. Reasoning:
    • Definition: This competency involves the model's ability to perform logical inferences, understand context, and apply common-sense knowledge to draw conclusions.
    • Benchmarks and Metrics: Benchmarks such as the Winograd Schema Challenge and logical reasoning tasks are included to evaluate this competency.
  2. Knowledge:
    • Definition: This focuses on the LLM's capacity to store and retrieve information accurately from its training data.
    • Benchmarks and Metrics: Tasks like question answering and factual consistency checks are employed to measure this competency.
  3. Reliability:
    • Definition: This competency covers the consistency and robustness of the model's outputs, in particular its ability to provide stable, reliable information across varied contexts and input perturbations.
    • Benchmarks and Metrics: Robustness checks, adversarial testing, and consistency evaluation metrics are used to assess reliability.
  4. Safety:
    • Definition: This highlights the importance of ensuring that LLMs generate harmless and ethically sound outputs, mitigating biases and preventing harmful content.
    • Benchmarks and Metrics: Safety benchmarks focus on bias detection, toxicity levels, and ethical evaluations.
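Most of the benchmarks above reduce to accuracy over labeled items. As a concrete illustration, a Winograd Schema Challenge item asks which of two candidate referents a pronoun resolves to, and the metric is simply the fraction answered correctly. The sketch below is illustrative, not the paper's harness; `model_predict` is a hypothetical stand-in for a real LLM call.

```python
# Hedged sketch: accuracy scoring on Winograd-style coreference items.
# A real evaluation would replace `model_predict` with an LLM query.

WINOGRAD_ITEMS = [
    # (sentence containing an ambiguous pronoun, candidate referents, gold index)
    ("The trophy didn't fit in the suitcase because it was too big.",
     ["the trophy", "the suitcase"], 0),
    ("The trophy didn't fit in the suitcase because it was too small.",
     ["the trophy", "the suitcase"], 1),
]

def model_predict(sentence, candidates):
    """Hypothetical model call: returns the index of the chosen referent.
    A trivial keyword heuristic stands in so the sketch runs end to end."""
    return 0 if "big" in sentence else 1

def accuracy(items, predict):
    """Fraction of items where the model picks the gold referent."""
    correct = sum(predict(sent, cands) == gold for sent, cands, gold in items)
    return correct / len(items)

print(accuracy(WINOGRAD_ITEMS, model_predict))  # 1.0 with the toy heuristic
```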

The paper details how each of these competencies is defined and measured using various benchmarks and metrics. It also explores how tasks associated with each competency can be combined to provide a composite evaluation, making it easier to add new tasks as the field evolves.
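The composite idea can be sketched as a simple roll-up: each task registers under one of the four competencies, per-task scores average into a per-competency score, and adding a new task is just a new registry entry. The task names and scores below are illustrative placeholders, and the unweighted mean is one possible aggregation, not the paper's prescribed formula.

```python
# Hedged sketch of a competency roll-up, assuming each task maps to exactly
# one competency and scores are normalized to [0, 1].
from collections import defaultdict

def composite_scores(task_results):
    """task_results: {task_name: (competency, score)} -> {competency: mean score}."""
    buckets = defaultdict(list)
    for competency, score in task_results.values():
        buckets[competency].append(score)
    # Unweighted mean per competency; a real harness might weight tasks.
    return {comp: sum(scores) / len(scores) for comp, scores in buckets.items()}

# Illustrative placeholder results; adding a task is one more registry entry.
results = {
    "winograd_schema": ("reasoning",   0.82),
    "logic_puzzles":   ("reasoning",   0.64),
    "open_domain_qa":  ("knowledge",   0.71),
    "adversarial_qa":  ("reliability", 0.58),
    "toxicity_probe":  ("safety",      0.90),
}
print(composite_scores(results))
```

The averaging step is where similar tasks "combine to reflect the corresponding ability," as the paper puts it: the framework stays stable while the task registry grows.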

Finally, the authors offer suggestions for future directions in LLM evaluation, emphasizing the need for dynamic, real-world tasks and for more comprehensive, transparent metrics that can keep pace with the field's rapid development. This holistic approach aims to standardize evaluations, drive improvements, and ensure that LLM performance aligns with real-world requirements.
