- The paper provides a systematic review categorizing LLM benchmarks into general capabilities, domain-specific, and target-specific aspects.
- The paper identifies key challenges, including data leakage, cultural bias, and simplistic evaluation metrics, that can inflate or misrepresent reported LLM performance.
- The paper recommends dynamic, interactive evaluations and broader multilingual, multidisciplinary benchmarks to advance AI reliability.
A Survey on LLM Benchmarks (2508.15361)
This paper offers a comprehensive, systematic review of the LLM evaluation landscape, categorizing diverse benchmarks into general-capability, domain-specific, and target-specific groups.
Introduction to LLMs and Benchmarking
The introduction of the Transformer architecture marked a significant paradigm shift in AI, particularly in natural language processing. As LLMs have grown exponentially in scale, models such as the GPT and LLaMA series have permeated sectors including customer service and healthcare. Given these models' powerful capabilities and widening deployment, effective benchmarking systems are imperative to assess and guide their development. This survey evaluates 283 benchmarks, highlighting the challenges posed by data contamination, cultural bias, and the lack of comprehensive evaluation metrics.
Figure 1: A timeline of representative LLM benchmarks.
Taxonomy of LLM Benchmarks
The paper classifies LLM benchmarks into three main categories: General Capabilities, Domain-Specific, and Target-Specific Benchmarks. Each category targets different aspects of LLM functionalities.
- General Capabilities Benchmarks:
- Encompasses linguistics, knowledge, and reasoning benchmarks.
- Assesses tasks such as natural language understanding (NLU), commonsense reasoning, and multilingual capability.
- Includes benchmarks such as GLUE, SuperGLUE, MMLU, and BIG-Bench (a minimal scoring sketch follows this list).
- Domain-Specific Benchmarks:
- Focuses on fields like natural sciences, humanities, social sciences, engineering, and technology.
- Distinguishes multidisciplinary benchmarks from specialized subfields that require expert-level knowledge.
- Target-Specific Benchmarks:
- Evaluates issues such as safety, hallucination, robustness, and data leakage.
- Benchmarks include HateCheck, ToxiGen, and RealtimeQA.
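Benchmarks in the general-capabilities category, such as MMLU, are usually scored as multiple-choice accuracy. The sketch below is a minimal illustration of that protocol, assuming a hypothetical item format and a stand-in `model_answer` callable rather than any real benchmark's harness.

```python
# Minimal sketch of MMLU-style multiple-choice scoring (accuracy over A-D choices).
# `model_answer` and the example item are hypothetical stand-ins, not any
# benchmark's real harness or data.
from typing import Callable

LETTERS = "ABCD"

def format_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its four options as a single prompt string."""
    lines = [question] + [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(items: list[dict], model_answer: Callable[[str], str]) -> float:
    """Fraction of items where the model's predicted letter matches the gold letter."""
    correct = 0
    for item in items:
        prompt = format_prompt(item["question"], item["choices"])
        pred = model_answer(prompt).strip().upper()[:1]  # keep only the leading letter
        correct += pred == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    items = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
    print(accuracy(items, lambda prompt: "B"))  # trivial stand-in model, prints 1.0
```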
Challenges and Shortcomings
Despite the increased diversity and rigor of LLM benchmarks, several persistent challenges remain. Current benchmarks suffer from data leakage, whereby model performance is inflated by exposure to test data during training. Static evaluations fail to capture dynamic, real-world scenarios, and simplistic metrics inadequately describe LLMs' multifaceted abilities. The paper highlights an over-reliance on accuracy and BLEU scores, which fail to represent the intricacies of human language comprehension and generation.
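To make the data-leakage concern concrete, one widely used safeguard (not specific to this survey) is to flag test items whose long n-grams also appear in the training corpus. The sketch below assumes an in-memory corpus and plain-string items purely for illustration; production pipelines typically hash n-grams over sharded corpora.

```python
# Minimal sketch of an n-gram overlap contamination check between test items and
# a training corpus. Assumes everything fits in memory as plain strings.
# The 13-gram window is a common but arbitrary choice.

def ngrams(text: str, n: int = 13) -> set:
    """Lowercased word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(test_items: list[str], corpus: str, n: int = 13) -> list[bool]:
    """True for each test item that shares at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus, n)
    return [bool(ngrams(item, n) & corpus_grams) for item in test_items]
```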
Implications for Future Developments
The survey provides insights into emerging directions for future benchmark development:
- Dynamic and Interactive Evaluations:
- To address data contamination and the limitations of static tests, benchmarks such as LiveBench shift towards dynamic, real-time data and interactive testing environments (see the sketch after this list).
- Benchmark Composition and Diversity:
- Incorporating diverse languages, cultural contexts, and multidisciplinary perspectives is essential for developing universally applicable models. The expansion of multilingual benchmarks such as XTREME shows progress but also highlights the need for broader language coverage.
- Comprehensive Evaluation:
- Future benchmarks are encouraged to expand beyond language understanding, accommodating comprehensive problem-solving, ethical reasoning, and robust assessments of reliability and trustworthiness.
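The dynamic-evaluation direction noted in the first item above can be reduced to a simple idea: only score a model on items created after its training-data cutoff, so leaked or memorized test data cannot inflate results. The sketch below illustrates this generic filter under assumed item fields and dates; it is not LiveBench's actual pipeline.

```python
# Generic sketch of cutoff-aware test selection for dynamic evaluation: a model is
# only scored on items written after its training-data cutoff, so memorized test
# data cannot help. Item schema and dates are illustrative assumptions.
from datetime import date

def fresh_items(items: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only items created after the model's training-data cutoff."""
    return [item for item in items if item["created"] > training_cutoff]

if __name__ == "__main__":
    items = [
        {"question": "Pre-cutoff question", "created": date(2023, 1, 15)},
        {"question": "Post-cutoff question", "created": date(2024, 8, 1)},
    ]
    print(fresh_items(items, training_cutoff=date(2024, 1, 1)))  # keeps only the second item
```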
The survey concludes that benchmark innovation is vital for advancing LLM technology, improving evaluation paradigms, and ensuring relevant and responsible AI development. As LLMs integrate further into societal and industrial frameworks, the foundational role of sophisticated benchmarks becomes increasingly evident.