A Survey on Large Language Model Benchmarks (2508.15361v1)
Abstract: In recent years, as the depth and breadth of LLM capabilities have expanded rapidly, a growing number of corresponding evaluation benchmarks have emerged. As quantitative tools for assessing model performance, benchmarks are not only a core means of measuring model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We present the first systematic review of the current status and development of LLM benchmarks, categorizing 283 representative benchmarks into three classes: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields such as the natural sciences, humanities and social sciences, and engineering and technology; target-specific benchmarks address risks, reliability, agents, and related concerns. We point out that current benchmarks suffer from problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and a lack of evaluation of process credibility and dynamic environments, and we offer a reference design paradigm for future benchmark innovation.
Summary
- The paper surveys 283 LLM benchmarks, categorizing them into general-capability, domain-specific, and target-specific evaluations to capture their evolution and open challenges.
- It highlights methodological innovations such as adversarial filtering and multilingual testing while addressing issues like data contamination and cultural bias.
- The study calls for unified, dynamic frameworks that align model assessment with real-world deployment and responsible, robust behavior.
A Comprehensive Survey of LLM Benchmarks
Introduction
The paper "A Survey on LLM Benchmarks" (2508.15361) provides an exhaustive and systematic review of the landscape of benchmarks for evaluating LLMs. The authors categorize 283 benchmarks into three principal classes: general capabilities, domain-specific, and target-specific. The survey not only traces the historical evolution and taxonomy of LLM benchmarks but also critically analyzes their design motivations, methodological innovations, and persistent limitations. The work highlights the centrality of benchmarks in both measuring and shaping LLM development, and it identifies key challenges such as data contamination, cultural bias, and the lack of dynamic or process-oriented evaluation.
Figure 1: A timeline of representative LLM benchmarks, illustrating the rapid evolution and diversification of evaluation paradigms.
Taxonomy and Evolution of LLM Benchmarks
The survey introduces a hierarchical taxonomy that organizes benchmarks by their primary evaluation focus: general capabilities (linguistic, knowledge, reasoning), domain-specific (natural sciences, humanities/social sciences, engineering/technology), and target-specific (risk/reliability, agentic behavior, others). This taxonomy reflects the increasing breadth and granularity of LLM evaluation, moving from early single-task, monolingual leaderboards to multidimensional, multilingual, and scenario-driven frameworks.
The historical analysis reveals several distinct phases:
- Fragmented Task Evaluation to Unified Frameworks: Early benchmarks such as GLUE and SuperGLUE focused on natural language understanding (NLU) via single-task or multi-task settings, but were quickly saturated by advances in model architecture and scale.
- Adversarial and Multilingual Expansion: The emergence of adversarially filtered datasets (e.g., HellaSwag, WinoGrande) and multilingual benchmarks (e.g., XTREME, MDIA) exposed the brittleness of models to dataset artifacts and linguistic diversity.
- Generative and Holistic Assessment: The shift to generative models necessitated new metrics (BERTScore, BLEURT, BARTScore) and holistic frameworks (HELM, BIG-Bench) that assess a broader spectrum of capabilities, including open-ended generation, robustness, and fairness (a minimal metric-usage sketch follows this list).
- Dynamic and Agentic Evaluation: Recent benchmarks target dynamic, interactive, and agentic behaviors, reflecting the deployment of LLMs in real-world, multi-turn, and tool-augmented scenarios.
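To make the metric shift concrete, the following is a minimal sketch of scoring generated text with BERTScore via the open-source `bert-score` package; the candidate and reference strings are illustrative, and the call shown is not taken from the surveyed benchmarks themselves.

```python
# Minimal sketch of scoring generated text with BERTScore, assuming the
# open-source `bert-score` package (pip install bert-score); the candidate and
# reference strings below are illustrative only.
from bert_score import score

candidates = ["The cat sat on the mat."]        # hypothetical model outputs
references = ["A cat was sitting on the mat."]  # hypothetical gold references

# Precision, recall, and F1 are computed from contextual token embeddings
# rather than surface n-gram overlap, which is what distinguishes these metrics.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```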
General Capabilities Benchmarks
Linguistic Core
Benchmarks in this category have evolved from static, monolingual datasets to dynamic, multilingual, and context-sensitive frameworks. The survey details the progression from GLUE/SuperGLUE to adversarial and multilingual benchmarks, and finally to LLM-as-judge paradigms (e.g., MT-Bench, BiGGenBench) that leverage LLMs for fine-grained, context-aware evaluation. The authors emphasize the persistent challenges of typological bias, resource asymmetry, and the epistemological risks of self-referential evaluation.
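As an illustration of the LLM-as-judge pattern, below is a minimal sketch of single-answer grading in the spirit of MT-Bench. The prompt wording, the rating format, and the `call_judge` callable (any chat client that maps a prompt string to a reply string) are illustrative assumptions, not the benchmarks' actual implementations.

```python
# Minimal sketch of single-answer LLM-as-judge grading in the spirit of MT-Bench.
# The prompt wording, rating format, and `call_judge` callable are assumptions.
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are an impartial judge. Rate the assistant's answer to the
user question below on a 1-10 scale for helpfulness, relevance, and accuracy.
Reply with your rating in the form "Rating: [[X]]".

[Question]
{question}

[Assistant's Answer]
{answer}"""

def judge_answer(question: str, answer: str, call_judge: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a score and parse it from the free-form reply."""
    reply = call_judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", reply)
    return int(match.group(1)) if match else None

# Usage with a stubbed judge; in practice `call_judge` wraps an API or local model.
print(judge_answer("What is 2+2?", "4", lambda prompt: "Rating: [[9]]"))
```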
Knowledge
Knowledge-oriented benchmarks have shifted from open-domain QA with external context to closed-book, multi-task, and human-centric evaluations (e.g., MMLU, GPQA, SuperGPQA). The survey highlights the methodological dominance of MCQA formats, the emergence of "Google-Proof" and hierarchical knowledge assessments, and the expansion to multilingual and multimodal settings. The authors underscore the critical issue of data contamination and the limitations of static, culturally narrow benchmarks.
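The MCQA protocol that dominates these knowledge benchmarks reduces to a simple loop: format each item as a lettered prompt, extract the model's letter choice, and report accuracy. The sketch below assumes a hypothetical item schema and an `ask_model` callable; it mirrors the common setup of MMLU-style evaluation rather than any specific harness.

```python
# Minimal sketch of the MCQA loop behind MMLU-style knowledge evaluation:
# format each item as a lettered prompt, take the model's letter choice, and
# report accuracy. The item schema and `ask_model` callable are assumptions.
from typing import Callable

def format_item(question: str, choices: list[str]) -> str:
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def mcqa_accuracy(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """items: [{'question': ..., 'choices': [four options], 'answer': 'A'..'D'}, ...]"""
    correct = 0
    for item in items:
        reply = ask_model(format_item(item["question"], item["choices"]))
        predicted = reply.strip()[:1].upper()  # naive parse: first character of the reply
        correct += int(predicted == item["answer"])
    return correct / len(items) if items else 0.0

# Usage with one toy item and a stubbed model that always answers "B".
toy = [{"question": "Which planet is largest?",
        "choices": ["Mars", "Jupiter", "Venus", "Mercury"], "answer": "B"}]
print(mcqa_accuracy(toy, lambda prompt: "B"))
```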
Reasoning
Reasoning benchmarks are categorized into logical, specialized/commonsense, and applied/contextual reasoning. The survey documents the increasing complexity and diversity of reasoning tasks, from formal logic (RuleTaker, LogicBench) to causal, mathematical, and analogical reasoning (Corr2Cause, GSM-Symbolic, ANALOGICAL), and finally to real-world, multi-hop, and interactive scenarios (HotpotQA, LiveBench, TextGames). The analysis identifies a trend toward dynamic, process-oriented, and explainable evaluation, as well as the need for multilingual and tool-integrated reasoning benchmarks.
Domain-Specific Benchmarks
Natural Sciences
Benchmarks in mathematics, physics, chemistry, and biology have become increasingly specialized and rigorous, with a focus on abstract reasoning, symbolic manipulation, and multimodal integration. The survey details the progression from grade-school math (GSM8K) to Olympiad and graduate-level problems (Omni-MATH, FrontierMath), the incorporation of vision-essential and simulation-based physics tasks (PhysUniBench, FEABench), and the emergence of safety and ethical evaluation in chemistry (ChemSafetyBench). Biology benchmarks are noted for their emphasis on pathway reasoning and graph-augmented evaluation (BioMaze).
Humanities and Social Sciences
Legal, IP, education, psychology, and finance benchmarks are analyzed in terms of task taxonomy, dataset construction, and evaluation methodology. The survey notes the adoption of cognitive taxonomies (e.g., Bloom's, DOK) for legal and IP tasks, the simulation of real-world educational and psychological scenarios, and the need for fine-grained, scenario-oriented, and multimodal evaluation. The lack of multilingual and cross-jurisdictional benchmarks is identified as a significant gap.
Engineering and Technology
Software engineering benchmarks have matured from function-level code generation (HumanEval, MBPP) to repository-level, class-level, and domain-specific tasks (SWE-bench, ClassEval, MMCode). The survey also covers benchmarks for hardware design, power systems, and aerospace, emphasizing the importance of execution-based, safety-critical, and workflow-integrated evaluation. The authors call for holistic, system-level, and human-AI collaborative benchmarks to bridge the gap between benchmark performance and real-world deployment.
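Execution-based code benchmarks in this lineage typically report pass@k, estimated without bias from n sampled programs per problem, of which c pass the unit tests. The sketch below follows the standard estimator popularized with HumanEval; the sample counts are hypothetical.

```python
# Sketch of the unbiased pass@k estimator used by execution-based code benchmarks
# in the HumanEval lineage: n candidate programs are sampled per problem, c of
# them pass the unit tests, and pass@k is averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n (with c correct) passes."""
    if n - c < k:
        return 1.0  # fewer failing samples than k, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem pass counts out of 200 samples, reported as pass@10.
pass_counts = [5, 0, 42, 200]
print(sum(pass_at_k(200, c, 10) for c in pass_counts) / len(pass_counts))
```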
Target-Specific Benchmarks
Risk and Reliability
The survey provides a detailed taxonomy of benchmarks for safety, hallucination, robustness, and data leakage. Safety benchmarks have evolved from static, English-only datasets to adversarial, multilingual, and dynamic frameworks (JailbreakBench, SG-Bench, HarmBench). Hallucination benchmarks address both factual and faithfulness errors, with a focus on domain-specific and adversarial scenarios (TruthfulQA, FActScore, MedHallu). Robustness benchmarks target adversarial, instruction-following, and prompt-invariant evaluation (AdvGLUE, IFEval, PromptRobust). Data leakage benchmarks assess privacy risks and PII exposure (WikiMIA, C2LEVA). The authors highlight the need for unified, dynamic, and system-level risk assessment protocols.
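For the data-leakage category, a representative detection signal is a Min-K%-Prob style membership score of the kind probed by WikiMIA: if even a text's least likely tokens receive relatively high probability under the model, the text is more likely to have appeared in pretraining. The sketch below operates on precomputed per-token log-probabilities and uses an illustrative threshold; it is a simplified sketch, not any benchmark's reference implementation.

```python
# Simplified sketch of a Min-K%-Prob style membership score: average the
# log-probabilities of the lowest-k fraction of a text's tokens under the model.
# Member texts tend to score higher. Obtaining the per-token log-probabilities
# from an actual model is left abstract; the threshold below is illustrative.
def min_k_prob_score(token_logprobs: list[float], k: float = 0.2) -> float:
    """Mean log-probability over the k fraction of least likely tokens."""
    n_lowest = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_lowest]
    return sum(lowest) / len(lowest)

def looks_like_member(token_logprobs: list[float], threshold: float = -4.0) -> bool:
    """Flag a text as a likely pretraining member if its score exceeds the threshold."""
    return min_k_prob_score(token_logprobs) > threshold

# Usage with hypothetical per-token log-probabilities for one candidate text.
print(looks_like_member([-0.3, -1.2, -0.8, -5.1, -2.4, -0.9]))
```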
Agentic Evaluation
Agent benchmarks are organized into specific capability, integrated capability, domain proficiency, and safety/risk evaluation. The survey documents the emergence of planning, tool-use, multi-agent, and embodied benchmarks (FlowBench, MultiAgentBench, Embodied Agent Interface), as well as domain-specific and safety-critical agent tasks (AgentClinic, AgentHarm, SafeAgentBench). The analysis emphasizes the necessity of compositional, continuous, and robustness-oriented evaluation for trustworthy agentic autonomy.
Other Human-Centered Benchmarks
The survey also covers benchmarks for cultural adaptability, emotional intelligence, value alignment, and real-world task execution (CDEval, EmotionQueen, Shopping MMLU, TP-RAG). These benchmarks reflect the trend toward multidimensional, socially-aware, and application-driven evaluation.
Key Challenges and Future Directions
The survey identifies several persistent and emerging challenges:
- Data Contamination: The prevalence of benchmark data in pretraining corpora undermines the validity of static evaluations, necessitating dynamic, contamination-resistant benchmarks (a simple overlap screen is sketched after this list).
- Cultural and Linguistic Bias: The dominance of English and Western-centric benchmarks distorts performance measurement and limits global applicability.
- Static vs. Dynamic Evaluation: Static, single-turn benchmarks fail to capture the dynamic, interactive, and process-oriented behaviors required in real-world deployment.
- Evaluation Methodology: Over-reliance on MCQA and accuracy metrics fails to assess reasoning processes, explainability, and uncertainty.
- Resource Asymmetry: High computational and data requirements for comprehensive evaluation perpetuate exclusion and limit accessibility.
- Agentic and System-Level Assessment: The integration of LLMs into complex, multi-agent, and tool-augmented systems demands new benchmarks for compositionality, robustness, and human-AI collaboration.
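As a concrete example for the contamination point above, a common first-pass screen is n-gram overlap between benchmark items and the pretraining corpus. The sketch below uses a 13-gram default window and a toy in-memory index as illustrative assumptions; production pipelines index the corpus at scale.

```python
# Minimal sketch of an n-gram overlap screen for benchmark contamination:
# flag an evaluation item if any of its n-grams also occurs in the pretraining
# corpus. The 13-gram default and the in-memory index are illustrative choices.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(item: str, corpus_index: set[tuple[str, ...]], n: int = 13) -> bool:
    """True if the item shares at least one n-gram with the indexed corpus."""
    return not ngrams(item, n).isdisjoint(corpus_index)

# Usage: build the index once over training documents, then screen each item
# (a short 3-gram window is used here only so the toy strings overlap).
corpus_index = ngrams("toy pretraining document text used only for illustration", n=3)
items = ["toy pretraining document sentence to screen"]
print([q for q in items if is_contaminated(q, corpus_index, n=3)])
```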
The authors advocate for the development of unified, dynamic, and causally-informative evaluation frameworks that align technical rigor with societal values, inclusivity, and robustness.
Conclusion
This survey provides a comprehensive mapping of the LLM benchmark landscape, tracing the evolution from task-specific leaderboards to multidimensional, scenario-driven, and risk-aware frameworks. The analysis reveals fundamental tensions between breadth and depth, generality and specialization, and technical and societal objectives. As LLMs become integral to sociotechnical systems, the design of benchmarks must evolve to capture not only what models can do, but how they should perform responsibly, robustly, and inclusively in dynamic, high-stakes environments. The future of LLM evaluation will require cross-disciplinary collaboration, methodological innovation, and a commitment to aligning model assessment with real-world impact.