Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation (2306.05783v3)

Published 9 Jun 2023 in cs.CL

Abstract: New Natural Language Processing (NLP) benchmarks are urgently needed to keep pace with the rapid development of LLMs. We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines drawn from 13 subjects, and is accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. We anticipate Xiezhi will help analyze important strengths and shortcomings of LLMs, and the benchmark is released at https://github.com/MikeGu721/XiezhiBenchmark.

Overview of "Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation"

The paper introduces Xiezhi, a comprehensive benchmark designed for evaluating the domain knowledge capabilities of LLMs. As the development of LLMs accelerates, there is a pressing need for benchmarks that can adequately measure the breadth and depth of these models' understanding across various knowledge domains. The authors present Xiezhi as a large-scale, multidimensional evaluation benchmark that aims to fill this gap by providing a robust framework to assess LLMs across numerous disciplines.

Xiezhi distinguishes itself by encompassing 249,587 multiple-choice questions drawn from 516 disciplines across 13 categories, including philosophy, science, and engineering. This scale and diversity allow for an extensive assessment of LLMs, providing a broader understanding of their capabilities and limitations. The benchmark combines manually annotated questions with automatically generated and labeled content, yielding a comprehensive and continuously updated evaluation framework.
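To make the benchmark's structure concrete, a Xiezhi-style item can be modeled as a small record carrying the question stem, its candidate options, the gold answer, and its discipline and subject labels. The field names and example values below are illustrative assumptions for exposition, not the actual schema of the released data:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class XiezhiItem:
    """Hypothetical schema for one multiple-choice item (field names are assumptions)."""
    question: str        # question stem
    options: List[str]   # candidate answers; Xiezhi evaluates with many options per item
    answer: str          # the gold option
    discipline: str      # one of the 516 disciplines, e.g. "Computer Science"
    subject: str         # one of the 13 top-level categories, e.g. "Engineering"


# An illustrative item, not taken from the released data:
item = XiezhiItem(
    question="Which data structure offers O(1) average-case lookup by key?",
    options=["Hash table", "Linked list", "Binary heap", "Stack"],
    answer="Hash table",
    discipline="Computer Science",
    subject="Engineering",
)
```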

Key Contributions and Findings

  1. Comprehensive Coverage: Xiezhi includes questions from 516 disciplines categorized into 13 distinct fields. This makes it one of the most comprehensive benchmarks for domain knowledge evaluation, covering areas like natural sciences, humanities, engineering, and more.
  2. Automatic Updates: To keep pace with the rapidly evolving training data of LLMs, Xiezhi integrates automatic updates. This ensures that the benchmark remains relevant and challenging, thus providing a more accurate measure of a model's current capabilities.
  3. Evaluation Methodology: The authors propose an evaluation methodology in which each question is presented with a much larger pool of answer options (50 per question) than the four options typical of traditional benchmarks. This lowers the chance of a correct random guess from 25% to 2% and provides a clearer picture of a model's true understanding; see the scoring sketch after this list.
  4. Quantified Performance Gaps: The results from testing 47 LLMs reveal performance trends and disparities. Notably, state-of-the-art LLMs surpass average human practitioners in fields like science and engineering while still lagging in areas such as law and literature.
  5. Open Source and Accessibility: All evaluation code and data are made public, promoting transparency and enabling further research by providing a shared resource for the community.
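To illustrate how such a many-option setting can be scored, the sketch below ranks all candidate options by their log-likelihood under a causal language model and returns the rank of the gold answer, from which metrics such as accuracy or MRR follow. It assumes the Hugging Face transformers toolchain and a generic `gpt2` checkpoint for demonstration; the paper's own ranking pipeline, prompts, and models may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def option_log_likelihood(model, tokenizer, question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the question prompt."""
    prompt = f"Question: {question}\nAnswer:"
    # Assumes the prompt tokenization is a prefix of the full tokenization
    # (holds for most tokenizers when the option starts with a space).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]                        # tokens predicted at each position
    token_scores = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_scores[0, n_prompt - 1:].sum().item()  # score only the option tokens


def rank_of_gold(model, tokenizer, question: str, options: list, gold: str) -> int:
    """1-based rank of the gold answer when all options are sorted by likelihood."""
    scores = {opt: option_log_likelihood(model, tokenizer, question, opt) for opt in options}
    ranked = sorted(options, key=scores.get, reverse=True)
    return ranked.index(gold) + 1


# Usage sketch: with 50 options, random guessing places the gold answer at rank 1
# only 2% of the time, versus 25% with four options.
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# rank = rank_of_gold(model, tokenizer, item.question, fifty_options, item.answer)
# mrr = 1.0 / rank
```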

Implications and Future Directions

The introduction of Xiezhi has several implications for both the development and evaluation of LLMs:

  • Benchmark Longevity: By incorporating a self-updating mechanism, Xiezhi can maintain its relevance longer than static benchmarks, which often become outdated as they are incorporated into model training datasets.
  • Comprehensive Skills Assessment: The breadth of disciplines included allows for a more detailed understanding of LLM capabilities, potentially highlighting areas where advanced models excel or need improvement.
  • Informed Model Development: Insights gained from Xiezhi can inform researchers about the strengths and weaknesses of existing LLMs, guiding the development of more balanced and capable models in areas where they currently underperform.

In terms of future developments, one significant direction is further expanding the cultural and linguistic diversity of the benchmark. As noted in the paper, the current version has a distinct focus on Chinese academic content, which may not fully represent global perspectives across all disciplines. Additionally, exploring alternative assessment metrics beyond multiple-choice questions could provide a more nuanced understanding of LLMs' reasoning and comprehension abilities.

In conclusion, Xiezhi represents a substantial advance in the tools available for evaluating LLMs, providing a detailed and scalable approach to assessing domain knowledge. This benchmark not only aids in benchmarking current models but also sets a standard for future developments in AI evaluation, ensuring that these models can be rigorously tested across a diverse range of knowledge areas.

Authors (19)
  1. Zhouhong Gu
  2. Xiaoxuan Zhu
  3. Haoning Ye
  4. Lin Zhang
  5. Jianchen Wang
  6. Sihang Jiang
  7. Zhuozhi Xiong
  8. Zihan Li
  9. Qianyu He
  10. Rui Xu
  11. Wenhao Huang
  12. Zili Wang
  13. Shusen Wang
  14. Weiguo Zheng
  15. Hongwei Feng
  16. Yanghua Xiao
  17. Yixin Zhu
  18. Weijie Wu
  19. Jingping Liu