Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation (2306.05783v3)

Published 9 Jun 2023 in cs.CL

Abstract: New Natural Language Processing (NLP) benchmarks are urgently needed to keep pace with the rapid development of LLMs. We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines drawn from 13 subjects, and is accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. We anticipate Xiezhi will help analyze important strengths and shortcomings of LLMs, and the benchmark is released at https://github.com/MikeGu721/XiezhiBenchmark.

Overview of "Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation"

The paper introduces Xiezhi, a comprehensive benchmark designed for evaluating the domain knowledge capabilities of LLMs. As the development of LLMs accelerates, there is a pressing need for benchmarks that can adequately measure the breadth and depth of these models' understanding across various knowledge domains. The authors present Xiezhi as a large-scale, multidimensional evaluation benchmark that aims to fill this gap by providing a robust framework to assess LLMs across numerous disciplines.

Xiezhi distinguishes itself by encompassing 249,587 multiple-choice questions drawn from 516 disciplines across 13 categories, including philosophy, science, and engineering. This scale and diversity allow for an extensive assessment of LLMs, providing a broader understanding of their capabilities and limitations. The benchmark combines manually annotated questions with automatically generated and labeled content, yielding a comprehensive and continuously updated evaluation framework.
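To make the benchmark's structure concrete, a Xiezhi-style item can be modeled as a small record carrying the question stem, its candidate options, the gold answer, and its discipline and subject labels. The field names and example values below are illustrative assumptions for exposition, not the actual schema of the released data:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class XiezhiItem:
    """Hypothetical schema for one multiple-choice item (field names are assumptions)."""
    question: str        # question stem
    options: List[str]   # candidate answers; Xiezhi evaluates with many options per item
    answer: str          # the gold option
    discipline: str      # one of the 516 disciplines, e.g. "Computer Science"
    subject: str         # one of the 13 top-level categories, e.g. "Engineering"


# An illustrative item, not taken from the released data:
item = XiezhiItem(
    question="Which data structure offers O(1) average-case lookup by key?",
    options=["Hash table", "Linked list", "Binary heap", "Stack"],
    answer="Hash table",
    discipline="Computer Science",
    subject="Engineering",
)
```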

Key Contributions and Findings

  1. Comprehensive Coverage: Xiezhi includes questions from 516 disciplines categorized into 13 distinct fields. This makes it one of the most comprehensive benchmarks for domain knowledge evaluation, covering areas like natural sciences, humanities, engineering, and more.
  2. Automatic Updates: To keep pace with the rapidly evolving training data of LLMs, Xiezhi integrates automatic updates. This ensures that the benchmark remains relevant and challenging, thus providing a more accurate measure of a model's current capabilities.
  3. Evaluation Methodology: The authors propose an evaluation methodology in which each question is presented with a much larger pool of answer options (50 per question) than the four options typical of traditional benchmarks. This lowers the chance of a correct random guess from 25% to 2% and provides a clearer picture of a model's true understanding; see the scoring sketch after this list.
  4. Quantified Performance Gaps: The results from testing 47 LLMs reveal performance trends and disparities. Notably, state-of-the-art LLMs surpass average human practitioners in fields like science and engineering while still lagging in areas such as law and literature.
  5. Open Source and Accessibility: All evaluation code and data are made public, promoting transparency and enabling further research by providing a shared resource for the community.
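To illustrate how such a many-option setting can be scored, the sketch below ranks all candidate options by their log-likelihood under a causal language model and returns the rank of the gold answer, from which metrics such as accuracy or MRR follow. It assumes the Hugging Face transformers toolchain and a generic `gpt2` checkpoint for demonstration; the paper's own ranking pipeline, prompts, and models may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def option_log_likelihood(model, tokenizer, question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the question prompt."""
    prompt = f"Question: {question}\nAnswer:"
    # Assumes the prompt tokenization is a prefix of the full tokenization
    # (holds for most tokenizers when the option starts with a space).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]                        # tokens predicted at each position
    token_scores = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_scores[0, n_prompt - 1:].sum().item()  # score only the option tokens


def rank_of_gold(model, tokenizer, question: str, options: list, gold: str) -> int:
    """1-based rank of the gold answer when all options are sorted by likelihood."""
    scores = {opt: option_log_likelihood(model, tokenizer, question, opt) for opt in options}
    ranked = sorted(options, key=scores.get, reverse=True)
    return ranked.index(gold) + 1


# Usage sketch: with 50 options, random guessing places the gold answer at rank 1
# only 2% of the time, versus 25% with four options.
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# rank = rank_of_gold(model, tokenizer, item.question, fifty_options, item.answer)
# mrr = 1.0 / rank
```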

Implications and Future Directions

The introduction of Xiezhi has several implications for both the development and evaluation of LLMs:

  • Benchmark Longevity: By incorporating a self-updating mechanism, Xiezhi can maintain its relevance longer than static benchmarks, which often become outdated as they are incorporated into model training datasets.
  • Comprehensive Skills Assessment: The breadth of disciplines included allows for a more detailed understanding of LLM capabilities, potentially highlighting areas where advanced models excel or need improvement.
  • Informed Model Development: Insights gained from Xiezhi can inform researchers about the strengths and weaknesses of existing LLMs, guiding the development of more balanced and capable models in areas where they currently underperform.

In terms of future developments, one significant direction is further expanding the cultural and linguistic diversity of the benchmark. As noted in the paper, the current version has a distinct focus on Chinese academic content, which may not fully represent global perspectives across all disciplines. Additionally, exploring alternative assessment metrics beyond multiple-choice questions could provide a more nuanced understanding of LLMs' reasoning and comprehension abilities.

In conclusion, Xiezhi represents a substantial advance in the tools available for evaluating LLMs, providing a detailed and scalable approach to assessing domain knowledge. This benchmark not only aids in benchmarking current models but also sets a standard for future developments in AI evaluation, ensuring that these models can be rigorously tested across a diverse range of knowledge areas.

Authors (19)
  1. Zhouhong Gu
  2. Xiaoxuan Zhu
  3. Haoning Ye
  4. Lin Zhang
  5. Jianchen Wang
  6. Sihang Jiang
  7. Zhuozhi Xiong
  8. Zihan Li
  9. Qianyu He
  10. Rui Xu
  11. Wenhao Huang
  12. Zili Wang
  13. Shusen Wang
  14. Weiguo Zheng
  15. Hongwei Feng
  16. Yanghua Xiao
  17. Yixin Zhu
  18. Weijie Wu
  19. Jingping Liu