
Evaluating the Performance of Large Language Models on GAOKAO Benchmark (2305.12474v3)

Published 21 May 2023 in cs.CL and cs.AI

Abstract: Large language models (LLMs) have demonstrated remarkable performance across various natural language processing tasks; however, comprehensively and accurately assessing their capabilities remains an open problem. This paper introduces GAOKAO-Bench, an intuitive benchmark that uses questions from the Chinese GAOKAO examination as test samples, covering both subjective and objective questions. To align with human examination practice, we design a zero-shot evaluation method for LLMs. Combining this with human evaluation, we obtain converted total scores for several LLMs, including GPT-4, ChatGPT, and ERNIE-Bot. Our findings reveal that LLMs achieve competitive scores on the Chinese GAOKAO examination, while exhibiting significant performance disparities across subjects. We also use LLMs to grade the subjective questions, and find that model-assigned scores reach a moderate level of consistency with human scores. In conclusion, this research contributes a robust evaluation benchmark for future LLMs and offers valuable insights into the advantages and limitations of such models.
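The abstract refers to deriving a "converted total score" from zero-shot model answers graded against the GAOKAO rubric. As an illustration only, below is a minimal Python sketch of how such a conversion might be computed; the per-subject input format and the scaling to the standard 750-point GAOKAO full mark are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a GAOKAO-Bench style score conversion.
# Assumes each subject reports (points_earned, points_possible) and that the
# converted total is scaled to the conventional 750-point GAOKAO full mark
# (an assumption; the paper's exact weighting scheme may differ).

FULL_MARK = 750  # assumed standard GAOKAO total

def converted_total_score(subject_results: dict[str, tuple[float, float]]) -> float:
    """Scale the aggregate earned/possible ratio to the assumed full mark."""
    earned = sum(e for e, _ in subject_results.values())
    possible = sum(p for _, p in subject_results.values())
    return FULL_MARK * earned / possible

# Example usage with made-up numbers (illustrative only):
results = {
    "Chinese": (96.0, 150.0),
    "Mathematics": (84.0, 150.0),
    "English": (121.0, 150.0),
}
print(round(converted_total_score(results), 1))
```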

Authors (6)
  1. Xiaotian Zhang (35 papers)
  2. Chunyang Li (19 papers)
  3. Yi Zong (4 papers)
  4. Zhengyu Ying (1 paper)
  5. Liang He (202 papers)
  6. Xipeng Qiu (257 papers)
Citations (70)