
ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models (2308.14353v1)

Published 28 Aug 2023 in cs.CL

Abstract: The unprecedented performance of LLMs requires comprehensive and accurate evaluation. We argue that for LLM evaluation, benchmarks need to be comprehensive and systematic. To this end, we propose the ZhuJiu benchmark, which has the following strengths: (1) Multi-dimensional ability coverage: We comprehensively evaluate LLMs across 7 ability dimensions covering 51 tasks. In particular, we also propose a new benchmark that focuses on the knowledge ability of LLMs. (2) Multi-faceted collaboration of evaluation methods: We use 3 different yet complementary evaluation methods to comprehensively evaluate LLMs, which can ensure the authority and accuracy of the evaluation results. (3) Comprehensive Chinese benchmark: ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation abilities in English. (4) Avoiding potential data leakage: To avoid data leakage, we construct evaluation data specifically for 37 tasks. We evaluate 10 current mainstream LLMs and conduct an in-depth discussion and analysis of their results. The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/ and we also provide a demo video at https://youtu.be/qypkJ89L1Ic.

Authors (9)
  1. Baoli Zhang (1 paper)
  2. Haining Xie (2 papers)
  3. Pengfan Du (2 papers)
  4. Junhao Chen (36 papers)
  5. Pengfei Cao (39 papers)
  6. Yubo Chen (58 papers)
  7. Shengping Liu (21 papers)
  8. Kang Liu (207 papers)
  9. Jun Zhao (469 papers)
Citations (1)