SuperCLUE: A Comprehensive Benchmark for Chinese LLMs
The paper introduces SuperCLUE, a comprehensive benchmark designed to evaluate the performance of Chinese LLMs across a wide spectrum of tasks. This benchmark seeks to address the existing gap where current evaluations primarily focus on model accuracy through multiple-choice questions, often neglecting user preferences and real-world applicability. SuperCLUE consists of three sub-tasks: CArena, OPEN, and CLOSE, each catering to different facets of model capabilities.
The CArena sub-task involves an interactive model battle platform, the LangYa Leaderboard, which collects user interactions and ratings. Users interact with two anonymized LLMs and rate their responses, providing insight into user preferences grounded in actual usage scenarios. The sub-task spans diverse abilities such as semantic understanding, small talk, and contextual conversation, among others. By correlating user ratings with model responses, it aims to capture nuances of user satisfaction that conventional benchmarks often overlook.
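The paper's summary does not spell out how pairwise user votes are aggregated, but arena-style leaderboards of this kind typically convert head-to-head votes into Elo-style ratings. The sketch below illustrates that general idea only; the K-factor, initial rating, and function names are illustrative assumptions, not details from SuperCLUE.

```python
# Hypothetical Elo-style aggregation of pairwise user votes, as commonly used
# by arena-style leaderboards. Constants and names are illustrative only.

INITIAL_RATING = 1000.0
K_FACTOR = 32.0

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings: dict, model_a: str, model_b: str, outcome: float) -> None:
    """Update ratings in place. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    ra = ratings.setdefault(model_a, INITIAL_RATING)
    rb = ratings.setdefault(model_b, INITIAL_RATING)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + K_FACTOR * (outcome - ea)
    ratings[model_b] = rb + K_FACTOR * ((1.0 - outcome) - (1.0 - ea))

# Example: three anonymized battles voted on by users.
ratings = {}
for a, b, outcome in [("model_x", "model_y", 1.0),
                      ("model_y", "model_z", 0.5),
                      ("model_x", "model_z", 1.0)]:
    update_elo(ratings, a, b, outcome)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```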
The OPEN sub-task introduces open-ended questions, divided into single-turn and multi-turn dialogues. These questions are crafted to emulate real-world user queries and to probe models' conversational capabilities. By emphasizing open-ended formats, this sub-task offers a fuller exploration of LLMs' ability to conduct meaningful, context-aware dialogue, more accurately reflecting their real-world applicability.
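Open-ended answers cannot be scored by string matching, and the paper (as discussed below) relies on GPT-4 as a judge. A minimal sketch of such a judging call follows; the prompt wording, the 1-10 scale, and the use of the OpenAI Python client are assumptions for illustration, not the authors' exact protocol.

```python
# Hypothetical GPT-4-as-judge scoring of one open-ended response.
# Prompt text, scoring scale, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are grading a Chinese language assistant.\n"
    "Question: {question}\n"
    "Assistant answer: {answer}\n"
    "Rate the answer from 1 (poor) to 10 (excellent) for helpfulness, "
    "correctness, and fluency. Reply with the number only."
)

def judge_response(question: str, answer: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model for a 1-10 score of a single open-ended answer."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())
```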
In contrast, the CLOSE sub-task transforms open-ended questions into closed-ended, multiple-choice formats. This sub-task allows for a conventional accuracy-based evaluation metric, providing a point of comparison against which the open-ended responses can be measured. While the closed-ended format is easier to evaluate, the authors argue that it does not fully capture the interactive and dynamic nature of user preferences.
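Scoring the CLOSE sub-task reduces to ordinary multiple-choice accuracy. The sketch below shows that computation; the item dictionaries are an assumed format, not the paper's actual data schema.

```python
# Hypothetical multiple-choice accuracy scoring for a CLOSE-style sub-task.
# The item format is an assumption, not the paper's exact schema.

def accuracy(items: list[dict], predictions: dict) -> float:
    """Fraction of items where the predicted option letter matches the gold answer."""
    correct = sum(1 for item in items
                  if predictions.get(item["id"]) == item["answer"])
    return correct / len(items) if items else 0.0

items = [
    {"id": "q1", "answer": "B"},
    {"id": "q2", "answer": "D"},
    {"id": "q3", "answer": "A"},
]
predictions = {"q1": "B", "q2": "C", "q3": "A"}
print(f"accuracy = {accuracy(items, predictions):.2f}")  # 0.67
```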
The paper evaluates a variety of models using the SuperCLUE benchmark, including globally renowned models such as GPT-4 and Chinese-specific models such as MiniMax and ChatGLM2-6B. The results show that GPT-4 consistently outperforms the other models across all metrics, highlighting a significant gap between top international models and their Chinese counterparts. Interestingly, results from the OPEN sub-task show greater variability and better discrimination among model capabilities than the more uniform results from the CLOSE sub-task, indicating that closed-ended formats have limited power to reflect real-world interaction dynamics.
One of the key findings of the paper is the high correlation between GPT-4's ratings in open-ended scenarios and human evaluations, reinforcing the reliability of LLMs as evaluators in specific contexts. Additionally, the combination of open- and closed-ended evaluations provides a more accurate picture of user preferences and model applicability, something purely accuracy-driven benchmarks might miss. The results underscore the inadequacy of relying solely on closed-ended questions for evaluative purposes and suggest the complementary use of both formats for comprehensive assessments.
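A correlation claim of this kind can be checked by comparing GPT-4's scores with human scores on the same items. The sketch below uses Spearman rank correlation via SciPy as one reasonable choice; the paper's exact agreement statistic is not assumed here, and the score lists are placeholder values.

```python
# Hypothetical agreement check between GPT-4-as-judge scores and human scores
# on the same open-ended items. Spearman correlation is one reasonable choice;
# the statistic the authors actually report is not assumed here.
from scipy.stats import spearmanr

gpt4_scores  = [8, 6, 9, 4, 7, 5, 9, 3]   # judge model's 1-10 ratings (placeholder)
human_scores = [7, 6, 9, 5, 8, 4, 9, 2]   # human raters' 1-10 ratings (placeholder)

rho, p_value = spearmanr(gpt4_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```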
The implications of this research touch on both practical and theoretical dimensions. Practically, SuperCLUE offers a more nuanced toolset for developers to fine-tune LLMs in alignment with user preferences and operational contexts, particularly within the Chinese linguistic and cultural framework. Theoretically, it advances the discourse on holistic model evaluation, encouraging the development of benchmarks that integrate both quantitative and qualitative metrics.
Looking forward, the SuperCLUE benchmark sets a precedent for future evaluations by emphasizing real-world applicability and user-centric assessment. This approach could inspire the development of similar benchmarks in other languages and regions, fostering a more inclusive understanding of LLM performance across diverse linguistic and cultural landscapes. As AI continues to integrate into daily life, such comprehensive evaluations will be crucial in ensuring these systems meet user expectations and societal needs effectively.