SuperCLUE: A Comprehensive Benchmark for Chinese LLMs
The paper introduces SuperCLUE, a comprehensive benchmark designed to evaluate the performance of Chinese LLMs across a wide spectrum of tasks. This benchmark seeks to address the existing gap where current evaluations primarily focus on model accuracy through multiple-choice questions, often neglecting user preferences and real-world applicability. SuperCLUE consists of three sub-tasks: CArena, OPEN, and CLOSE, each catering to different facets of model capabilities.
The CArena sub-task involves an interactive model battle platform, the LangYa Leaderboard, which collects user interactions and ratings. Users interact with two anonymized LLMs and rate their responses, providing insight into user preferences grounded in actual usage scenarios. The sub-task spans diverse abilities such as semantic understanding, small talk, and contextual conversation, among others. By correlating user ratings with model responses, it aims to capture nuances of user satisfaction that conventional benchmarks often overlook.
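The paper's summary does not spell out how pairwise user votes are aggregated, but arena-style leaderboards of this kind typically convert head-to-head votes into Elo-style ratings. The sketch below illustrates that general idea only; the K-factor, initial rating, and function names are illustrative assumptions, not details from SuperCLUE.

```python
# Hypothetical Elo-style aggregation of pairwise user votes, as commonly used
# by arena-style leaderboards. Constants and names are illustrative only.

INITIAL_RATING = 1000.0
K_FACTOR = 32.0

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings: dict, model_a: str, model_b: str, outcome: float) -> None:
    """Update ratings in place. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    ra = ratings.setdefault(model_a, INITIAL_RATING)
    rb = ratings.setdefault(model_b, INITIAL_RATING)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + K_FACTOR * (outcome - ea)
    ratings[model_b] = rb + K_FACTOR * ((1.0 - outcome) - (1.0 - ea))

# Example: three anonymized battles voted on by users.
ratings = {}
for a, b, outcome in [("model_x", "model_y", 1.0),
                      ("model_y", "model_z", 0.5),
                      ("model_x", "model_z", 1.0)]:
    update_elo(ratings, a, b, outcome)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```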
The OPEN sub-task introduces open-ended questions, divided into single-turn and multi-turn dialogues. These questions are crafted to emulate real-world user queries and to probe models' conversational capabilities. By emphasizing open-ended formats, this sub-task offers a fuller exploration of LLMs' ability to conduct meaningful, context-aware dialogue, more accurately reflecting their real-world applicability.
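Open-ended answers cannot be scored by string matching, and the paper (as discussed below) relies on GPT-4 as a judge. A minimal sketch of such a judging call follows; the prompt wording, the 1-10 scale, and the use of the OpenAI Python client are assumptions for illustration, not the authors' exact protocol.

```python
# Hypothetical GPT-4-as-judge scoring of one open-ended response.
# Prompt text, scoring scale, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are grading a Chinese language assistant.\n"
    "Question: {question}\n"
    "Assistant answer: {answer}\n"
    "Rate the answer from 1 (poor) to 10 (excellent) for helpfulness, "
    "correctness, and fluency. Reply with the number only."
)

def judge_response(question: str, answer: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model for a 1-10 score of a single open-ended answer."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())
```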
In contrast, the CLOSE sub-task transforms open-ended questions into closed-ended, multiple-choice formats. This sub-task allows for a conventional accuracy-based evaluation metric, providing a point of comparison against which the open-ended responses can be measured. While the closed-ended format is easier to evaluate, the authors argue that it does not fully capture the interactive and dynamic nature of user preferences.
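Scoring the CLOSE sub-task reduces to ordinary multiple-choice accuracy. The sketch below shows that computation; the item dictionaries are an assumed format, not the paper's actual data schema.

```python
# Hypothetical multiple-choice accuracy scoring for a CLOSE-style sub-task.
# The item format is an assumption, not the paper's exact schema.

def accuracy(items: list[dict], predictions: dict) -> float:
    """Fraction of items where the predicted option letter matches the gold answer."""
    correct = sum(1 for item in items
                  if predictions.get(item["id"]) == item["answer"])
    return correct / len(items) if items else 0.0

items = [
    {"id": "q1", "answer": "B"},
    {"id": "q2", "answer": "D"},
    {"id": "q3", "answer": "A"},
]
predictions = {"q1": "B", "q2": "C", "q3": "A"}
print(f"accuracy = {accuracy(items, predictions):.2f}")  # 0.67
```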
The paper evaluates a variety of models using the SuperCLUE benchmark, including globally renowned models such as GPT-4 and Chinese-specific models such as MiniMax and ChatGLM2-6B. The results show that GPT-4 consistently outperforms the other models across all metrics, highlighting a significant gap between top international models and their Chinese counterparts. Interestingly, results from the OPEN sub-task show greater variability and better discrimination among model capabilities than the more uniform results from the CLOSE sub-task, indicating that closed-ended formats have limited power to reflect real-world interaction dynamics.
One of the key findings of the paper is the high correlation between GPT-4's ratings in open-ended scenarios and human evaluations, reinforcing the reliability of LLMs as evaluators in specific contexts. Additionally, the combination of open- and closed-ended evaluations provides a more accurate picture of user preferences and model applicability, something purely accuracy-driven benchmarks might miss. The results underscore the inadequacy of relying solely on closed-ended questions for evaluative purposes and suggest the complementary use of both formats for comprehensive assessments.
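A correlation claim of this kind can be checked by comparing GPT-4's scores with human scores on the same items. The sketch below uses Spearman rank correlation via SciPy as one reasonable choice; the paper's exact agreement statistic is not assumed here, and the score lists are placeholder values.

```python
# Hypothetical agreement check between GPT-4-as-judge scores and human scores
# on the same open-ended items. Spearman correlation is one reasonable choice;
# the statistic the authors actually report is not assumed here.
from scipy.stats import spearmanr

gpt4_scores  = [8, 6, 9, 4, 7, 5, 9, 3]   # judge model's 1-10 ratings (placeholder)
human_scores = [7, 6, 9, 5, 8, 4, 9, 2]   # human raters' 1-10 ratings (placeholder)

rho, p_value = spearmanr(gpt4_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```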
The implications of this research touch on both practical and theoretical dimensions. Practically, SuperCLUE offers a more nuanced toolset for developers to fine-tune LLMs in alignment with user preferences and operational contexts, particularly within the Chinese linguistic and cultural framework. Theoretically, it advances the discourse on holistic model evaluation, encouraging the development of benchmarks that integrate both quantitative and qualitative metrics.
Looking forward, the SuperCLUE benchmark sets a precedent for future evaluations by emphasizing real-world applicability and user-centric assessment. This approach could inspire the development of similar benchmarks in other languages and regions, fostering a more inclusive understanding of LLM performance across diverse linguistic and cultural landscapes. As AI continues to integrate into daily life, such comprehensive evaluations will be crucial in ensuring these systems meet user expectations and societal needs effectively.