Analysis of "CLUE: A Chinese Language Understanding Evaluation Benchmark"
The paper "CLUE: A Chinese Language Understanding Evaluation Benchmark" introduces an important tool for the Chinese NLP community, addressing the deficiency of comprehensive benchmarks for Chinese when compared to established ones for English, like GLUE and SuperGLUE. CLUE provides a framework for evaluating a range of natural language understanding (NLU) tasks across Chinese datasets. It is aimed at fostering better cross-LLM performance evaluation and development, addressing the unique aspects of the Chinese language.
Core Contributions and Features
The benchmark comprises nine diverse tasks, covering single-sentence classification, sentence-pair classification, and machine reading comprehension, each probing different linguistic and semantic capabilities (a short data-loading sketch follows this list):
- Single-Sentence Tasks: TNEWS and IFLYTEK test short- and long-text classification respectively, while CLUEWSC2020 targets Winograd-schema-style coreference resolution.
- Sentence-Pair Tasks: AFQMC, CSL, and OCNLI focus on semantic similarity and inference, reflecting real-world needs such as question matching, keyword recognition, and natural language inference.
- Machine Reading Comprehension Tasks: CMRC 2018, ChID, and C³ require models to process longer texts, extracting answer spans, filling idiom blanks, or selecting among candidate answers based on contextual understanding.
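For orientation, here is a minimal sketch of how one of these tasks can be inspected programmatically. It assumes the community-maintained `clue` dataset mirror on the Hugging Face Hub (the paper itself distributes the data through the CLUEbenchmark GitHub repository), so the exact loading call is an assumption rather than part of the paper.

```python
# Minimal sketch: inspect the TNEWS task with the Hugging Face `datasets` library.
# Assumption: the community-hosted "clue" mirror on the Hub; the official data
# is distributed through the CLUEbenchmark GitHub repository.
from datasets import load_dataset

tnews = load_dataset("clue", "tnews")  # short-text news classification

print(tnews["train"][0])        # one example, e.g. {'sentence': ..., 'label': ..., 'idx': ...}
print(tnews["train"].features)  # feature types and label names
print({split: len(ds) for split, ds in tnews.items()})  # split sizes
```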
CLUE additionally incorporates a diagnostic dataset created by Chinese linguists that specifically targets linguistic phenomena, enabling detailed analysis of a model's linguistic and reasoning skills.
Empirical Evaluation and Results
The authors evaluate several state-of-the-art pre-trained language models, including BERT, ERNIE, and RoBERTa, on CLUE to establish baseline performance. The models vary in effectiveness, with larger models and those using whole word masking generally achieving better results; notably, RoBERTa-wwm-ext-large consistently attained the strongest scores across multiple tasks. Despite these results, a significant gap remains between machine and human performance, as evidenced by the human evaluation benchmarks reported in the paper.
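As a concrete illustration of how such baselines are typically produced, the following is a minimal fine-tuning sketch for the AFQMC sentence-pair task. It assumes the `hfl/chinese-roberta-wwm-ext` checkpoint and the community `clue` dataset mirror, and the hyperparameters are illustrative rather than the paper's exact baseline settings.

```python
# Minimal fine-tuning sketch for AFQMC (question matching), not the paper's exact recipe.
# Assumptions: the "clue" dataset mirror on the Hugging Face Hub and the
# hfl/chinese-roberta-wwm-ext checkpoint; hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "hfl/chinese-roberta-wwm-ext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

afqmc = load_dataset("clue", "afqmc")

def tokenize(batch):
    # Encode the two questions as a single sentence pair.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128)

encoded = afqmc.map(tokenize, batched=True)

args = TrainingArguments(output_dir="afqmc-baseline",
                         per_device_train_batch_size=32,
                         learning_rate=2e-5,
                         num_train_epochs=3)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"],
                  tokenizer=tokenizer)  # default collator handles dynamic padding
trainer.train()
print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy
```

The same pattern applies to the other classification-style tasks by swapping the dataset config, the input fields, and `num_labels`; the reading comprehension tasks require span-extraction or multiple-choice heads instead.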
Implications for Chinese NLP Research
The introduction of CLUE addresses a critical need within the NLP community for robust and standardized evaluation tools specifically tailored for Chinese. This is essential given the linguistic complexity and the unique syntactic features of the Chinese language, which differ considerably from English and other Indo-European languages. The benchmark promises to facilitate cross-model comparisons and drive improvements in NLU models trained on Chinese data by providing a comprehensive set of tasks with varying difficulty and domains.
Future Developments
The CLUE benchmark lays the groundwork for future exploration of more sophisticated and specialized tasks as models mature. The inclusion of a large-scale pre-training corpus alongside the benchmark promises to standardize and elevate the evaluation framework within Chinese NLP research. Future developments could involve expanding the benchmark to incorporate more diverse linguistic phenomena and larger datasets, or exploring alternative evaluation metrics that account for model efficiency.
In conclusion, the CLUE benchmark represents a significant advance in Chinese NLU, promoting transparency and consistency in model evaluation and encouraging further research and development in this field. Its open-ended nature and community-driven approach ensure it remains relevant as the landscape of NLP continues to evolve.