CLUE Framework: Chinese NLU Benchmark
- CLUE is a comprehensive Chinese NLU benchmark offering rich datasets and multi-dimensional evaluation tasks such as classification, inference, and machine reading comprehension.
- It integrates extensive pre-training corpora and standardized evaluation toolkits, ensuring reproducible research and fair model comparisons.
- The framework is designed to address unique Chinese language phenomena, revealing model weaknesses in linguistic reasoning and semantic generalization.
The acronym CLUE is used for multiple frameworks and algorithms spanning language understanding benchmarks, explainable uncertainty methods, active learning strategies, blockchain analytics, particle physics clustering, robot anomaly detection, clinical and evaluation toolkits, and forensic, multimodal, and neural calibration domains. This article presents a systematic overview of CLUE as defined in "CLUE: A Chinese Language Understanding Evaluation Benchmark" (Xu et al., 2020), mapping out its architecture, methodology, evaluated models, resource design, and scientific impact within Chinese NLP.
1. Foundation and Conceptual Scope
The CLUE framework is structured as a multi-task benchmark for Chinese Natural Language Understanding (NLU), explicitly paralleling established English-centric benchmarks such as GLUE and SuperGLUE. Its core objectives are to provide:
- Broad-coverage empirical evaluation for Chinese NLU models.
- Datasets and task diversity that capture both generic linguistic capability and Chinese-specific phenomena.
- High-quality pre-training corpora and evaluation toolkits for reproducible research.
CLUE organizes its tasks along three analytic dimensions: single-sentence classification, sentence-pair discrimination, and machine reading comprehension (MRC), all based on original Chinese text. A supplementary diagnostic dataset probes models' grasp of core linguistic and logical structures prevalent in Mandarin.
2. Task Suite and Data Characteristics
CLUE’s benchmark is composed of nine principal tasks, collectively sampling both fundamental and linguistically nuanced requirements in Chinese NLU:
| Subcategory | Task | Description / Input Type |
|---|---|---|
| Single Sentence | TNEWS | News title classification |
| Single Sentence | IFLYTEK | App description classification |
| Single Sentence | CLUEWSC2020 | Winograd Schema Challenge (anaphora resolution) |
| Sentence Pair | AFQMC | Semantic similarity assessment |
| Sentence Pair | CSL | Keyword judgment in academic abstracts |
| Sentence Pair | OCNLI | Natural language inference (natively collected) |
| Machine Reading Comprehension | CMRC 2018 | Span extraction from Wikipedia |
| Machine Reading Comprehension | ChID | Cloze test with Chinese idioms |
| Machine Reading Comprehension | C³ | Multiple-choice comprehension |
Tasks utilize authentic Chinese text, with the MRC tasks—such as ChID—expressly designed to highlight unique idiomatic and syntactic features (e.g., four-character idioms). The diagnostic NLI dataset includes entailment/contradiction/neutral samples, selectively targeting aspect markers, lexical semantics, comparatives, and anaphora. These diagnostic examples are hand-crafted by linguistic experts to uncover systematic model failure modes, such as reliance on heuristics in the face of complex lexical phenomena.
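For hands-on experimentation, the benchmark tasks can be pulled programmatically. Below is a minimal sketch assuming the community-maintained `clue` configurations on the Hugging Face Hub; this is an assumed access path, not part of the official CLUE release or PyCLUE.

```python
# Minimal sketch: loading CLUE tasks via the Hugging Face `datasets` library.
# Assumes the community-maintained "clue" dataset configurations on the Hub;
# this is not the official CLUE distribution described in the paper.
from datasets import load_dataset

# Single-sentence classification: TNEWS (news title -> category label)
tnews = load_dataset("clue", "tnews")
print(tnews["train"][0])  # e.g. {"sentence": ..., "label": ..., "idx": ...}

# Sentence-pair inference: OCNLI (premise/hypothesis -> entailment label)
ocnli = load_dataset("clue", "ocnli")

# Machine reading comprehension: CMRC 2018 (span extraction from Wikipedia)
cmrc = load_dataset("clue", "cmrc2018")
```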
3. Pre-training Corpus and Supplementary Resources
Recognizing the crucial role of large-scale pre-training in deep learning, CLUE furnishes pre-training corpora totaling approximately 214 GB, amounting to roughly 76 billion Chinese words, aggregated from:
- CLUECorpus2020-small: News, WebText QA, Wikipedia, and e-commerce comments.
- CLUECorpus2020: Extensive Common Crawl data, directly preprocessed.
- CLUEOSCAR: Filtered subset from the multilingual OSCAR corpus.
CLUE’s resource stack includes PyCLUE, a toolkit implemented in TensorFlow and PyTorch that streamlines model evaluation across all benchmark tasks. A leaderboard infrastructure supports open submissions, with “certified” status reserved for models whose results are reproducible and whose code is publicly released, maintaining the integrity of cross-model comparisons.
4. Evaluated Model Architectures and Metrics
CLUE evaluates nine transformer-based models, spanning diverse architectural design choices and parameter scales (a loading sketch follows the list):
- BERT–base / BERT-wwm-ext–base: Whole word masking (WWM) improves Chinese text modeling.
- ALBERT-tiny / ALBERT-xxlarge: Model size scaling effects.
- ERNIE–base: Integration of knowledge graph signals.
- XLNet–mid: Autoregressive pretraining with SentencePiece.
- RoBERTa–large / RoBERTa-wwm-ext–base / RoBERTa-wwm-ext–large: WWM on extended corpora.
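The whole-word-masking checkpoints in this lineup are publicly available; the sketch below shows how one such checkpoint might be loaded for a classification task via the `transformers` library. The repository name and label count are assumptions made for illustration, not prescribed by the paper.

```python
# Illustrative sketch: loading a whole-word-masking checkpoint with Hugging Face
# transformers. The repository name "hfl/chinese-roberta-wwm-ext" is an assumed
# public checkpoint comparable to those evaluated in the paper; the paper itself
# does not prescribe this loading path.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "hfl/chinese-roberta-wwm-ext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=15 matches TNEWS's 15 news categories; adjust per task.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=15)

inputs = tokenizer("这是一条科技新闻的标题", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 15)
```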
Evaluation uses standard task-specific metrics: accuracy for classification/multiple-choice, Exact Match (EM) for extractive MRC, and average task scores (arithmetic mean or weighted by task difficulty). Model fine-tuning is meticulously controlled: batch size, sequence length, learning rate, and epoch count are held consistent for fair comparison across parameter regimes.
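The scoring logic itself is simple; below is a minimal sketch of the metrics named above (accuracy, Exact Match, and an unweighted macro-average over tasks). Function names are illustrative and not taken from the official toolkit.

```python
# Minimal sketch of the scoring logic described above: accuracy for
# classification/multiple-choice tasks, Exact Match for extractive MRC,
# and an unweighted arithmetic mean across tasks. Function names are
# illustrative, not part of the official toolkit.
from statistics import mean

def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def exact_match(pred_spans, gold_spans):
    """Fraction of predicted answer spans identical to the gold spans."""
    correct = sum(p.strip() == g.strip() for p, g in zip(pred_spans, gold_spans))
    return correct / len(gold_spans)

def clue_average(task_scores):
    """Unweighted macro-average over per-task scores (one common aggregation)."""
    return mean(task_scores.values())

# Usage example with placeholder scores
scores = {"tnews": 0.58, "ocnli": 0.72, "cmrc2018": 0.70}
print(clue_average(scores))
```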
5. Design Rationale and Linguistic Significance
CLUE’s configuration is not a mere translation of English-centric benchmarks but an adaptation tailored to Chinese. Notably, Chinese text lacks explicit word boundaries and features idioms, aspect markers, and other phenomena absent from English. CLUE encompasses both direct analogs of English NLU tasks (e.g., OCNLI follows MNLI’s collection protocol with natively written Chinese data) and novel challenge sets such as ChID for idiom cloze tests. The diagnostic NLI set, constructed by professional linguists, is essential for distinguishing genuine linguistic comprehension from spurious accuracy arising from dataset artifacts.
This benchmark fills critical gaps in the ecosystem for Chinese NLP research, previously hamstrung by limited high-quality data and absence of standardized evaluation. As a community-driven, open-ended platform, CLUE fosters collaborative improvements, extending the blueprint of GLUE and SuperGLUE to a linguistically distinct domain.
6. Implementation and Benchmarking Considerations
CLUE’s methodology incorporates:
- Consistent hyperparameter scheduling across models to isolate architectural differences (a configuration sketch follows this list).
- Systematic use of PyCLUE for reproducible training and validation.
- Public leaderboard with transparency regarding reproducibility and code availability.
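A hedged sketch of what such a shared fine-tuning configuration could look like with the Hugging Face `Trainer` API is given below; the specific hyperparameter values are placeholders, not the settings reported by Xu et al.

```python
# Illustrative sketch of holding fine-tuning hyperparameters fixed across
# models so that score differences reflect architecture, not tuning effort.
# The values below are placeholders, not the settings reported in the paper.
from transformers import TrainingArguments

def shared_finetune_args(output_dir: str) -> TrainingArguments:
    return TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=32,   # same batch size for every model
        learning_rate=2e-5,               # same peak learning rate
        num_train_epochs=3,               # same epoch budget
        seed=42,                          # fixed seed for comparability
    )

# The same arguments are reused for each evaluated checkpoint.
args_bert = shared_finetune_args("runs/bert-base")
args_roberta = shared_finetune_args("runs/roberta-wwm-ext")
```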
Compute requirements scale with model size and corpus breadth, but remain standard for modern transformer-based workflows. Task architectures employ conventional fine-tuning heads: classification heads for single-sentence and sentence-pair inputs, and span-extraction heads for MRC. The pre-training corpus enables models to capture statistical nuances of Chinese, while PyCLUE abstracts much of the engineering for practitioners.
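As an illustration of the span-extraction head mentioned above, here is a minimal PyTorch sketch; class and layer names are illustrative, and this is not the implementation used in the paper or in PyCLUE.

```python
# Minimal PyTorch sketch of a span-extraction fine-tuning head of the kind
# described above for extractive MRC (e.g. CMRC 2018). Class and layer names
# are illustrative; this is not the implementation from the paper or PyCLUE.
import torch
import torch.nn as nn

class SpanExtractionHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # Projects each token's contextual vector to start/end logits.
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output: torch.Tensor):
        # sequence_output: (batch, seq_len, hidden_size) from a BERT-style encoder
        logits = self.qa_outputs(sequence_output)          # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Usage: argmax over start/end logits yields the predicted answer span,
# which is then scored with Exact Match against the gold span.
head = SpanExtractionHead(hidden_size=768)
dummy_encoder_output = torch.randn(2, 128, 768)
start, end = head(dummy_encoder_output)
print(start.shape, end.shape)  # torch.Size([2, 128]) torch.Size([2, 128])
```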
7. Impact and Scientific Utility
CLUE’s provision of diagnostic datasets, high-quality benchmarks, and reproducible leaderboards has spearheaded a new era in Chinese NLU, akin to GLUE’s role in English. Evaluation reveals that while modern models (BERT, RoBERTa, ALBERT) perform admirably on surface-level classification, they remain susceptible to errors in linguistic reasoning and semantic generalization, highlighting a persistent gap from human performance.
The framework makes possible systematic, empirical investigation of modeling strategies, word-level phenomena, and cross-linguistic generalization, and positions itself as a reference point in Chinese NLP scholarship. By facilitating head-to-head comparisons under rigorous controls, CLUE sets a robust precedent for ongoing advances in both foundational and applied language technology for Chinese.
In summary, CLUE as a Chinese NLU benchmark delivers a multi-task, diagnostic-rich platform with meticulously designed resources, architecture evaluation, and scientific infrastructure, enabling reproducible, linguistically meaningful assessment of state-of-the-art models for both academic research and real-world deployment (Xu et al., 2020).