
CLUE Benchmark for Classification Tasks

Updated 3 December 2025
  • CLUE benchmarks are standardized testbeds that evaluate classification models in Chinese natural language understanding and clinical NLP using well-defined data splits and metrics.
  • The Chinese CLUE suite covers multiple tasks—including single-sentence and sentence-pair classifications—with annotations from expert judgments and crowdsourcing, enhancing comparability.
  • The Clinical CLUE benchmark focuses on computational phenotyping and mortality prediction, employing micro-F1 and macro-F1 metrics to address class imbalance and ensure balanced evaluation.

The CLUE benchmark for classification tasks refers to several rigorous, standardized evaluation suites for natural language understanding (NLU) and for clinical language processing. The “CLUE” term is used for two distinct resources: (1) the Chinese Language Understanding Evaluation benchmark for Chinese NLU (Xu et al., 2020), and (2) the Clinical Language Understanding Evaluation benchmark for clinical NLP (Goodwin et al., 2022). Additionally, CLUES (Menon et al., 2022) targets classifier learning using natural language explanations for structured data. This article focuses strictly on the definition, design, methodologies, results, and significance of the CLUE benchmarks regarded as authoritative, reproducible testbeds for classification tasks.

1. Benchmark Definitions and Scope

CLUE benchmarks establish standardized environments for evaluating classification models across language, clinical, and structured-data domains.

  • Chinese CLUE (Xu et al., 2020): A large-scale evaluation framework for Chinese NLU, aggregating six core classification tasks (three single-sentence, three sentence-pair) sourced from news, app descriptions, scientific abstracts, and natural language inference.
  • Clinical CLUE (Goodwin et al., 2022): A suite of six clinical NLP tasks, including disease staging, computational phenotyping, mortality prediction, and length-of-stay regression, derived from the MIMIC-III corpus. Only computational phenotyping and mortality prediction are formulated as classification tasks.
  • Both benchmarks promote comparability, reproducibility, and methodological rigor analogous to GLUE/SuperGLUE for English, addressing data fragmentation and inconsistencies in prior evaluation schemes.

2. Task Composition and Data Splits

Chinese CLUE Classification Tasks

| Task | #Train | #Dev | #Test | Format | #Labels |
|------|--------|------|-------|--------|---------|
| TNEWS | 53,300 | 10,000 | 10,000 | Single-sentence | 15 |
| IFLYTEK | 12,100 | 2,600 | 2,600 | Single-sentence | 119 |
| CLUEWSC2020 | 1,244 | 304 | 290 | Single-sentence | 2 |
| AFQMC | 34,300 | 4,300 | 3,900 | Sentence-pair | 2 |
| CSL | 20,000 | 3,000 | 3,000 | Sentence-pair | 2 |
| OCNLI | 50,000 | 3,000 | 3,000 | Sentence-pair | 3 |
  • Task formats include single-sentence classification (topic or domain prediction, coreference resolution) and sentence-pair tasks (semantic equivalence, entailment).
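For experimentation, these tasks are also mirrored on the HuggingFace Hub; the sketch below shows how they might be loaded and inspected there. The dataset name `clue` and its configuration strings are assumptions about the community-hosted copies rather than something defined by the benchmark itself.

```python
from datasets import load_dataset

# Assumed Hub dataset name and configuration strings for the Chinese CLUE
# classification tasks (community-hosted mirrors; verify before relying on them).
for config in ["tnews", "iflytek", "cluewsc2020", "afqmc", "csl", "ocnli"]:
    ds = load_dataset("clue", config)
    sizes = {split: len(ds[split]) for split in ds}
    print(config, sizes)                                      # train/validation/test sizes
    print("  labels:", ds["train"].features["label"].names)   # ClassLabel names per task
```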

Clinical CLUE Classification Tasks

| Task | #Classes | Type | Split Ratio | Metric |
|------|----------|------|-------------|--------|
| Computational Phenotyping | ~200–300 | Multi-label | 8:1:1:1 (Train/Dev/Cal/Test) | Micro-F1 |
| Mortality Prediction | 7 | Multi-label (binary per horizon) | 8:1:1:1 (Train/Dev/Cal/Test) | Macro-F1 |
  • Computational phenotyping: Multi-label assignment using CCS (Clinical Classification Software) groupings based on ICD-9 code presence.
  • Mortality prediction: Seven independent binary labels marking mortality risk at specific clinical horizons (24h, 48h, 72h, 10d, 30d, 90d, 1yr).
  • Data origin: the MIMIC-III corpus with 2,082,284 notes from 46,520 patients; splits are stratified by patient and preserve confounder distributions (age, sex, race, ICU type, admission parameters).
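The seven-horizon mortality labeling can be expressed as a small label-construction routine. The following sketch is illustrative only: the `admit_time`/`death_time` fields are hypothetical stand-ins for the corresponding MIMIC-III admission and death timestamps, and the code is not the benchmark's reference implementation.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Horizons named in the benchmark description: 24h, 48h, 72h, 10d, 30d, 90d, 1yr.
HORIZONS = [timedelta(hours=24), timedelta(hours=48), timedelta(hours=72),
            timedelta(days=10), timedelta(days=30), timedelta(days=90),
            timedelta(days=365)]

def mortality_labels(admit_time: datetime, death_time: Optional[datetime]) -> List[int]:
    """Seven independent binary labels: 1 if the patient died within each horizon."""
    if death_time is None:
        return [0] * len(HORIZONS)
    elapsed = death_time - admit_time
    return [int(elapsed <= h) for h in HORIZONS]

# Death five days after admission is positive for the 10d, 30d, 90d, and 1yr horizons.
print(mortality_labels(datetime(2020, 1, 1), datetime(2020, 1, 6)))  # [0, 0, 0, 1, 1, 1, 1]
```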

3. Label Annotation and Ontologies

  • Chinese CLUE labels are annotated via manual categorization, expert linguistic judgments, and crowdsourcing for semantic or co-reference tasks. Categories are curated to maintain a minimum instance count, with negative instances generated using distractor sampling (CSL) or filtering to increase difficulty (TNEWS, IFLYTEK).
  • Clinical CLUE utilizes structured mapping (ICD-9 to CCS) for phenotyping and timestamped mortality flags for risk prediction, supporting clinical applicability and reproducibility by excluding admissions with insufficient narrative data and those of younger patients.
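A minimal sketch of the ICD-9-to-CCS multi-label construction is shown below; the crosswalk and label index are supplied by the caller (e.g., derived from the AHRQ CCS tables), and the two-group example is purely illustrative.

```python
from typing import Dict, Iterable, List

def phenotype_vector(icd9_codes: Iterable[str],
                     icd9_to_ccs: Dict[str, str],
                     ccs_index: Dict[str, int]) -> List[int]:
    """Map an admission's ICD-9 codes to a multi-hot vector over CCS phenotype groups."""
    vec = [0] * len(ccs_index)
    for code in icd9_codes:
        group = icd9_to_ccs.get(code)
        if group is not None and group in ccs_index:
            vec[ccs_index[group]] = 1
    return vec

# Toy two-group label space; the real CCS ontology has a few hundred groups.
crosswalk = {"4280": "Congestive heart failure", "5849": "Acute renal failure"}
index = {"Congestive heart failure": 0, "Acute renal failure": 1}
print(phenotype_vector(["4280", "V4581"], crosswalk, index))  # [1, 0]
```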

4. Evaluation Metrics

All benchmarks employ strict metric definitions for model comparison, emphasizing both overall and per-label performance, with precise mathematical formulation:

Chinese CLUE Metrics

  • Accuracy:

\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}(\hat y_i = y_i)

  • Macro-F1 (optional):

\mathrm{F1}_{\mathrm{macro}} = \frac{1}{K}\sum_{k=1}^K \mathrm{F1}_k
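These definitions coincide with the standard scikit-learn implementations, so a quick sanity check of predictions could look like the following (illustrative labels only; this is not the official evaluation script).

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # gold labels
y_pred = [0, 2, 2, 2, 1, 1]   # model predictions

print("accuracy:", accuracy_score(y_true, y_pred))             # 4/6 ≈ 0.667
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
```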

Clinical CLUE Metrics

  • Micro-F1 (phenotyping): Aggregates TP, FP, FN across all labels:

\text{Precision}_\mu = \frac{\sum_c TP_c}{\sum_c(TP_c + FP_c)}, \quad \text{Recall}_\mu = \frac{\sum_c TP_c}{\sum_c(TP_c + FN_c)}, \quad F_{1,\mu} = 2 \frac{\text{Precision}_\mu \cdot \text{Recall}_\mu}{\text{Precision}_\mu + \text{Recall}_\mu}

  • Macro-F1 (mortality): Per-horizon averages treated equally:

F_{1,\mathrm{macro}} = \frac{1}{7}\sum_h F_{1,h}
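Both clinical metrics are likewise available off the shelf for multi-hot targets; the sketch below, with toy arrays, illustrates the micro/macro distinction rather than the benchmark's official scorer.

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows are admissions, columns are labels (CCS groups or mortality horizons).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))  # pools TP/FP/FN across labels
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # averages per-label F1 equally
```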

Accuracy is the headline metric in Chinese CLUE, while Clinical CLUE prioritizes micro- and macro-F1 to make the evaluation sensitive to class imbalance and clinical priorities.

5. Baseline Architectures and Performance

Chinese CLUE Models

Nine pre-trained transformer variants (BERT, ERNIE, ALBERT, XLNet, RoBERTa) are evaluated with standard softmax classification heads:

| Model | TNEWS | IFLYTEK | WSC2020 | AFQMC | CSL | OCNLI |
|-------|-------|---------|---------|-------|-----|-------|
| BERT-base | 56.58 | 60.29 | 63.45 | 73.70 | 80.36 | 72.20 |
| RoBERTa-wwm-ext-large | 58.61 | 62.98 | 81.38 | 76.55 | 82.13 | 78.20 |

Reported accuracies show notable gaps relative to human performance, providing reproducible baselines while leaving substantial headroom for innovation in modeling and fine-tuning.
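A compressed fine-tuning recipe in the HuggingFace `transformers` style is sketched below. The Hub identifiers (`clue`/`tnews`, `bert-base-chinese`), column names, and hyperparameters are assumptions chosen for illustration, not the benchmark's official training script.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("clue", "tnews")                       # assumed Hub copy of TNEWS
tok = AutoTokenizer.from_pretrained("bert-base-chinese")

def encode(batch):
    # TNEWS is single-sentence; the "sentence" column name follows the Hub copy.
    return tok(batch["sentence"], truncation=True, max_length=128)

ds = ds.map(encode, batched=True)
num_labels = ds["train"].features["label"].num_classes   # 15 for TNEWS

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=num_labels)

args = TrainingArguments(output_dir="tnews-bert", learning_rate=2e-5,
                         num_train_epochs=3, per_device_train_batch_size=32)

Trainer(model=model, args=args, train_dataset=ds["train"],
        eval_dataset=ds["validation"], tokenizer=tok).train()
```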

Clinical CLUE Models

Transformer-based models, including variants adapted to English clinical text, are each fine-tuned on the benchmark’s prescribed splits:

| Model | Phenotyping (micro-F1) | Mortality (macro-F1) |
|-------|------------------------|----------------------|
| BERT-base | 0.75 | 0.62 |
| ClinicalBERT | 0.78 | 0.65 |
| T5-base | 0.80 | 0.67 |
| BigBird | 0.81 | 0.69 |

Hyperparameters (learning rate, epochs, batch size, sequence length) are carefully tuned, with Python-based toolkits and HuggingFace integration streamlining reproducibility.
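For the multi-label clinical tasks, the main difference lies in the classification head configuration; a minimal sketch follows, assuming float multi-hot targets and the `emilyalsentzer/Bio_ClinicalBERT` checkpoint as a stand-in for ClinicalBERT (MIMIC-III notes themselves are credentialed and not shown).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_LABELS = 7  # e.g., the seven mortality horizons

tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid per label + BCE-with-logits loss
)

notes = ["Admitted with sepsis and acute renal failure ..."]   # placeholder text
labels = torch.tensor([[0., 0., 0., 1., 1., 1., 1.]])          # multi-hot targets as floats

batch = tok(notes, truncation=True, max_length=512, return_tensors="pt")
out = model(**batch, labels=labels)
print(out.loss.item(), torch.sigmoid(out.logits))
```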

6. Best Practices, Tooling, and Challenges

  • Data Splitting: Stratification by patient and confounder preservation is essential; using prescribed splits is strongly recommended to prevent data leakage and spurious findings.
  • Toolkit Features: Both benchmarks provide pre-packaged data loaders (CSV/JSON), metric scripts, and evaluation APIs. PyCLUE (Chinese CLUE) and Clinical CLUE’s Python suite simplify deployment and comparison.
  • Metrics Reporting: Both micro- and macro-averaged metrics should be reported for multi-label tasks to reveal aggregate and per-class effects.
  • Calibration and Test Sets: The inclusion of a calibration set for threshold tuning, alongside a strictly held-out test set, is emphasized to support fair metric optimization (see the threshold-tuning sketch after this list).
  • Model Sharing: Publication of improved model weights and configurations strengthens reproducibility and collective benchmarking progress.
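The calibration-set recommendation can be made concrete with a per-label threshold sweep over predicted probabilities; the routine below is one illustrative procedure, not a prescribed part of the benchmark.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(probs_cal: np.ndarray, y_cal: np.ndarray,
                    grid=np.linspace(0.05, 0.95, 19)) -> np.ndarray:
    """Choose one decision threshold per label by maximizing F1 on the calibration set."""
    thresholds = np.empty(probs_cal.shape[1])
    for j in range(probs_cal.shape[1]):
        scores = [f1_score(y_cal[:, j], probs_cal[:, j] >= t, zero_division=0) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# Freeze the tuned thresholds before touching the held-out test set:
# y_test_pred = (probs_test >= thresholds).astype(int)
```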

7. Implications and Future Directions

The increased rigor introduced by the CLUE benchmarks for classification tasks reduces fragmentation, elevates comparability, and sets methodological standards for Chinese NLU and clinical NLP research. A plausible implication is that standardized benchmarks such as CLUE serve as catalysts for broader adoption of robust evaluation principles in languages beyond English (e.g., Chinese) and in specialized domains (e.g., clinical text). Remaining challenges include handling class imbalance, advancing state-of-the-art architectures, and developing improved metrics and calibration techniques. Per-label confusion-matrix analysis is recommended to avoid masking poor performance on rare but clinically or linguistically critical classes. The continued evolution of pretraining strategies, explainability integrations, and open community contributions will likely define the trajectory of classification research anchored in CLUE.

By consolidating clinically and linguistically motivated classification tasks, CLUE benchmarks enable transparent, reproducible, and extensible evaluation for both academic and applied modeling communities across diverse NLU domains (Goodwin et al., 2022, Xu et al., 2020).
