CHIP 2025 Shared Task 2: Discharge Medication Recommendation

Updated 16 November 2025
  • CHIP 2025 Shared Task 2 is a benchmark challenge focused on multi-label discharge medication recommendation from de-identified Chinese EHR narratives for metabolic disease management.
  • Participants employed a diverse range of techniques, including fine-tuned LLMs, CNN/RNN architectures, and ensemble methods, to improve performance as measured by Jaccard and F1 scores.
  • The challenge emphasizes robust strategies such as domain-aware prompt engineering, data augmentation, and multi-dimensional feature fusion to address label imbalance and promote patient safety.

The CHIP 2025 Shared Task 2 centered on discharge medication recommendation for metabolic diseases using real-world Chinese Electronic Health Records (EHRs). The task framed the prediction of appropriate discharge medications from heterogeneous, de-identified clinical narratives as a multi-label problem, and participating systems leveraged recent advances in LLMs and ensemble methodologies. With strong participation and measurable methodological improvements over established baselines, the competition provides a benchmark for algorithmic solutions to realistic, high-impact clinical decision support challenges in Chinese healthcare contexts (Li et al., 9 Nov 2025).

1. Problem Scope and Clinical Motivation

Discharge medication recommendation is an integral component of chronic disease management. Chronic metabolic diseases, including diabetes, hypertension, and fatty liver disease, demand long-term, multifaceted pharmacotherapy often complicated by comorbidities and polypharmacy. At the point of hospital discharge, identifying an optimal subset of medications is critical for:

  • Maintaining disease control and preventing exacerbations.
  • Reducing hospital readmission rates and associated healthcare costs.
  • Promoting patient safety by minimizing drug–drug interactions and dosing errors.

Formally, the shared task required, for each de-identified inpatient EHR containing demographic details, clinical course, laboratory values, prior medical history, and discharge diagnoses, the prediction of a subset $\hat y_i \subseteq \mathcal{D}$, where $\mathcal{D}$ is a vocabulary of 651 candidate medications. This effectively framed the problem as multi-label classification with a large, flat label space.
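
As a concrete illustration of this formulation, the following minimal sketch (with hypothetical function and variable names) encodes one record's gold or predicted medication set as a binary indicator vector over the drug vocabulary $\mathcal{D}$:

```python
import numpy as np

def encode_labels(prescribed: set[str], vocabulary: list[str]) -> np.ndarray:
    """Encode one record's discharge medication set as a 0/1 indicator
    vector over the fixed drug vocabulary (the multi-label target)."""
    index = {drug: i for i, drug in enumerate(vocabulary)}
    y = np.zeros(len(vocabulary), dtype=np.int8)
    for drug in prescribed:
        if drug in index:  # anything outside the candidate list is ignored
            y[index[drug]] = 1
    return y

# Toy vocabulary for illustration; the real task uses 651 candidate drugs.
vocab = ["metformin", "atorvastatin", "amlodipine", "insulin glargine"]
print(encode_labels({"metformin", "atorvastatin"}, vocab))  # -> [1 1 0 0]
```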

2. Dataset Construction: CDrugRed

CDrugRed—introduced for CHIP 2025 Shared Task 2—comprises 5,894 hospitalization records from 3,190 unique patients (spanning 2013–2023) at a top-tier tertiary hospital in China. Its key features are summarized below:

| Aspect | Description | Notes |
| --- | --- | --- |
| Records | 5,894 hospitalizations from 3,190 patients | Patient-level split to prevent leakage |
| Drug vocabulary | 651 distinct, most-used metabolic disease medications | No hierarchical ontology provided |
| Input modalities | EHR text: demographics, vitals, admission info, labs, etc. | Structured clinical and narrative text |
| De-identification | Automated PHI masking (large-model NER), normalization, clinician review | Ensured privacy and clinical term consistency |
| Data partitioning | 60% train, 10% val, 30% test (by patient, not by record) | Standardized split across participants |

Candidate medications in the dataset are presented as a flat list; participants could, however, engineer auxiliary higher-level groupings (e.g., statins, antihypertensives) for feature construction.
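
The 60/10/30 partition is performed by patient rather than by record, so that no patient's hospitalizations leak across splits. The organizers' exact procedure is not published; a minimal sketch of such a grouped split using scikit-learn's GroupShuffleSplit (the record and patient-ID containers are hypothetical) could look like this:

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(records, patient_ids, seed=42):
    """Split records roughly 60/10/30 by patient so that no patient
    appears in more than one of train/val/test."""
    # Carve off ~30% of patients (and all their records) as the test set.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    trainval_idx, test_idx = next(outer.split(records, groups=patient_ids))

    # Split the remaining ~70% into ~60/10 overall (1/7 of train+val -> val).
    tv_groups = [patient_ids[i] for i in trainval_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=1 / 7, random_state=seed)
    tr_rel, val_rel = next(inner.split(trainval_idx, groups=tv_groups))

    train_idx = [trainval_idx[i] for i in tr_rel]
    val_idx = [trainval_idx[i] for i in val_rel]
    return train_idx, val_idx, list(test_idx)
```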

3. Evaluation Metrics

Given the multi-label nature, the task adopted a comprehensive suite of metrics for single-example (per-record) and macro-averaged performance:

  • Jaccard Index:

$$\mathrm{Jaccard}(y_i,\hat y_i)=\frac{|y_i\cap \hat y_i|}{|y_i\cup \hat y_i|}$$

  • Precision:

$$\mathrm{Precision}(y_i,\hat y_i)=\frac{|y_i\cap \hat y_i|}{|\hat y_i|}$$

  • Recall:

$$\mathrm{Recall}(y_i,\hat y_i)=\frac{|y_i\cap \hat y_i|}{|y_i|}$$

  • F1 Score:

$$\mathrm{F1}(y_i,\hat y_i)=2\,\frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$

  • Macro-averaged metrics over the $N$ records: $\mathrm{MacroP}$, $\mathrm{MacroR}$, $\mathrm{MacroF1}$.
  • Leaderboard Composite Score:

$$\mathrm{Score}=\frac{1}{2}\left(\mathrm{Jaccard}_{\mathrm{avg}}+\mathrm{MacroF1}\right)$$
Jaccard directly quantifies set overlap, while F1 balances precision (mitigating over-prescription) and recall (avoiding omission of necessary medications).
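
A minimal sketch of these per-record metrics and the composite leaderboard score, operating directly on gold and predicted medication sets (macro averages are taken over records, as in the task definition; function names are illustrative):

```python
def record_metrics(gold: set[str], pred: set[str]) -> dict[str, float]:
    """Per-record Jaccard, precision, recall, and F1 for one prediction."""
    inter = len(gold & pred)
    union = len(gold | pred)
    jaccard = inter / union if union else 1.0          # both sets empty
    precision = inter / len(pred) if pred else 0.0
    recall = inter / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"jaccard": jaccard, "precision": precision,
            "recall": recall, "f1": f1}

def leaderboard_score(golds: list[set[str]], preds: list[set[str]]) -> float:
    """Composite score: mean of the average Jaccard and the macro F1,
    both averaged over the N records."""
    per_record = [record_metrics(g, p) for g, p in zip(golds, preds)]
    jaccard_avg = sum(m["jaccard"] for m in per_record) / len(per_record)
    macro_f1 = sum(m["f1"] for m in per_record) / len(per_record)
    return 0.5 * (jaccard_avg + macro_f1)
```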

4. Methodological Landscape and Participant Solutions

Participants employed a spectrum of natural language processing and machine learning methods, summarized as follows:

| Approach | Description |
| --- | --- |
| TF-IDF + Logistic Regression | Classical baselines for sparse, bag-of-words features |
| Deep Neural Networks | CNN and RNN architectures for sequential modeling of input narratives |
| Fine-tuned LLMs (Qwen, GLM4-9B-Chat) | Pretrained transformers with LoRA or full supervised fine-tuning on CDrugRed |
| Prompt and Retrieval-Augmented Generation | Input prompts engineered for better coverage and controlled decoding |
| Data Augmentation | Order perturbation, pseudo labeling, oversampling of rare-drug examples |
| Ensemble Methods | Temperature-diverse decoding, hierarchical and weighted voting among model outputs |
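
The report does not prescribe a particular fine-tuning recipe; as a hedged illustration of the LoRA-based supervised fine-tuning listed above, a setup using the Hugging Face transformers and peft libraries might look as follows (the checkpoint name and every hyperparameter are assumptions, not any participant's actual configuration):

```python
# Illustrative LoRA setup; the checkpoint and hyperparameters are assumptions,
# not the configuration used by any CHIP 2025 participant.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-7B-Instruct"   # hypothetical Qwen checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapters are trainable

# Supervised fine-tuning would then run over prompt/medication-list pairs
# built from CDrugRed records, e.g. with a standard causal-LM training loop.
```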

The top-ranked team (DeepDrug) delivered a generative Qwen LLM-based framework featuring:

  • Multi-dimensional feature augmentation including drug-class annotations, explicit patient meta-features (e.g., age, BMI bin, comorbidities), and disease–drug co-occurrence as auxiliary prompts.
  • Input robustness via stochastic order perturbation of diagnosis and medication lists during training.
  • Fusion of multiple Qwen LLMs of varying sizes, integrated via a hierarchical weighted voting scheme with adaptive model weights calibrated on validation Jaccard; a sketch of such weighted voting is given below.

These innovations contributed to the team's leaderboard-leading Score of 0.5685 (Jaccard = 0.5102, F1 = 0.6267).
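
DeepDrug's fusion code is not published; the following is a minimal sketch, under assumed conventions, of weighted voting over per-model predicted drug sets, with weights derived from each model's validation Jaccard and a simple acceptance threshold:

```python
def weighted_vote(model_preds: list[set[str]],
                  model_weights: list[float],
                  threshold: float = 0.5) -> set[str]:
    """Fuse per-model predicted drug sets by weighted voting.

    model_weights would be derived from each model's validation Jaccard;
    a drug is kept if the normalized weight mass of the models that
    predict it reaches the threshold.
    """
    total = sum(model_weights)
    scores: dict[str, float] = {}
    for preds, w in zip(model_preds, model_weights):
        for drug in preds:
            scores[drug] = scores.get(drug, 0.0) + w / total
    return {drug for drug, s in scores.items() if s >= threshold}

# Hypothetical usage with three models of different sizes.
fused = weighted_vote(
    [{"metformin", "atorvastatin"}, {"metformin"}, {"metformin", "amlodipine"}],
    model_weights=[0.52, 0.48, 0.50],   # e.g. validation Jaccard per model
)
print(fused)  # -> {'metformin'} with the default 0.5 threshold
```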

5. Competition Results and Empirical Findings

Results from both the validation (Phase A) and test (Phase B) leaderboards are as follows:

| Team / Method | Jaccard | F1 | Score | Δ vs. baseline (Jaccard, F1) |
| --- | --- | --- | --- | --- |
| Baseline: GLM4-9B-Chat + LoRA (Phase A) | 0.4444 | 0.5621 | 0.5032 | |
| DeepDrug (Phase A) | 0.5102 | 0.6267 | 0.5685 | +6.58%, +6.45% |
| Baseline (Phase B test) | 0.4477 | 0.5648 | 0.5062 | |
| DeepDrug (Phase B) | 0.5102 | 0.6267 | 0.5685 | +6.25%, +6.21% |

Top 10 teams’ scores were tightly distributed (0.5685–0.5226), indicating a competitive landscape and highlighting incremental, domain-specific advances.

Critical factors underpinning successful submissions included:

  • Domain-aware prompt engineering and enriched input features (metadata and expert-constructed drug groupings).
  • Data-centric strategies addressing rare-class imbalance and robustness to input ordering (an order-perturbation sketch follows this list).
  • Application of ensemble techniques (weighted voting, two-stage fusion) to balance bias and variance.
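
As an illustration of the ordering-robustness point above, here is a minimal sketch of stochastic order perturbation applied to list-valued fields before a training prompt is built (the prompt template and field names are hypothetical, not the dataset's schema):

```python
import random

def perturb_order(diagnoses: list[str], history_meds: list[str],
                  rng: random.Random) -> tuple[list[str], list[str]]:
    """Randomly shuffle list-valued fields so the model does not overfit
    to a fixed ordering of diagnoses or prior medications."""
    diag, meds = diagnoses[:], history_meds[:]
    rng.shuffle(diag)
    rng.shuffle(meds)
    return diag, meds

def build_prompt(record: dict, rng: random.Random) -> str:
    """Hypothetical prompt template; the field names are illustrative only."""
    diag, meds = perturb_order(record["diagnoses"], record["history_meds"], rng)
    return (
        f"Patient: {record['age']}-year-old, BMI {record['bmi']}.\n"
        f"Discharge diagnoses: {', '.join(diag)}.\n"
        f"Prior medications: {', '.join(meds)}.\n"
        "Recommend the discharge medications:"
    )
```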

Identified weaknesses included reduced recall for infrequently prescribed drugs, input order/format sensitivity, and occasional model hallucinations of drugs absent from the candidate vocabulary. The last was mitigated through post-hoc candidate filtering.
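
Such candidate filtering can be as simple as intersecting the generated names with the 651-drug vocabulary; a minimal sketch (assuming the raw generation has already been parsed into a list of names):

```python
def filter_to_vocabulary(generated: list[str], vocabulary: set[str]) -> list[str]:
    """Drop hallucinated drug names that are not in the candidate vocabulary,
    deduplicating while preserving generation order."""
    seen: set[str] = set()
    kept: list[str] = []
    for drug in generated:
        name = drug.strip()
        if name in vocabulary and name not in seen:
            seen.add(name)
            kept.append(name)
    return kept

# Hypothetical example: "examplamycin" is not a candidate drug and is removed.
vocab = {"metformin", "atorvastatin", "amlodipine"}
print(filter_to_vocabulary(["metformin", "examplamycin", "metformin"], vocab))
# -> ['metformin']
```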

6. Remaining Challenges and Prospects for Development

Persistent challenges in algorithmic medication recommendation emerge from:

  • Heterogeneity and noise within EHR narratives, varying across patients and hospital departments.
  • High inter-patient variability (comorbidities, polypharmacy, individualized regimens).
  • Pronounced label imbalance due to skew in drug prescription frequencies.
  • Generalization limitations across healthcare institutions with divergent data schemas.

Proposed avenues for future research and dataset/benchmark expansion include:

  • Scaling CDrugRed to incorporate multi-center data and broadening scope beyond metabolic diseases.
  • Integrating multi-modal clinical data types (e.g., images, laboratory values) to enrich patient representation.
  • Evolving the prediction target from medication name lists to full prescription regimens, encompassing dosage, frequency, and route.
  • Developing explainable models offering clinical rationales, for example by highlighting salient laboratory results or diagnoses.
  • Embedding external pharmacological knowledge bases to enhance safety via automated drug-interaction checks.

7. Significance and Benchmark Status

CHIP 2025 Shared Task 2 and its CDrugRed corpus establish a high-quality benchmark for multi-label medication recommendation aligned with real-world Chinese hospital workflows. The task demonstrates empirically that fine-tuning LLMs with explicitly engineered features, data augmentation, and robust ensembling can achieve superior performance (up to a Jaccard of 0.5102 and F1 of 0.6267) compared to prior baselines. Nevertheless, the findings underscore unresolved issues of model generalization, explainability, and rare-drug detection, setting a clear research agenda for future clinical NLP studies in the Chinese context (Li et al., 9 Nov 2025).
