CHIP 2025 Shared Task 2: Discharge Medication Recommendation

Updated 16 November 2025
  • CHIP 2025 Shared Task 2 is a benchmark challenge focused on multi-label discharge medication recommendation from de-identified Chinese EHR narratives for metabolic disease management.
  • Participants employed a diverse range of techniques, including fine-tuned LLMs, CNN/RNN architectures, and ensemble methods, to improve performance as measured by Jaccard and F1 scores.
  • The challenge emphasizes robust strategies such as domain-aware prompt engineering, data augmentation, and multi-dimensional feature fusion to address label imbalance and promote patient safety.

The CHIP 2025 Shared Task 2 centered on discharge medication recommendation for metabolic diseases using real-world Chinese Electronic Health Records (EHRs). The task framed the prediction of appropriate discharge medications from heterogeneous, de-identified clinical narratives as a multi-label problem, and participating systems leveraged recent advances in LLMs and ensemble methodologies. With strong participation and measurable methodological improvements over established baselines, the competition provides a benchmark for algorithmic solutions to realistic, high-impact clinical decision support challenges in Chinese healthcare contexts (Li et al., 9 Nov 2025).

1. Problem Scope and Clinical Motivation

Discharge medication recommendation is an integral component of chronic disease management. Chronic metabolic diseases, including diabetes, hypertension, and fatty liver disease, demand long-term, multifaceted pharmacotherapy often complicated by comorbidities and polypharmacy. At the point of hospital discharge, identifying an optimal subset of medications is critical for:

  • Maintaining disease control and preventing exacerbations.
  • Reducing hospital readmission rates and associated healthcare costs.
  • Promoting patient safety by minimizing drug–drug interactions and dosing errors.

Formally, the shared task required, for each de-identified inpatient EHR containing demographic details, clinical course, laboratory values, prior medical history, and discharge diagnoses, the prediction of a subset $\hat y_i \subseteq \mathcal{D}$, where $\mathcal{D}$ is a vocabulary of 651 candidate medications. This effectively framed the problem as multi-label classification with a large, flat label space.
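
As a concrete illustration of this formulation, the following minimal sketch (with hypothetical function and variable names) encodes one record's gold or predicted medication set as a binary indicator vector over the drug vocabulary $\mathcal{D}$:

```python
import numpy as np

def encode_labels(prescribed: set[str], vocabulary: list[str]) -> np.ndarray:
    """Encode one record's discharge medication set as a 0/1 indicator
    vector over the fixed drug vocabulary (the multi-label target)."""
    index = {drug: i for i, drug in enumerate(vocabulary)}
    y = np.zeros(len(vocabulary), dtype=np.int8)
    for drug in prescribed:
        if drug in index:  # anything outside the candidate list is ignored
            y[index[drug]] = 1
    return y

# Toy vocabulary for illustration; the real task uses 651 candidate drugs.
vocab = ["metformin", "atorvastatin", "amlodipine", "insulin glargine"]
print(encode_labels({"metformin", "atorvastatin"}, vocab))  # -> [1 1 0 0]
```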

2. Dataset Construction: CDrugRed

CDrugRed—introduced for CHIP 2025 Shared Task 2—comprises 5,894 hospitalization records from 3,190 unique patients (spanning 2013–2023) at a top-tier tertiary hospital in China. Its key features are summarized below:

| Aspect | Description | Notes |
| --- | --- | --- |
| Records | 5,894 hospitalizations from 3,190 patients | Patient-level split to prevent leakage |
| Drug vocabulary | 651 distinct, most-used metabolic disease medications | No hierarchical ontology provided |
| Input modalities | EHR text: demographics, vitals, admission info, labs, etc. | Structured clinical and narrative text |
| De-identification | Automated PHI masking (large-model NER), normalization, clinician review | Ensured privacy and clinical term consistency |
| Data partitioning | 60% train, 10% val, 30% test (by patient, not by record) | Standardized split across participants |

Candidate medications in the dataset are presented as a flat list; participants could, however, engineer auxiliary higher-level groupings (e.g., statins, antihypertensives) for feature construction.
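
The 60/10/30 partition is performed by patient rather than by record, so that no patient's hospitalizations leak across splits. The organizers' exact procedure is not published; a minimal sketch of such a grouped split using scikit-learn's GroupShuffleSplit (the record and patient-ID containers are hypothetical) could look like this:

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(records, patient_ids, seed=42):
    """Split records roughly 60/10/30 by patient so that no patient
    appears in more than one of train/val/test."""
    # Carve off ~30% of patients (and all their records) as the test set.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    trainval_idx, test_idx = next(outer.split(records, groups=patient_ids))

    # Split the remaining ~70% into ~60/10 overall (1/7 of train+val -> val).
    tv_groups = [patient_ids[i] for i in trainval_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=1 / 7, random_state=seed)
    tr_rel, val_rel = next(inner.split(trainval_idx, groups=tv_groups))

    train_idx = [trainval_idx[i] for i in tr_rel]
    val_idx = [trainval_idx[i] for i in val_rel]
    return train_idx, val_idx, list(test_idx)
```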

3. Evaluation Metrics

Given the multi-label nature, the task adopted a comprehensive suite of metrics for single-example (per-record) and macro-averaged performance:

  • Jaccard Index:

$$\mathrm{Jaccard}(y_i,\hat y_i)=\frac{|y_i\cap \hat y_i|}{|y_i\cup \hat y_i|}$$

  • Precision:

$$\mathrm{Precision}(y_i,\hat y_i)=\frac{|y_i\cap \hat y_i|}{|\hat y_i|}$$

  • Recall:

$$\mathrm{Recall}(y_i,\hat y_i)=\frac{|y_i\cap \hat y_i|}{|y_i|}$$

  • F1 Score:

$$\mathrm{F1}(y_i,\hat y_i)=2\,\frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$

  • Macro-averaged metrics over the $N$ records: $\mathrm{MacroP}$, $\mathrm{MacroR}$, $\mathrm{MacroF1}$.
  • Leaderboard Composite Score:

$$\mathrm{Score}=\frac{1}{2}\left(\mathrm{Jaccard}_{\mathrm{avg}}+\mathrm{MacroF1}\right)$$
Jaccard directly quantifies set overlap, while F1 balances precision (mitigating over-prescription) and recall (avoiding omission of necessary medications).
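
A minimal sketch of these per-record metrics and the composite leaderboard score, operating directly on gold and predicted medication sets (macro averages are taken over records, as in the task definition; function names are illustrative):

```python
def record_metrics(gold: set[str], pred: set[str]) -> dict[str, float]:
    """Per-record Jaccard, precision, recall, and F1 for one prediction."""
    inter = len(gold & pred)
    union = len(gold | pred)
    jaccard = inter / union if union else 1.0          # both sets empty
    precision = inter / len(pred) if pred else 0.0
    recall = inter / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"jaccard": jaccard, "precision": precision,
            "recall": recall, "f1": f1}

def leaderboard_score(golds: list[set[str]], preds: list[set[str]]) -> float:
    """Composite score: mean of the average Jaccard and the macro F1,
    both averaged over the N records."""
    per_record = [record_metrics(g, p) for g, p in zip(golds, preds)]
    jaccard_avg = sum(m["jaccard"] for m in per_record) / len(per_record)
    macro_f1 = sum(m["f1"] for m in per_record) / len(per_record)
    return 0.5 * (jaccard_avg + macro_f1)
```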

4. Methodological Landscape and Participant Solutions

Participants employed a spectrum of natural language processing and machine learning methods, summarized as follows:

| Approach | Description |
| --- | --- |
| TF-IDF + Logistic Regression | Classical baselines for sparse, bag-of-words features |
| Deep Neural Networks | CNN and RNN architectures for sequential modeling of input narratives |
| Fine-tuned LLMs (Qwen, GLM4-9B-Chat) | Pretrained transformers with LoRA or full supervised fine-tuning on CDrugRed |
| Prompt and Retrieval-Augmented Generation | Input prompts engineered for better coverage and controlled decoding |
| Data Augmentation | Order perturbation, pseudo labeling, oversampling of rare-drug examples |
| Ensemble Methods | Temperature-diverse decoding, hierarchical and weighted voting among model outputs |
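
The report does not prescribe a particular fine-tuning recipe; as a hedged illustration of the LoRA-based supervised fine-tuning listed above, a setup using the Hugging Face transformers and peft libraries might look as follows (the checkpoint name and every hyperparameter are assumptions, not any participant's actual configuration):

```python
# Illustrative LoRA setup; the checkpoint and hyperparameters are assumptions,
# not the configuration used by any CHIP 2025 participant.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-7B-Instruct"   # hypothetical Qwen checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapters are trainable

# Supervised fine-tuning would then run over prompt/medication-list pairs
# built from CDrugRed records, e.g. with a standard causal-LM training loop.
```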

The top-ranked team (DeepDrug) delivered a generative Qwen LLM-based framework featuring:

  • Multi-dimensional feature augmentation including drug-class annotations, explicit patient meta-features (e.g., age, BMI bin, comorbidities), and disease–drug co-occurrence as auxiliary prompts.
  • Input robustness via stochastic order perturbation of diagnosis and medication lists during training.
  • Fusion of multiple Qwen LLMs of varying sizes, integrated via a hierarchical weighted voting scheme with adaptive model weights calibrated on validation Jaccard; a sketch of such weighted voting is given below.

These innovations contributed to the team's leaderboard-leading Score of 0.5685 (Jaccard = 0.5102, F1 = 0.6267).
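
DeepDrug's fusion code is not published; the following is a minimal sketch, under assumed conventions, of weighted voting over per-model predicted drug sets, with weights derived from each model's validation Jaccard and a simple acceptance threshold:

```python
def weighted_vote(model_preds: list[set[str]],
                  model_weights: list[float],
                  threshold: float = 0.5) -> set[str]:
    """Fuse per-model predicted drug sets by weighted voting.

    model_weights would be derived from each model's validation Jaccard;
    a drug is kept if the normalized weight mass of the models that
    predict it reaches the threshold.
    """
    total = sum(model_weights)
    scores: dict[str, float] = {}
    for preds, w in zip(model_preds, model_weights):
        for drug in preds:
            scores[drug] = scores.get(drug, 0.0) + w / total
    return {drug for drug, s in scores.items() if s >= threshold}

# Hypothetical usage with three models of different sizes.
fused = weighted_vote(
    [{"metformin", "atorvastatin"}, {"metformin"}, {"metformin", "amlodipine"}],
    model_weights=[0.52, 0.48, 0.50],   # e.g. validation Jaccard per model
)
print(fused)  # -> {'metformin'} with the default 0.5 threshold
```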

5. Competition Results and Empirical Findings

Results from both the validation (Phase A) and test (Phase B) leaderboards are as follows:

| Team / Method | Jaccard | F1 | Score | Δ vs. baseline (Jaccard, F1) |
| --- | --- | --- | --- | --- |
| Baseline: GLM4-9B-Chat + LoRA (Phase A) | 0.4444 | 0.5621 | 0.5032 | |
| DeepDrug (Phase A) | 0.5102 | 0.6267 | 0.5685 | +6.58%, +6.45% |
| Baseline (Phase B test) | 0.4477 | 0.5648 | 0.5062 | |
| DeepDrug (Phase B) | 0.5102 | 0.6267 | 0.5685 | +6.25%, +6.21% |

Top 10 teams’ scores were tightly distributed (0.5685–0.5226), indicating a competitive landscape and highlighting incremental, domain-specific advances.

Critical factors underpinning successful submissions included:

  • Domain-aware prompt engineering and enriched input features (metadata and expert-constructed drug groupings).
  • Data-centric strategies addressing rare-class imbalance and robustness to input ordering (an order-perturbation sketch follows this list).
  • Application of ensemble techniques (weighted voting, two-stage fusion) to balance bias and variance.
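
As an illustration of the ordering-robustness point above, here is a minimal sketch of stochastic order perturbation applied to list-valued fields before a training prompt is built (the prompt template and field names are hypothetical, not the dataset's schema):

```python
import random

def perturb_order(diagnoses: list[str], history_meds: list[str],
                  rng: random.Random) -> tuple[list[str], list[str]]:
    """Randomly shuffle list-valued fields so the model does not overfit
    to a fixed ordering of diagnoses or prior medications."""
    diag, meds = diagnoses[:], history_meds[:]
    rng.shuffle(diag)
    rng.shuffle(meds)
    return diag, meds

def build_prompt(record: dict, rng: random.Random) -> str:
    """Hypothetical prompt template; the field names are illustrative only."""
    diag, meds = perturb_order(record["diagnoses"], record["history_meds"], rng)
    return (
        f"Patient: {record['age']}-year-old, BMI {record['bmi']}.\n"
        f"Discharge diagnoses: {', '.join(diag)}.\n"
        f"Prior medications: {', '.join(meds)}.\n"
        "Recommend the discharge medications:"
    )
```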

Identified weaknesses included reduced recall for infrequently prescribed drugs, input order/format sensitivity, and occasional model hallucinations of drugs absent from the candidate vocabulary. The last was mitigated through post-hoc candidate filtering.
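
Such candidate filtering can be as simple as intersecting the generated names with the 651-drug vocabulary; a minimal sketch (assuming the raw generation has already been parsed into a list of names):

```python
def filter_to_vocabulary(generated: list[str], vocabulary: set[str]) -> list[str]:
    """Drop hallucinated drug names that are not in the candidate vocabulary,
    deduplicating while preserving generation order."""
    seen: set[str] = set()
    kept: list[str] = []
    for drug in generated:
        name = drug.strip()
        if name in vocabulary and name not in seen:
            seen.add(name)
            kept.append(name)
    return kept

# Hypothetical example: "examplamycin" is not a candidate drug and is removed.
vocab = {"metformin", "atorvastatin", "amlodipine"}
print(filter_to_vocabulary(["metformin", "examplamycin", "metformin"], vocab))
# -> ['metformin']
```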

6. Remaining Challenges and Prospects for Development

Persistent challenges in algorithmic medication recommendation emerge from:

  • Heterogeneity and noise within EHR narratives, varying across patients and hospital departments.
  • High inter-patient variability (comorbidities, polypharmacy, individualized regimens).
  • Pronounced label imbalance due to skew in drug prescription frequencies.
  • Generalization limitations across healthcare institutions with divergent data schemas.

Proposed avenues for future research and dataset/benchmark expansion include:

  • Scaling CDrugRed to incorporate multi-center data and broadening scope beyond metabolic diseases.
  • Integrating multi-modal clinical data types (e.g., images, laboratory values) to enrich patient representation.
  • Evolving the prediction target from medication name lists to full prescription regimens, encompassing dosage, frequency, and route.
  • Developing explainable models offering clinical rationales, for example by highlighting salient laboratory results or diagnoses.
  • Embedding external pharmacological knowledge bases to enhance safety via automated drug-interaction checks.

7. Significance and Benchmark Status

CHIP 2025 Shared Task 2 and its CDrugRed corpus establish a high-quality benchmark for multi-label medication recommendation aligned with real-world Chinese hospital workflows. The task demonstrates empirically that fine-tuning LLMs with explicitly engineered features, data augmentation, and robust ensembling can achieve superior performance (up to a Jaccard of 0.5102 and F1 of 0.6267) compared to prior baselines. Nevertheless, the findings underscore unresolved issues of model generalization, explainability, and rare-drug detection, setting a clear research agenda for future clinical NLP studies in the Chinese context (Li et al., 9 Nov 2025).
