MedDialog-CN: Chinese Medical Dialogue Dataset
- MedDialog-CN is a large-scale Chinese medical dialogue dataset capturing over 1.1 million authentic patient–doctor consultations from online healthcare platforms.
- It is organized hierarchically with detailed structural annotations for patient history, dialogue exchanges, and optional doctor diagnoses, supporting diverse machine learning tasks.
- Spanning 29 medical specialties over a decade, the dataset enables robust telemedicine system development, bias reduction, and natural language processing research.
MedDialog-CN is a large-scale Chinese medical dialogue dataset explicitly constructed to advance research and development of medical dialogue systems, particularly for telemedicine applications. Recognized for its unprecedented scale, coverage, and detailed structural annotation, it serves as a foundational resource for machine learning and natural language processing in clinical conversational scenarios. The dataset encompasses over 1.1 million consultations and nearly 4 million utterances collected from contemporary online medical platforms, capturing ten years of authentic patient–doctor interactions.
1. Data Composition and Structure
MedDialog-CN comprises 1,145,231 patient–doctor consultations and a total of 3,959,333 utterances (2,179,008 doctor, 1,780,325 patient). The corpus is organized hierarchically as follows:
- Consultation-level Structure
  - Patient Condition & History: Contains detailed fields, including present disease, disease duration, medications, allergies, past diseases, and explicit help requests. This segment introduces context for each dialogue, mirroring pre-consultation forms encountered in clinical triage.
  - Dialogue: Interleaved utterances between patient and doctor. Each exchange is segmented, with some consecutive utterances by the same speaker recorded as individual entries.
  - Optional Diagnosis and Treatment: For a subset of samples, the doctor provides diagnostic conclusions and therapeutic suggestions post dialogue.
Merging consecutive utterances by the same speaker into a single turn yields 3,209,660 composite utterances (1,981,844 doctor, 1,227,816 patient), which simplifies turn-based conversational modeling.
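The hierarchy and the merging step described above can be sketched in Python. This is a minimal illustration, not the dataset's released schema or tooling: the class names, field names (`description`, `dialogue`, `diagnosis`), speaker labels, and the sample dialogue are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import List, Optional


@dataclass
class Utterance:
    speaker: str  # assumed labels: "patient" or "doctor"
    text: str


@dataclass
class Consultation:
    # Hypothetical field names; the released files may be laid out differently.
    description: str                                   # patient condition and history
    dialogue: List[Utterance] = field(default_factory=list)
    diagnosis: Optional[str] = None                    # present only for a subset


def merge_consecutive(dialogue: List[Utterance], sep: str = " ") -> List[Utterance]:
    """Collapse runs of consecutive utterances by the same speaker into one turn."""
    merged = []
    for speaker, run in groupby(dialogue, key=lambda u: u.speaker):
        merged.append(Utterance(speaker, sep.join(u.text for u in run)))
    return merged


# Invented sample, for illustration only.
example = Consultation(
    description="Cough for three days, no known allergies.",
    dialogue=[
        Utterance("patient", "I have had a cough for three days."),
        Utterance("patient", "There is no fever."),
        Utterance("doctor", "Is the cough dry or productive?"),
        Utterance("patient", "Dry."),
    ],
)
turns = merge_consecutive(example.dialogue)
```

Here the two opening patient utterances collapse into one turn, so four raw utterances become three merged turns. The separator is a free choice; for Chinese text an empty string or a full-width punctuation mark may be more appropriate than a space.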
2. Scale, Diversity, and Specialty Coverage
MedDialog-CN is distinguished by its breadth: consultations are mapped to 29 broad medical specialties and 172 fine-grained categories, spanning domains from internal medicine to pediatrics, dentistry, and beyond. The corpus is also diverse along two further dimensions:
- Temporal: Dialogues cover the period from 2010 to 2020, giving a full decade of coverage.
- Geographic: Data sources encompass all 31 provincial-level administrative divisions in China, minimizing regional sampling bias and reflecting a comprehensive patient demographic.
This diversity both mitigates skew from local clinical practices and supports generalizable model training.
3. Technical Representation and Annotation Guidelines
MedDialog-CN is organized to facilitate both supervised and unsupervised tasks, but its annotation is structural rather than semantic: it does not include the intent/action slot labels or multi-stage sub-utterance annotations typical of more recent multi-domain resources (cf. ReMeDi (Yan et al., 2021)). Its technical profile is summarized by utterance counts, specialty categorization, and hierarchical sectioning:
Statistic | Count | Doctor | Patient |
---|---|---|---|
Consultations | 1,145,231 | - | - |
Utterances (raw) | 3,959,333 | 2,179,008 | 1,780,325 |
Utterances (merged turns) | 3,209,660 | 1,981,844 | 1,227,816 |
Specialty categories | 29 | - | - |
Fine-grained specialties | 172 | - | - |
Dialogues are presented with field-specific context but lack token-level or utterance-level annotations for named entities or structured dialog acts, limiting applicability for fine-grained NLU benchmark construction without further preprocessing.
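As a quick consistency check on the table, the per-speaker counts can be summed and turned into a few derived ratios. The sketch below uses only the figures reported above; the function and variable names are illustrative.

```python
# Counts copied from the statistics table above.
CONSULTATIONS = 1_145_231
RAW = {"doctor": 2_179_008, "patient": 1_780_325}
MERGED = {"doctor": 1_981_844, "patient": 1_227_816}


def summarize(counts: dict, n_consultations: int) -> dict:
    """Return the total, the mean per consultation, and the doctor share."""
    total = sum(counts.values())
    return {
        "total": total,
        "per_consultation": total / n_consultations,
        "doctor_share": counts["doctor"] / total,
    }


raw_summary = summarize(RAW, CONSULTATIONS)
merged_summary = summarize(MERGED, CONSULTATIONS)

# The per-speaker counts add up to the reported totals.
assert raw_summary["total"] == 3_959_333
assert merged_summary["total"] == 3_209_660
```

On these figures a consultation averages about 3.5 raw utterances, or about 2.8 merged turns, so most dialogues are short; models that rely on long conversational context may see limited benefit without the patient history field.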
4. Data Source Authenticity and Collection Protocol
The dataset was sourced from haodf.com, a leading online healthcare platform in China, providing real patient–doctor interactions rather than simulated or crowd-sourced conversations. This method ensures clinical authenticity, capturing linguistic patterns, diagnostic rationale, and conversational etiquette characteristic of genuine consultations.
Data sampled from 2010 to 2020 reflect evolving medical practices, patient concerns, and regulatory shifts, enabling longitudinal studies of dialogue trends and changes in patient communication behavior.
5. Application Domains and Research Utility
MedDialog-CN is tailored for a wide array of research areas including:
- Dialogue System Training: Machine learning models, such as sequence-to-sequence architectures and transformer-based deep neural networks, are commonly trained to generate contextually and medically appropriate responses using MedDialog-CN as the grounding corpus.
- Virtual Doctor Algorithms: The dataset can support virtual doctor deployment, triage, and patient follow-up systems, which aim to reduce physician workload and help prevent burnout.
- Telemedicine Platforms: Remote consultation protocols utilize MedDialog-CN to refine conversational agents, optimizing informational sufficiency and patient safety.
- Health Communication Analytics: Researchers employ the dataset to analyze conversational dynamics, study adherence cues, and explore engagement patterns.
- Bias Minimization: Rich regional and temporal representation reduces overfitting and sample bias, facilitating the development of robust dialogue policies and generative models generalizable across populations.
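For the dialogue system training use case, a common preprocessing step is to turn each consultation into (context, response) pairs, where the target is a doctor turn and the context is the patient description plus the most recent preceding turns. The following is a minimal sketch under stated assumptions: the `build_pairs` helper, the `[SEP]` separator string, the speaker labels, and the sample turns are hypothetical and not part of any released MedDialog-CN tooling.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, text); speaker labels are assumed


def build_pairs(
    description: str,
    turns: List[Turn],
    max_context: int = 4,
    sep: str = " [SEP] ",
) -> List[Tuple[str, str]]:
    """Create one (context, response) pair per doctor turn.

    The context always starts with the patient description, followed by
    at most `max_context` of the turns immediately preceding the response.
    """
    if max_context < 1:
        raise ValueError("max_context must be at least 1")
    pairs = []
    history: List[str] = []
    for speaker, text in turns:
        if speaker == "doctor":
            context = sep.join([description] + history[-max_context:])
            pairs.append((context, text))
        history.append(text)
    return pairs


# Invented sample, for illustration only.
sample_turns = [
    ("patient", "Dry cough for three days."),
    ("doctor", "Any fever?"),
    ("patient", "No."),
    ("doctor", "Drink fluids and monitor."),
]
pairs = build_pairs("Cough, 3 days.", sample_turns, max_context=2)
```

Each pair can then be tokenized and fed to a sequence-to-sequence or decoder-only model. Keeping the description at the front of every context is one design choice among several; truncating from the left instead would drop it once the dialogue grows long.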
6. Limitations and Comparison with Contemporary Resources
Unlike resources such as ReMeDi (Yan et al., 2021), which feature multi-domain, multi-service coverage with sub-utterance-level intent and slot labeling and external knowledge grounding, or IMCS-21 (Chen et al., 2022), with multi-level entity, act, and diagnosis annotation, MedDialog-CN is comparatively coarse in its labeling schema. While it excels in raw scale and specialty breadth, applications demanding fine-grained clinical reasoning, intent detection, or reinforcement learning alignment would require additional annotation layers.
Furthermore, MedDialog-CN’s primary strength lies in enabling naturalistic generation, policy modeling, and broad statistical analyses rather than supporting nuanced natural language understanding (NLU), dialogue policy learning (DPL), or computational robustness benchmarks as emphasized by newer datasets, e.g., MedGPTEval (Xu et al., 2023) and ChiMed 2.0 (Tian et al., 21 Jul 2025).
7. Significance for Telemedicine and Future Directions
By underpinning development of dialogue agents capable of both informative and safe patient interactions, MedDialog-CN has become a reference standard in telemedicine research. Its widespread specialty coverage and authentic clinical scenarios empower scale-up of conversational AI tools, support model evaluation protocols, and enable cross-sectional analyses of medical dialogue.
A plausible implication is that future work may focus on enriching MedDialog-CN with token-level annotations, aligning conversational turns to medical knowledge bases, and integrating contextual robustness evaluation—thereby enhancing its utility for safety-critical tasks, policy learning, and fine-grained performance benchmarking as model and dataset standards continue to evolve.