
Medical Conversations to Disease Dataset

Updated 7 December 2025
  • Medical Conversations to Disease Dataset is a corpus of multi-turn patient–clinician and patient–bot dialogues annotated with single or multiple disease labels, crucial for diagnostic and conversational AI research.
  • It incorporates diverse data sources—from real clinical consultations to synthetic dialogues—using rigorous annotation protocols to capture symptoms, intents, and entity details.
  • The dataset supports robust benchmarking for tasks such as disease prediction, entity extraction, and dialogue generation, with performance metrics like accuracy and F1 score.

A Medical Conversations to Disease Dataset is a corpus in which each record consists of a patient–clinician (or patient–bot) dialogue mapped to a disease label or a set of disease labels. These resources are foundational for developing machine learning models for diagnostic prediction, symptom extraction, medical reasoning, and conversational AI evaluation. Modern datasets span clinical, telemedicine, mental health, and chatbot domains, incorporate both real and synthetic conversations, and employ rigorous annotation protocols.

1. Dataset Scope, Structure, and Disease Labeling

A Medical Conversations to Disease Dataset typically consists of multi-turn dialogues between patients and clinicians (or between patients and AI agents), with each conversation annotated at the dialogue level by one or more disease labels. Datasets vary by clinical focus, conversational structure, annotation granularity, and scale.

  • Scope of Clinical Conditions: Datasets range from narrow single-disease coverage (e.g., only COVID-19 in CovidDialog (Yang et al., 2020)) to broad, multi-domain and multi-disease taxonomies spanning hundreds of conditions, as in ReMeDi (843 diseases) (Yan et al., 2021), MedSynth (2,001 ICD-10 codes) (Mianroodi et al., 2 Aug 2025), and the 24-condition Medical Conversations to Disease Dataset analyzed in (Razavi, 30 Nov 2025).
  • Dialogue Structure: Conversation length varies widely. MedDG has an average of 21.64 turns per dialogue (gastroenterology consultations) (Liu et al., 2020), Empathical averages 6.56 turns (Tomar et al., 18 May 2024), and the Medical Conversations to Disease Dataset for symptom-pattern mining averages 15 turns (Razavi, 30 Nov 2025). Synthetic datasets engineered for wide disease coverage (e.g., MedSynth, ReMeDi) often employ templated data or LLM-based pipelines to ensure uniformity across disease classes.
  • Labeling Schema: Disease labels may be single or multi-label, mapped to taxonomic standards (ICD-10 codes in MedSynth and MDD-5k (Yin et al., 22 Aug 2024), 843-category taxonomy in ReMeDi). Some datasets include comorbidity combinations (PsyCoTalk: six psychiatric comorbidity types over four core disorders (Wan et al., 29 Oct 2025)) or support mapping dialogue history to multiple diseases per instance.
  • Fine-Grained and Secondary Labels: Beyond the disease diagnosis, many datasets provide fine-grained labels at the turn or token level (symptoms, attributes, and intents such as “symptom” vs. “affirmative” in Empathical), as well as structured “slot” annotations covering duration, medication, and test results (Yan et al., 2021, Tomar et al., 18 May 2024, Liu et al., 2020).
  • Data Formats: Most datasets are released as JSON or JSONL, with records containing dialogue IDs, lists of speaker/utterance tuples, and disease label(s). MedSynth further adds a linked clinical note (SOAP format), and some (e.g., ReMeDi, MedDG) include sub-utterance entity spans or slot annotations for NLU research.
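A minimal sketch of such a record and a loader is shown below; the field names (dialogue_id, turns, diseases, slots) are hypothetical placeholders rather than any specific dataset’s schema, and real releases vary in structure.

```python
import json

# Illustrative JSONL record for a dialogue-to-disease corpus; the field names
# ("dialogue_id", "turns", "diseases", "slots") are hypothetical and differ by dataset.
record = {
    "dialogue_id": "conv_000123",
    "turns": [
        {"speaker": "patient", "utterance": "I've had a fever and a dry cough for three days."},
        {"speaker": "doctor", "utterance": "Any shortness of breath or chest pain?"},
        {"speaker": "patient", "utterance": "Some tightness in my chest, but no pain."},
    ],
    "diseases": ["J06.9"],           # single- or multi-label; ICD-10 codes or disease names
    "slots": {"duration": "3 days"}  # optional fine-grained annotations (symptoms, intents, ...)
}

def load_jsonl(path):
    """Read a JSONL release: one dialogue record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```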

2. Data Sources, Collection, and Generation Methodologies

Medical conversation–to–disease datasets are constructed from a range of sources and employ both manual and automatic data generation methods to address privacy, diversity, and coverage constraints.

  • Real Clinical Data: Some corpora derive directly from expert-annotated human–human dialogues. For example, MedDG compiles 17,864 Chinese doctor–patient consultations from Doctor-Chunyu (gastroenterology) (Liu et al., 2020). The “Extracting Structured Data from Physician-Patient Conversations” corpus aggregates 6,862 fully transcribed, timestamped clinical conversations from four specialties (Krishna et al., 2020).
  • Telemedicine and QA Platforms: CovidDialog harvests English data from icliniq.com, healthcaremagic.com, and healthtap.com, with Chinese data from haodf.com (Yang et al., 2020).
  • Synthetic Data and LLM-Augmented Pipelines:
    • The rural Nepal DialoGPT corpus leverages Gemini and Claude LLMs to generate 1,000 synthetic dialogues covering ten endemic diseases, validated by rural-health practitioners (Poudel et al., 1 Nov 2025).
    • MedSynth uses a multi-agent GPT-4o pipeline to generate 10,035 dialogue–note pairs spanning over 2,000 ICD-10 codes, with scenario-role assignment, fact diversity enforcement, and SOAP note–driven conversation synthesis (Mianroodi et al., 2 Aug 2025).
    • MDD-5k and PsyCoTalk construct large-scale mental health diagnostic dialogues using neuro-symbolic frameworks, diagnosis trees, and context/state machines informed by SCID-5/DSM-5 protocols and synthetic EMRs (Yin et al., 22 Aug 2024, Wan et al., 29 Oct 2025).
  • Dialogue Simulation: Many mental health and low-resource datasets simulate realistic conversational flow using agent-based models, randomized persona assignment, and multi-agent choreography (e.g., MDD-5k, PsyCoTalk).
  • Curation and Preprocessing: Common steps include duplicate removal, de-identification (regex and manual review for privacy), formatting normalization, and consistent tokenization (e.g., WordPiece, BERT-base tokenizers) (Razavi, 30 Nov 2025, Yang et al., 2020, Liu et al., 2020).
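A minimal sketch of these curation steps follows, under simplifying assumptions: the de-identification patterns are deliberately reduced examples (real pipelines combine broader rule sets with manual review), the record keys follow the hypothetical schema sketched in Section 1, and the HuggingFace transformers library is assumed for WordPiece tokenization.

```python
import re
from transformers import AutoTokenizer  # WordPiece tokenization via a BERT-base tokenizer

# Simplified de-identification patterns; illustrative only, not a complete PHI rule set.
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def deidentify(text: str) -> str:
    text = PHONE.sub("[PHONE]", text)
    return EMAIL.sub("[EMAIL]", text)

def preprocess(dialogues):
    """Deduplicate, de-identify, and tokenize dialogue records (hypothetical schema)."""
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    seen, cleaned = set(), []
    for d in dialogues:
        text = " ".join(turn["utterance"] for turn in d["turns"])
        key = text.strip().lower()
        if key in seen:              # duplicate removal on normalized text
            continue
        seen.add(key)
        d["text"] = deidentify(text)
        d["input_ids"] = tokenizer(d["text"], truncation=True, max_length=512)["input_ids"]
        cleaned.append(d)
    return cleaned
```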

3. Annotation Protocols and Symptom Mapping

Annotation schemas increasingly couple disease-label assignment to rich, intermediate structures (e.g., symptom, slot, and entity-level tags), supporting a variety of diagnostic and NLU tasks.

  • Disease Labeling: Typically one “gold” disease label per dialogue, or a small set of disease labels per conversation in comorbidity-focused corpora (e.g., multi-label annotation in PsyCoTalk (Wan et al., 29 Oct 2025)).
  • Symptom and Entity Labels: MedDG exhaustively annotates five categories (disease, symptom, medicine, examination, attribute), with high inter-annotator agreement (Cohen’s κ ∈ [0.948, 0.976]) (Liu et al., 2020). Empathical tags each patient utterance with intent (“symptom” or “affirmative”) and 0–228 symptom slots, achieving Fleiss’ κ = 0.76 (Tomar et al., 18 May 2024); a sketch of computing such agreement statistics follows this list.
  • Turn-Level Annotation: MDD-5k attaches the active diagnosis tree node to every utterance; PsyCoTalk uses turn-level “response_label” variables marking symptom presence/absence and tree traversal states (Yin et al., 22 Aug 2024, Wan et al., 29 Oct 2025).
  • Entity Normalization: Datasets such as ReMeDi and MedDG standardize entity spans against knowledge bases (CMeKG2.0, canonical ontologies), and recommend mapping symptoms to SNOMED/LOINC for interoperability (Yan et al., 2021, Liu et al., 2020, Yang et al., 2020).
  • Quality Assurance and Agreement: Human evaluation of realism, fluency, and domain relevance (multiple expert annotators, e.g., psychiatry panels in MDD-5k and PsyCoTalk) ensures data validity and supports benchmarking diversity (“doctor proactivity,” “patient engagement”) (Yin et al., 22 Aug 2024, Wan et al., 29 Oct 2025).
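Agreement statistics such as the Cohen’s κ values above can be reproduced with standard tooling; the sketch below uses scikit-learn on toy annotator labels that follow the MedDG-style entity categories, and is not any dataset’s official evaluation script.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' category labels for the same five spans (toy data).
annotator_a = ["symptom", "symptom", "medicine", "examination", "disease"]
annotator_b = ["symptom", "attribute", "medicine", "examination", "disease"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # values close to 1.0 indicate strong agreement
```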

4. Benchmarking Tasks and Computational Methodologies

Datasets are routinely designed to enable robust benchmarking on NLP and AI tasks requiring dialogue-to-disease mapping. Standard model families and evaluation protocols include:

  • Disease Prediction/Classification: Train/test splits for supervised mapping from a dialogue history X to disease label(s) d*. ReMeDi (843-way), MedSynth (2,001 ICD-10 codes), and Empathical (90 classes) provide challenging multi-class settings (Yan et al., 2021, Mianroodi et al., 2 Aug 2025, Tomar et al., 18 May 2024).
  • Turn-Level NLU: MedDG and Empathical support entity recognition, slot filling, and intent prediction at the sub-utterance level (Liu et al., 2020, Tomar et al., 18 May 2024).
  • Dialogue Generation and Next-Entity Prediction: Tasks include generating doctor replies either conditioned on patient history and disease (CovidDialog, MedDG), or predicting future entity mentions as an NLU component (Yang et al., 2020, Liu et al., 2020).
  • Comorbidity and Psychiatric Reasoning: PsyCoTalk enables multi-label classification and explicit state tracking over overlapping diagnostic pathways (Wan et al., 29 Oct 2025).
  • NLP Tools Used: Common architectures include BERT-MED, ClinicalBERT, MT5, GPT-family models, hybrid models with knowledge graph attention (GAT, KI-DDI), and LLM-enriched simulation agents (Yan et al., 2021, Tomar et al., 18 May 2024, Mianroodi et al., 2 Aug 2025).
  • Performance Metrics: Accuracy, macro/micro precision/recall/F1, BLEU/ROUGE/METEOR for NLG, human expert ratings (fluency, relevance, expertise), and data-driven topic modeling coherence (e.g., LDA topic coherence 0.32 in (Razavi, 30 Nov 2025)).
  • Pre-filtering and Subset Selection: For long transcripts, “noteworthy utterance” filtering (e.g., logistic regression over TF–IDF or UMLS-derived features) improves downstream diagnosis accuracy; micro-F1 rises from 0.6009 to 0.7029 with ClinicalBERT on “diagnosis-noteworthy” subsets (Krishna et al., 2020). A minimal filtering sketch follows this list.
  • Data Augmentation and Balancing: Synthesizing or oversampling low-frequency disease categories to address class imbalance (CovidDialog, MedSynth) (Yang et al., 2020, Mianroodi et al., 2 Aug 2025).
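The pre-filtering idea can be illustrated with a minimal sketch: a TF–IDF plus logistic regression classifier flags diagnosis-noteworthy utterances before a downstream diagnosis model sees them. The utterances and labels below are toy data, and this is a simplified stand-in for, not a reproduction of, the pipeline in (Krishna et al., 2020).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy utterances with binary "diagnosis-noteworthy" labels.
utterances = [
    "I've been having sharp chest pain when I climb stairs.",
    "Thanks doctor, have a good weekend.",
    "The cough started about two weeks ago and gets worse at night.",
    "Can you fax that form to my employer?",
]
noteworthy = [1, 0, 1, 0]

filter_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
filter_model.fit(utterances, noteworthy)

# Keep only utterances the filter scores as noteworthy before running the
# downstream diagnosis classifier (e.g., a ClinicalBERT-style model).
kept = [u for u in utterances if filter_model.predict([u])[0] == 1]
```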

5. Analytical Uses: Symptom Pattern Mining and Knowledge Extraction

Datasets bridging conversation and disease labels support advanced analytics for clinical pattern discovery and decision support:

  • Topic Modeling and Clustering: The Medical Conversations to Disease Dataset (Razavi, 30 Nov 2025) employs LDA (K=5) to elucidate latent symptom themes with moderate coherence (0.32), and K-means (K=5) with average silhouette score 0.40 to cluster symptom descriptions. Cluster centroids correspond to recognizable clinical scenarios (e.g., respiratory/allergy, pain/gastrointestinal).
  • NER and Association Rule Mining: Transformer-based NER achieves F1 = 0.84 on symptom extraction (Razavi, 30 Nov 2025), and Apriori mining uncovers high-confidence symptom associations (e.g., {fever} ⇒ {headache}: support = 0.12, confidence = 0.85); a mining sketch follows this list.
  • Knowledge Graph Embedding and Fusion: KI-DDI fuses symptom–disease graph attention embeddings with dialogue encodings for disease classification, improving over text-only models (Tomar et al., 18 May 2024).
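The association-rule step can be sketched with the mlxtend implementation of Apriori; the per-dialogue symptom sets below are illustrative rather than taken from the dataset, and mlxtend is an assumed dependency.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative per-dialogue symptom sets, e.g., as extracted by the NER step.
transactions = [
    ["fever", "headache", "cough"],
    ["fever", "headache"],
    ["cough", "sore throat"],
    ["fever", "headache", "fatigue"],
    ["nausea", "abdominal pain"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

frequent = apriori(onehot, min_support=0.1, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```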

6. Accessibility, Limitations, and Extensibility

  • Dataset Release and Licensing: Publicly released datasets include MedSynth (HuggingFace, MIT-/CC-style), Empathical (GitHub), MedDG (GitHub)—with varying restrictions for research-only or non-commercial use (Mianroodi et al., 2 Aug 2025, Tomar et al., 18 May 2024, Liu et al., 2020). PsyCoTalk and certain clinical datasets require agreement due to privacy (Wan et al., 29 Oct 2025, Krishna et al., 2020).
  • Language Coverage: Datasets exist in English (CovidDialog, MedSynth, Empathical), Chinese (MedDG, ReMeDi, MDD-5k, PsyCoTalk), and code-mixed or bilingual forms. Cross-lingual generalizability remains a challenge due to resource imbalance (Yang et al., 2020, Yan et al., 2021).
  • Imbalance and Specialty Bias: Many corpora reveal substantial class imbalance (long-tail disease distribution in ReMeDi), and specialty skew (e.g., mostly cardiovascular/family medicine in (Krishna et al., 2020)). Synthesized datasets such as MedSynth and the Nepalese diseases corpus enforce balanced sampling (Mianroodi et al., 2 Aug 2025, Poudel et al., 1 Nov 2025).
  • Privacy and Realism: Ethical and privacy considerations restrict access to real-world patient conversations; synthetic data and rigorous de-identification protocols address this but may trade off some realism (Yin et al., 22 Aug 2024, Poudel et al., 1 Nov 2025). Human validation studies consistently benchmark the plausibility of such data to real interviews (Wan et al., 29 Oct 2025, Yin et al., 22 Aug 2024).
  • Extensibility: Datasets are extensible via integration of new disease categories, normalization to clinical ontologies (SNOMED, LOINC), service-segment annotations, and richer demographic fields (Yang et al., 2020, Yan et al., 2021). A plausible implication is that expansion in scope and annotation depth may directly benefit future diagnostic modeling and benchmark generalizability.

7. Representative Datasets: Summary Comparison

| Dataset | Language | Dialogues | Disease Coverage | Structure | Access / License |
|---|---|---|---|---|---|
| CovidDialog (Yang et al., 2020) | English / Chinese | 603 + 1,088 | COVID-19 (extendable) | 2–8.7 turns, symptoms, metadata | GitHub, research |
| ReMeDi (Yan et al., 2021) | Chinese | 96,965 | 843 diseases, 40 domains | Fine-grained slot annotations | Released, research |
| MedDG (Liu et al., 2020) | Chinese | 17,864 | 12 gastroenterology diseases | 21.6 turns, 160 entity types | GitHub, academic |
| MedSynth (Mianroodi et al., 2 Aug 2025) | English | 10,035 | 2,001 ICD-10 codes | Dialogue + SOAP note | HuggingFace, MIT/CC |
| MDD-5k (Yin et al., 22 Aug 2024) | Chinese | 5,000 | 25+ ICD-10 mental disorders | Long diagnostic dialogues, diagnosis tree | Planned, research-only |
| PsyCoTalk (Wan et al., 29 Oct 2025) | Chinese | 3,000 | 6 comorbidity combinations | Hierarchical state dialogue | Controlled access |
| Empathical (Tomar et al., 18 May 2024) | English | 1,367 | 90 diseases | 6.56 turns, intent/symptom slots | GitHub |
| Nepal Rural (Poudel et al., 1 Nov 2025) | English | 1,000 | 10 common diseases | 6–12 turns, synthetic, CSV+JSON | By request |
| Med Conv→Disease (Razavi, 30 Nov 2025) | English | 960 | 24 clinical conditions | 15 turns, NER, symptom pairs | Article-specific |

References

For additional operational details, access instructions, and annotation protocols, refer to the specific dataset repositories and the associated publications.
