
MedDialog: Two Large-scale Medical Dialogue Datasets (2004.03329v2)

Published 7 Apr 2020 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build two large-scale medical dialogue datasets: MedDialog-EN and MedDialog-CN. MedDialog-EN is an English dataset containing 0.3 million conversations between patients and doctors and 0.5 million utterances. MedDialog-CN is a Chinese dataset containing 1.1 million conversations and 4 million utterances. To the best of our knowledge, MedDialog-(EN,CN) are the largest medical dialogue datasets to date. The dataset is available at https://github.com/UCSD-AI4H/Medical-Dialogue-System

Authors (13)
  1. Xuehai He (26 papers)
  2. Shu Chen (181 papers)
  3. Zeqian Ju (13 papers)
  4. Xiangyu Dong (17 papers)
  5. Hongchao Fang (4 papers)
  6. Sicheng Wang (18 papers)
  7. Yue Yang (146 papers)
  8. Jiaqi Zeng (16 papers)
  9. Ruisi Zhang (18 papers)
  10. Ruoyu Zhang (25 papers)
  11. Meng Zhou (33 papers)
  12. Penghui Zhu (1 paper)
  13. Pengtao Xie (86 papers)
Citations (153)

Summary

Overview of MedDialog: Two Large-scale Medical Dialogue Datasets

This paper introduces MedDialog, a pair of large-scale datasets intended to advance research on medical dialogue systems. The authors describe MedDialog-EN and MedDialog-CN as, to their knowledge, the largest medical dialogue datasets to date: MedDialog-EN contains approximately 0.3 million English conversations and MedDialog-CN contains 1.1 million Chinese conversations.

Dataset Composition and Features

Both datasets serve as a comprehensive resource for developing AI-driven telemedicine solutions. MedDialog-EN includes over 514,000 utterances across 257,454 consultations, while MedDialog-CN comprises nearly 4 million utterances from 1,145,231 consultations. Notably, these datasets encompass a wide array of specialties, with MedDialog-EN covering 96 categories and MedDialog-CN offering insights into 172 fine-grained specialties.
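As a rough illustration of what these counts imply, the following sketch computes the average number of utterances per consultation for each dataset. The utterance totals are rounded approximations taken from the figures above ("over 514,000" and "nearly 4 million"), and the function name is hypothetical, so the ratios are estimates only.

```python
# Counts taken from the summary above. The utterance totals are rounded
# approximations ("over 514,000" and "nearly 4 million"), not exact values.
stats = {
    "MedDialog-EN": {"consultations": 257_454, "utterances": 514_000},
    "MedDialog-CN": {"consultations": 1_145_231, "utterances": 4_000_000},
}


def utterances_per_consultation(consultations: int, utterances: int) -> float:
    """Average number of utterances in one consultation."""
    return utterances / consultations


for name, counts in stats.items():
    ratio = utterances_per_consultation(**counts)
    print(f"{name}: ~{ratio:.1f} utterances per consultation")
```

Under these approximations, an English consultation averages about 2 utterances and a Chinese consultation about 3.5, which suggests that the English dialogues are mostly single patient-doctor exchanges.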

The datasets capture dialogues from a varied patient population. This diversity, spanning age groups, genders, and locations, helps reduce population bias and improve the generalizability of models trained on the data.

Implications for Medical Dialogue Systems

The introduction of these extensive datasets addresses key limitations in existing data resources, which were previously constrained by size and scope. By providing a rich repository of patient-doctor interactions, MedDialog enables the training of more nuanced and robust medical dialogue systems. These systems have the potential to function as virtual doctors, engaging with patients through natural language interactions, offering clinical advice, and monitoring patient conditions remotely.

Theoretical and Practical Significance

From a theoretical perspective, the datasets support the development of dialogue systems that aim to achieve doctor-level intelligence and adaptability across medical disciplines. Practically, as telemedicine becomes increasingly integral to healthcare delivery, these datasets offer significant value in reducing the burden on healthcare providers by automating routine consultations and tracking patient progress.

Potential for Future Research

The datasets' public availability opens up numerous avenues for future research. Scholars can explore advancements in dialogue management, context understanding, and response generation within the medical domain. Additionally, the datasets present opportunities to develop models that better handle multilingual and cross-cultural nuances in medical consultations.

The MedDialog datasets, by virtue of their scale and comprehensiveness, represent a crucial step forward in realizing effective and scalable telemedicine solutions. As research in AI and natural language processing progresses, these datasets will likely underpin many innovative developments in healthcare technologies.
