
DDXPlus: A New Dataset For Automatic Medical Diagnosis (2205.09148v3)

Published 18 May 2022 in cs.CL, cs.AI, and cs.LG

Abstract: There has been a rapidly growing interest in Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the machine learning research literature, aiming to assist doctors in telemedicine services. These systems are designed to interact with patients, collect evidence about their symptoms and relevant antecedents, and possibly make predictions about the underlying diseases. Doctors would review the interactions, including the evidence and the predictions, collect if necessary additional information from patients, before deciding on next steps. Despite recent progress in this area, an important piece of doctors' interactions with patients is missing in the design of these systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset of roughly 1.3 million patients that includes a differential diagnosis, along with the ground truth pathology, symptoms and antecedents for each patient. Unlike existing datasets which only contain binary symptoms and antecedents, this dataset also contains categorical and multi-choice symptoms and antecedents useful for efficient data collection. Moreover, some symptoms are organized in a hierarchy, making it possible to design systems able to interact with patients in a logical way. As a proof-of-concept, we extend two existing AD and ASD systems to incorporate the differential diagnosis, and provide empirical evidence that using differentials as training signals is essential for the efficiency of such systems or for helping doctors better understand the reasoning of those systems.

Authors (5)
  1. Arsene Fansi Tchango (3 papers)
  2. Rishab Goel (10 papers)
  3. Zhi Wen (3 papers)
  4. Julien Martel (6 papers)
  5. Joumana Ghosn (5 papers)
Citations (29)

Summary

  • The paper introduces DDXPlus to address existing gaps by including differential diagnoses that mirror clinical decision-making.
  • The paper employs a rule-based system and hierarchical symptom organization to generate roughly 1.3 million detailed synthetic patient records.
  • The paper extends two existing ASD/AD models to predict differentials and reports improved differential-diagnosis metrics, which the authors present as evidence that differentials are a useful training signal.

An Overview of DDXPlus: A Novel Dataset for Automatic Medical Diagnosis

The paper "DDXPlus: A New Dataset For Automatic Medical Diagnosis" introduces DDXPlus, a synthetic dataset poised to advance research in Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems. The dataset addresses significant gaps in existing medical diagnosis datasets by including data not only on symptoms and antecedents but also on differential diagnoses, thus providing a more comprehensive tool for model training.

Key Contributions

The DDXPlus dataset combines several features that are missing or incomplete in existing datasets. The main contributions are:

  1. Inclusion of Differential Diagnosis:
    • Most existing datasets pair a set of binary symptoms with a single ground-truth disease. In practice, clinicians work from a list of plausible diagnoses, known as a differential diagnosis. DDXPlus includes this information, so models can be trained to reflect the uncertainty that clinicians routinely handle.
  2. Diverse Types of Evidence:
    • In contrast to traditional datasets that primarily use binary symptoms, DDXPlus incorporates categorical and multi-choice symptoms and antecedents. This diversification aligns better with actual clinical questioning patterns and enhances data collection efficiency.
  3. Hierarchical Organization of Symptoms:
    • Some symptoms are structured hierarchically, which can help models simulate the logical progression of a medical interview more accurately.
  4. Proprietary Medical Knowledge Base:
    • The dataset is synthesized from a proprietary medical knowledge base built from over 20,000 medical papers, which grounds the generated patient profiles in the medical literature. The knowledge base spans 440 pathologies and 802 symptoms/antecedents; the dataset initially focuses on the subset related to cough, sore throat, and breathing issues, covering 49 pathologies, 110 symptoms, and 113 antecedents.
  5. Scale and Variety:
    • DDXPlus includes approximately 1.3 million synthetic patient records, providing extensive data for training robust and diverse machine learning models.
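
The evidence types and the symptom hierarchy described above can be pictured with a small data structure. The following is a minimal sketch, not the dataset's actual schema: the class names, the `parent` field, and the example evidence names are all hypothetical, and it assumes that a child question is only asked once its parent symptom has been confirmed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EvidenceKind(Enum):
    BINARY = "binary"              # yes/no answer
    CATEGORICAL = "categorical"    # exactly one value from a fixed set
    MULTI_CHOICE = "multi_choice"  # any subset of a fixed set


@dataclass(frozen=True)
class Evidence:
    name: str
    kind: EvidenceKind
    parent: Optional[str] = None   # hypothetical link to a parent symptom


def next_questions(catalog, confirmed, asked):
    """Return names of evidence that can be asked next.

    A question is eligible if it has not been asked yet and it either has
    no parent or its parent symptom has already been confirmed.
    """
    return [
        e.name
        for e in catalog
        if e.name not in asked and (e.parent is None or e.parent in confirmed)
    ]


# Illustrative catalog only; these names are not taken from DDXPlus.
CATALOG = [
    Evidence("pain", EvidenceKind.BINARY),
    Evidence("pain_location", EvidenceKind.MULTI_CHOICE, parent="pain"),
    Evidence("pain_intensity", EvidenceKind.CATEGORICAL, parent="pain"),
    Evidence("fever", EvidenceKind.BINARY),
]

print(next_questions(CATALOG, confirmed=set(), asked=set()))
print(next_questions(CATALOG, confirmed={"pain"}, asked={"pain"}))
```

Under this assumption, follow-up questions about pain location or intensity only become available after the patient confirms pain, which is one way a system could keep an interview in a logical order.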

Dataset Characteristics and Generation Process

The dataset comprises numerous attributes for each patient including the patient's age, sex, geographical region, symptoms, antecedents, the ground truth pathology, and corresponding differential diagnosis. The generation process follows two primary steps:

  1. Patient Data Synthesis:
    • Patient demographics, diseases, symptoms, and antecedents are synthesized using a proprietary medical knowledge base, public census data, and Synthea. The synthesis covers the different evidence types (binary, categorical, and multi-choice) and preserves the hierarchy among related symptoms.
  2. Differential Diagnosis Generation:
    • Differential diagnoses for each synthesized patient are generated via a commercial rule-based AD system by cross-referencing the presented symptoms and antecedents against the rules embedded in the medical knowledge base.
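
To make the two steps above concrete, here is a toy sketch of a patient record and a rule-matching ranker. It is not the commercial AD system, whose rules are proprietary and not described here. The field names, the `toy_differential` function, the scoring rule (fraction of a pathology's rule evidence that the patient reports, normalized to sum to 1), and the placeholder pathology and evidence names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    # Attributes listed in the text; the field names are hypothetical.
    age: int
    sex: str
    region: str
    symptoms: set
    antecedents: set
    pathology: str                                     # ground-truth pathology
    differential: list = field(default_factory=list)  # (pathology, score) pairs


def toy_differential(evidence, rules):
    """Rank pathologies by the share of their rule evidence the patient reports.

    `rules` maps a pathology name to the set of evidence associated with it.
    Scores are normalized to sum to 1; pathologies with no match are dropped.
    """
    raw = {p: len(evidence & req) / len(req) for p, req in rules.items() if req}
    total = sum(raw.values())
    if total == 0:
        return []
    ranked = [(p, s / total) for p, s in raw.items() if s > 0]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Placeholder rules with no medical meaning.
RULES = {
    "pathology_a": {"evidence_1", "evidence_2"},
    "pathology_b": {"evidence_1", "evidence_3"},
    "pathology_c": {"evidence_4"},
}

patient = PatientRecord(
    age=42, sex="F", region="region_x",
    symptoms={"evidence_1", "evidence_2"}, antecedents=set(),
    pathology="pathology_a",
)
patient.differential = toy_differential(patient.symptoms | patient.antecedents, RULES)
print(patient.differential)
```

A real rule-based system would weigh evidence far more carefully; the point of the sketch is only that each record carries both a single ground-truth pathology and a ranked list of alternatives.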

Empirical Evaluation

Two existing models, AARLC (Adaptive Alignment of Reinforcement Learning and Classification) and BASD (Baseline ASD), were extended to use differential diagnoses as training signals, with the following results:

  • AARLC Model Enhancements:
    • When trained to predict differentials, the AARLC model showed clear improvements in differential diagnosis recall (DDR), precision (DDP), and F1 score (DDF1) compared with its original single-pathology classification setup.
    • Interaction length (IL) increased, suggesting that the model collected more evidence before making a prediction.
  • BASD Model Adjustments:
    • While BASD's positive evidence recall (PER) did not improve markedly, the differentials it produced aligned better with medical reasoning, as reflected in higher DDR, DDP, and DDF1 scores.
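
The differential metrics named above can be illustrated with a short computation. This is a minimal sketch that assumes DDR and DDP are the set-based recall and precision of the predicted differential against the reference differential, and that DDF1 is their harmonic mean; the paper's exact definitions (for example, any thresholding or averaging over patients) may differ. The function name is hypothetical.

```python
def differential_scores(predicted, reference):
    """Return (DDR, DDP, DDF1) for one patient under a set-based assumption.

    predicted: set of pathologies in the model's differential
    reference: set of pathologies in the reference differential
    """
    hits = len(predicted & reference)
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    if hits == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Placeholder pathology names: two of three predictions are in the reference,
# and two of the four reference pathologies are recovered.
ddr, ddp, ddf1 = differential_scores({"a", "b", "c"}, {"a", "b", "d", "e"})
print(round(ddr, 3), round(ddp, 3), round(ddf1, 3))
```

In this example the model recovers half of the reference differential (DDR = 0.5) while two thirds of its predictions are correct (DDP ≈ 0.667), which gives DDF1 = 4/7.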

Implications and Future Directions

The inclusion of differential diagnoses within the dataset gives ASD/AD systems a richer training signal and makes their output closer to what clinicians expect to review. By training models on how clinicians weigh multiple plausible pathologies, DDXPlus may support the development of systems that clinicians find easier to trust and to integrate into their workflow.

Future research can leverage DDXPlus for various explorations:

  • Refining the models to handle hierarchical symptom structures.
  • Incorporating severity levels into model training for prioritization of high-risk pathologies.
  • Extending the dataset to include all 440 pathologies and examining its impact on model robustness.
  • Investigating machine learning techniques that efficiently integrate both positive and negative evidence collection.

Conclusion

DDXPlus marks a significant advancement in the availability and comprehensiveness of datasets for ASD and AD research. By closely mirroring real-life clinical diagnostic processes and interactions, DDXPlus enables the development of diagnostic systems that align with and enhance clinical practice, providing a platform for significant future advancements in medical AI.
