- The paper introduces DDXPlus to address gaps in existing medical diagnosis datasets by including differential diagnoses that mirror clinical decision-making.
- The paper synthesizes roughly 1.3 million detailed patient records with hierarchically organized symptoms and uses a rule-based system to generate each patient's differential diagnosis.
- The paper reports improved differential diagnosis performance for ASD/AD models trained on the dataset, which suggests it could support more clinically realistic applications.
An Overview of DDXPlus: A Novel Dataset for Automatic Medical Diagnosis
The paper "DDXPlus: A New Dataset For Automatic Medical Diagnosis" introduces DDXPlus, a synthetic dataset intended to advance research on Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems. The dataset addresses gaps in existing medical diagnosis datasets by including not only symptoms and antecedents but also a differential diagnosis for each patient, which gives models a richer training signal.
Key Contributions
The DDXPlus dataset combines several features that are missing or incomplete in existing datasets. The main contributions are:
- Inclusion of Differential Diagnosis:
- Most existing datasets pair a patient's symptoms with a single ground-truth disease. In practice, clinicians work with a list of plausible diagnoses, known as a differential diagnosis. DDXPlus includes this information, so models can be trained to reflect the uncertainty that clinicians routinely reason under.
- Diverse Types of Evidence:
- In contrast to traditional datasets that primarily use binary symptoms, DDXPlus incorporates categorical and multi-choice symptoms and antecedents. This better matches how clinicians actually ask questions and can make evidence collection more efficient.
- Hierarchical Organization of Symptoms:
- Some symptoms are structured hierarchically, which can help models simulate the logical progression of a medical interview more accurately.
- Proprietary Medical Knowledge Base:
- The dataset is synthesized from a proprietary medical knowledge base compiled from over 20,000 medical papers, which is intended to keep the generated patient profiles clinically plausible. The knowledge base spans 440 pathologies and 802 symptoms/antecedents; DDXPlus currently focuses on a subset related to cough, sore throat, and breathing issues, covering 49 pathologies, 110 symptoms, and 113 antecedents.
- Scale and Variety:
- DDXPlus includes approximately 1.3 million synthetic patient records, providing extensive data for training robust and diverse machine learning models.
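The evidence types and hierarchy described above can be pictured with a small data-structure sketch. This is only an illustration: the class names, field names, and example evidences (`pain`, `pain_location`, and so on) are assumptions made for this example, not the dataset's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class EvidenceType(Enum):
    BINARY = "binary"              # yes/no answer
    CATEGORICAL = "categorical"    # exactly one value from a fixed set
    MULTI_CHOICE = "multi_choice"  # any number of values from a fixed set


@dataclass(frozen=True)
class Evidence:
    name: str
    kind: EvidenceType
    is_antecedent: bool = False
    possible_values: Tuple[str, ...] = ()
    parent: Optional[str] = None   # follow-up questions name their parent evidence


def can_ask(evidence: Evidence, answers: Dict[str, object]) -> bool:
    """A follow-up is askable only once its parent has a positive answer."""
    if evidence.parent is None:
        return True
    return bool(answers.get(evidence.parent))


# Hypothetical evidences, invented for illustration only.
pain = Evidence("pain", EvidenceType.BINARY)
pain_location = Evidence("pain_location", EvidenceType.MULTI_CHOICE,
                         possible_values=("throat", "chest", "head"), parent="pain")
pain_intensity = Evidence("pain_intensity", EvidenceType.CATEGORICAL,
                          possible_values=("mild", "moderate", "severe"), parent="pain")
smoker = Evidence("smoker", EvidenceType.BINARY, is_antecedent=True)

# Once the patient has confirmed pain, the follow-ups become askable.
answers = {"pain": True}
askable = [e.name for e in (pain_location, pain_intensity, smoker) if can_ask(e, answers)]
```

Modeling the hierarchy as a `parent` link keeps the structure flat and easy to serialize, while still letting an interviewing agent skip follow-up questions whose parent symptom has not been confirmed.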
Dataset Characteristics and Generation Process
The dataset comprises numerous attributes for each patient including the patient's age, sex, geographical region, symptoms, antecedents, the ground truth pathology, and corresponding differential diagnosis. The generation process follows two primary steps:
- Patient Data Synthesis:
- Patient demographics, diseases, symptoms, and antecedents are synthesized using a proprietary medical knowledge base, public census data, and Synthea, a synthetic patient generator. The synthesis accounts for the different evidence types and the hierarchical organization of symptoms, which yields detailed patient profiles.
- Differential Diagnosis Generation:
- Differential diagnoses for each synthesized patient are generated via a commercial rule-based AD system by cross-referencing the presented symptoms and antecedents against the rules embedded in the medical knowledge base.
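To make the two steps above concrete, the following sketch shows what a per-patient record with the listed attributes might look like, together with a toy rule-matching function that stands in for the differential generation step. The commercial rule-based system is not described in detail here, so `toy_differential`, its scoring rule, and the example rules are purely hypothetical; the record also simplifies symptoms and antecedents to plain sets of names.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class PatientRecord:
    age: int
    sex: str
    region: str
    symptoms: Set[str]        # simplified: names of positive symptoms only
    antecedents: Set[str]     # simplified: names of positive antecedents only
    pathology: str            # ground-truth pathology
    differential: List[Tuple[str, float]]  # (pathology, score), best first


def toy_differential(findings: Set[str],
                     rules: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Toy stand-in, NOT the commercial system: score each pathology by the
    fraction of its expected findings that the patient presents."""
    scored = []
    for pathology, expected in rules.items():
        score = len(findings & expected) / len(expected)
        if score > 0:
            scored.append((pathology, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical rules and patient, invented for illustration only.
rules = {
    "URTI": {"cough", "sore_throat"},
    "Pneumonia": {"cough", "fever", "smoker"},
    "GERD": {"heartburn"},
}
symptoms = {"cough", "sore_throat"}
antecedents = {"smoker"}

patient = PatientRecord(
    age=34, sex="F", region="example_region",
    symptoms=symptoms, antecedents=antecedents,
    pathology="URTI",
    differential=toy_differential(symptoms | antecedents, rules),
)
```

The point of the sketch is the shape of the data: each record carries both a single ground-truth pathology and an ordered list of alternatives, which is what lets a model be trained on the differential rather than on one label.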
Empirical Evaluation
Two existing models, AARLC (Adaptive Alignment of Reinforcement Learning and Classification) and BASD (Baseline ASD), were extended to use differential diagnoses as a training signal. Both showed improved differential diagnosis metrics:
- AARLC Model Enhancements:
- When trained to predict differentials, the AARLC model showed clear improvements in differential diagnosis recall (DDR), precision (DDP), and F1 score (DDF1) compared with its original version, which was trained to classify a single pathology.
- Interaction length (IL) increased, suggesting that the model collected evidence more thoroughly before committing to a diagnosis.
- BASD Model Adjustments:
- BASD's positive evidence recall (PER) did not improve markedly, but the differential diagnoses it produced aligned better with medical reasoning, with higher DDR, DDP, and DDF1.
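The metrics named above can be sketched as simple set-overlap computations for a single patient. This assumes the definitions suggested by the metric names (recall and precision of the predicted differential against the reference differential, and recall of the patient's positive evidence); the paper's exact definitions, such as any probability thresholds or averaging across patients, may differ. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def differential_metrics(predicted: Iterable[str],
                         reference: Iterable[str]) -> Tuple[float, float, float]:
    """Return (DDR, DDP, DDF1) for one patient, treating differentials as sets."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    ddr = overlap / len(reference) if reference else 0.0   # recall
    ddp = overlap / len(predicted) if predicted else 0.0   # precision
    ddf1 = 2 * ddp * ddr / (ddp + ddr) if (ddp + ddr) > 0 else 0.0
    return ddr, ddp, ddf1


def positive_evidence_recall(collected: Iterable[str],
                             experienced: Iterable[str]) -> float:
    """Fraction of the patient's positive evidence that the agent asked about."""
    collected, experienced = set(collected), set(experienced)
    if not experienced:
        return 0.0
    return len(collected & experienced) / len(experienced)


# Hypothetical example, invented for illustration only.
ddr, ddp, ddf1 = differential_metrics(
    predicted=["URTI", "Pneumonia", "GERD"],
    reference=["URTI", "Pneumonia", "Bronchitis", "Asthma"],
)
per = positive_evidence_recall(
    collected=["cough", "fever", "heartburn"],
    experienced=["cough", "fever", "sore_throat", "dyspnea"],
)
```

In this example two of the four reference pathologies are recovered, so DDR is 0.5, while two of the three predictions are correct, so DDP is about 0.67. A dataset-level score would typically average such per-patient values.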
Implications and Future Directions
The inclusion of differential diagnoses gives ASD/AD systems a richer training signal and may make them more applicable in clinical settings. Because models trained on DDXPlus learn to weigh multiple plausible pathologies, as clinicians do, the dataset could support the development of systems that clinicians find easier to trust and adopt.
Future research can leverage DDXPlus for various explorations:
- Refining the models to handle hierarchical symptom structures.
- Incorporating severity levels into model training for prioritization of high-risk pathologies.
- Extending the dataset to include all 440 pathologies and examining its impact on model robustness.
- Investigating machine learning techniques that efficiently integrate both positive and negative evidence collection.
Conclusion
DDXPlus improves the availability and comprehensiveness of datasets for ASD and AD research. By mirroring clinical diagnostic processes and interactions more closely than earlier datasets, it supports the development of diagnostic systems that align with clinical practice and provides a platform for future work in medical AI.