NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis (2401.13756v1)

Published 24 Jan 2024 in cs.LG

Abstract: This paper offers a systematic method for creating medical knowledge-grounded patient records for use in activities involving differential diagnosis. Additionally, an assessment of machine learning models that can differentiate between various conditions based on given symptoms is also provided. We use a public disease-symptom data source called SymCat in combination with Synthea to construct the patient records. In order to increase the expressive nature of the synthetic data, we use a medically-standardized symptom modeling method called NLICE to augment the synthetic data with additional contextual information for each condition. In addition, Naive Bayes and Random Forest models are evaluated and compared on the synthetic data. The paper shows how to successfully construct SymCat-based and NLICE-based datasets. We also show results for the effectiveness of using the datasets to train predictive disease models. The SymCat-based dataset is able to train a Naive Bayes and Random Forest model yielding a 58.8% and 57.1% Top-1 accuracy score, respectively. In contrast, the NLICE-based dataset improves the results, with a Top-1 accuracy of 82.0% and Top-5 accuracy values of more than 90% for both models. Our proposed data generation approach solves a major barrier to the application of artificial intelligence methods in the healthcare domain. Our novel NLICE symptom modeling approach addresses the incomplete and insufficient information problem in the current binary symptom representation approach. The NLICE code is open sourced at https://github.com/guozhuoran918/NLICE.

Summary

  • The paper demonstrates that adding NLICE's contextual symptom attributes to synthetic patient records significantly improves machine learning model accuracy in differential diagnosis.
  • The paper combines the public SymCat disease-symptom data with the Synthea patient simulator, then augments each symptom with NLICE attributes to enrich the synthetic patient records.
  • The paper reports Top-1 accuracy rising from 58.8% (Naive Bayes) and 57.1% (Random Forest) on the SymCat-based dataset to 82.0% on the NLICE-based dataset, with Top-5 accuracy above 90% for both models.

NLICE: Enhancing Synthetic Medical Records for Differential Diagnosis with Contextual Information

Introduction to Synthetic Medical Record Generation

Differential diagnosis in primary healthcare is hampered by the scarcity of large, accurate medical datasets, which stems from the justifiable need to protect patient privacy. Meanwhile, advances in machine learning (ML) have shown potential to assist in diagnosing diseases from presented symptoms. This paper introduces a method for generating synthetic patient records using a medically standardized symptom modeling approach named NLICE, which makes the synthetic data more expressive by adding contextual information for each medical condition. The data is then used to train and evaluate two ML models, Naive Bayes and Random Forest, on their capacity to support differential diagnosis.

Generation Methodology

The method combines SymCat, a public symptom-condition database, with Synthea, a patient record simulator, to generate baseline synthetic datasets. This baseline data, however, represents each symptom only as present or absent. To address this, the paper introduces the NLICE method for symptom modeling, which adds five further dimensions to a symptom's representation: Nature, Location, Intensity, Chronology, and Excitation. These additional attributes aim to increase the discriminative power of the synthetic medical records.
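To make the difference between a binary symptom and an NLICE-style symptom concrete, here is a minimal Python sketch. The `NliceSymptom` class, the two encoder functions, the feature-key format, and all attribute values are hypothetical and chosen only for illustration; they are not taken from the NLICE codebase, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass(frozen=True)
class NliceSymptom:
    """A symptom plus the five NLICE attributes (hypothetical schema)."""
    name: str
    nature: Optional[str] = None      # e.g. "sharp" or "dull"
    location: Optional[str] = None    # e.g. "chest"
    intensity: Optional[str] = None   # e.g. "mild" or "severe"
    chronology: Optional[str] = None  # e.g. "sudden" or "gradual"
    excitation: Optional[str] = None  # e.g. "exertion"


def binary_features(symptoms: List[NliceSymptom]) -> Dict[str, int]:
    """Presence-only encoding: one feature per symptom name."""
    return {s.name: 1 for s in symptoms}


def nlice_features(symptoms: List[NliceSymptom]) -> Dict[str, int]:
    """Presence features plus one feature per non-empty NLICE attribute."""
    features = binary_features(symptoms)
    for s in symptoms:
        for f in fields(s):
            value = getattr(s, f.name)
            if f.name != "name" and value is not None:
                features[f"{s.name}|{f.name}={value}"] = 1
    return features


# Two made-up presentations of the same symptom.
patient_a = [NliceSymptom("pain", nature="sharp", location="chest", chronology="sudden")]
patient_b = [NliceSymptom("pain", nature="dull", location="abdomen", chronology="gradual")]
```

Under the presence-only encoding both patients reduce to the same feature set, so no classifier could separate them on this symptom alone. The attribute-aware encoding keeps them distinct, which is the intuition behind the added discriminative power.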

Machine Learning Models and Evaluation

Naive Bayes and Random Forest models were selected for their prevalence in medical domain applications. The evaluation criteria were Top-1 accuracy, precision, and Top-5 accuracy of model predictions. The NLICE-based dataset produced better results than the SymCat-based dataset: Top-1 accuracy improved from 58.8% to 82.0% with Naive Bayes and from 57.1% to 82.0% with Random Forest, and Top-5 accuracy exceeded 90% for both models.
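For readers unfamiliar with the Top-k metrics quoted here, the following Python sketch shows one common way to compute them: a prediction counts as correct when the true condition appears among the k highest-scoring candidates. The function name `top_k_accuracy`, the condition names, and the scores are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, Hashable, Sequence


def top_k_accuracy(
    scores: Sequence[Dict[Hashable, float]],
    truths: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of cases whose true label is among the k best-scored labels."""
    if len(scores) != len(truths) or not truths:
        raise ValueError("scores and truths must be non-empty and equally long")
    hits = 0
    for per_condition, truth in zip(scores, truths):
        # Rank candidate conditions from most to least likely.
        ranked = sorted(per_condition, key=per_condition.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(truths)


# Made-up model scores for three patients and three candidate conditions.
scores = [
    {"flu": 0.6, "cold": 0.3, "asthma": 0.1},
    {"flu": 0.5, "cold": 0.4, "asthma": 0.1},
    {"flu": 0.2, "cold": 0.3, "asthma": 0.5},
]
truths = ["flu", "cold", "flu"]
```

With this toy data, only the first patient is a Top-1 hit, while the second becomes a hit once k is raised to 2. This is why Top-5 accuracy is always at least as high as Top-1 and suits differential diagnosis, where the goal is a short list of candidate conditions.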

Realistic Scenario Simulation

The paper further explores the behavior of these models under varied realistic scenarios: modifying the minimum number of symptoms per condition, perturbing condition-symptom probabilities, and injecting additional symptoms. These scenarios aim to simulate potential real-world deviations from the synthetic dataset's assumptions. The findings indicate that models trained on the NLICE-augmented dataset are robust to these variations, with the Random Forest model adapting particularly well.
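The three scenarios can be sketched in a few lines of Python. This summary does not describe the paper's exact procedures, so everything below is an assumption made for illustration: the uniform noise model, the top-up rule for the minimum symptom count, the function names, and the toy probabilities.

```python
import random
from typing import Dict, Iterable, Set


def perturb_probabilities(
    probs: Dict[str, float], max_shift: float, rng: random.Random
) -> Dict[str, float]:
    """Shift each symptom probability by uniform noise, clamped to [0, 1]."""
    return {
        symptom: min(1.0, max(0.0, p + rng.uniform(-max_shift, max_shift)))
        for symptom, p in probs.items()
    }


def sample_symptoms(
    probs: Dict[str, float], min_symptoms: int, rng: random.Random
) -> Set[str]:
    """Sample symptoms independently, then top up to the required minimum."""
    present = {s for s, p in probs.items() if rng.random() < p}
    # Assumed rule: add the most probable missing symptoms until the minimum is met.
    missing = sorted((s for s in probs if s not in present), key=probs.get, reverse=True)
    needed = max(0, min_symptoms - len(present))
    present.update(missing[:needed])
    return present


def inject_symptoms(
    present: Set[str], vocabulary: Iterable[str], n_extra: int, rng: random.Random
) -> Set[str]:
    """Add up to n_extra symptoms that the record does not already contain."""
    candidates = sorted(set(vocabulary) - set(present))
    extra = rng.sample(candidates, min(n_extra, len(candidates)))
    return set(present) | set(extra)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
base = {"cough": 0.8, "fever": 0.6, "headache": 0.3}  # made-up probabilities
noisy = perturb_probabilities(base, max_shift=0.1, rng=rng)
record = sample_symptoms(noisy, min_symptoms=2, rng=rng)
vocabulary = ["cough", "fever", "headache", "rash", "nausea"]
record_with_noise = inject_symptoms(record, vocabulary, n_extra=1, rng=rng)
```

A model trained on records drawn from `base` and tested on records built this way sees shifted symptom frequencies and unrelated symptoms, which is the kind of mismatch the robustness experiments are meant to probe.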

Implications and Future Directions

The proposed NLICE method for enhancing synthetic medical records has implications for AI applications in healthcare. By providing a more detailed, medically grounded representation of symptoms, the approach can help overcome the shortage of training data for ML models in medical diagnostics. The results also open the way to more sophisticated ML algorithms that can exploit the richer information in NLICE-enhanced datasets. Future research will focus on expanding the NLICE method to cover a broader range of medical conditions and symptoms, and on more complex ML architectures, such as autoencoders, which may further improve diagnostic accuracy.

Acknowledgments

The project's success owes much to the support of the EFRO Werk!Werkt and the Eureka Xecs TASTI project. The open-source availability of the NLICE codebase encourages further contributions and advancements from the research community in applying AI to healthcare diagnostics more reliably and effectively.
