Generating Multi-label Discrete Patient Records using Generative Adversarial Networks (1703.06490v3)

Published 19 Mar 2017 in cs.LG and cs.NE

Abstract: Access to electronic health record (EHR) data has motivated computational advances in medical research. However, various concerns, particularly over privacy, can limit access to and collaborative use of EHR data. Sharing synthetic EHR data could mitigate risk. In this paper, we propose a new approach, medical Generative Adversarial Network (medGAN), to generate realistic synthetic patient records. Based on input real patient records, medGAN can generate high-dimensional discrete variables (e.g., binary and count features) via a combination of an autoencoder and generative adversarial networks. We also propose minibatch averaging to efficiently avoid mode collapse, and increase the learning efficiency with batch normalization and shortcut connections. To demonstrate feasibility, we showed that medGAN generates synthetic patient records that achieve comparable performance to real data on many experiments including distribution statistics, predictive modeling tasks and a medical expert review. We also empirically observe a limited privacy risk in both identity and attribute disclosure using medGAN.

Overview of medGAN: Generating Synthetic Patient Records

The paper presents medGAN, a generative adversarial network (GAN) model for generating synthetic electronic health records (EHR). It targets the problem of producing realistic, high-dimensional, multi-label discrete patient records while limiting privacy risk. Because privacy concerns restrict access to medical data for computational research, synthetic yet realistic patient data give researchers a possible alternative.

Key Contributions and Methodology

The primary contribution of the paper is medGAN, which combines an autoencoder with a GAN. The method aims to overcome the limits on utility and realism of earlier synthetic data generation strategies. The paper highlights several components:

Handling Discrete Variables: medGAN is designed specifically for the binary and count variables found in EHRs. Standard GANs are suited to continuous data, whereas medGAN generates high-dimensional discrete variables such as diagnosis and treatment codes.
Autoencoder Integration: The autoencoder learns the salient features of the discrete data. The generator can then produce a continuous representation, which the autoencoder's decoder maps to discrete outputs.
Minibatch Averaging: This technique counters mode collapse, a common issue in GANs where the generator produces only a limited variety of outputs. The discriminator sees the average of the minibatch alongside each individual sample, which gives it batch-level statistics and pushes the generator toward more diverse outputs.
Enhanced Learning Efficiency: By integrating batch normalization and shortcut connections within the generator network, medGAN achieves improved learning efficiency and stability.
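
The components above can be sketched as a single forward pass. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the layer sizes, the random untrained weights, the 0.5 rounding threshold, and all function names are hypothetical, and batch normalization is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator(z, weights):
    """Map noise to a latent code; each layer adds its input back (shortcut)."""
    h = z
    for w in weights:
        h = h + relu(h @ w)  # shortcut connection: layer output plus layer input
    return h

def decode(h, w_dec, b_dec):
    """Decoder half of an autoencoder: latent code -> per-code probabilities."""
    return sigmoid(h @ w_dec + b_dec)

def minibatch_average(batch):
    """Append the batch mean to every row so the discriminator sees batch-level statistics."""
    mean = batch.mean(axis=0, keepdims=True)
    tiled = np.repeat(mean, batch.shape[0], axis=0)
    return np.concatenate([batch, tiled], axis=1)

# Hypothetical sizes chosen only for illustration.
batch_size, latent_dim, n_codes = 8, 16, 40
gen_weights = [rng.normal(scale=0.1, size=(latent_dim, latent_dim)) for _ in range(2)]
w_dec = rng.normal(scale=0.1, size=(latent_dim, n_codes))
b_dec = np.zeros(n_codes)

z = rng.normal(size=(batch_size, latent_dim))
probs = decode(generator(z, gen_weights), w_dec, b_dec)

# Input a discriminator would receive: each sample plus the minibatch average.
disc_input = minibatch_average(probs)

# Assumed discretization step: round probabilities to binary codes.
synthetic = (probs >= 0.5).astype(np.int64)
```

In a trained model the decoder weights would come from an autoencoder fitted to real records; here they are random, so the output only demonstrates shapes and data flow.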

Experimental Evaluations

The paper evaluates medGAN on real-world datasets. Key experiments include:

Dimension-wise Probability and Prediction: Using datasets from Sutter Health and MIMIC-III, the paper compares medGAN's synthetic data with real patient data on distribution statistics and predictive modeling tasks. The results indicate that medGAN captures both the distribution of each individual dimension and the dependencies between dimensions.
Qualitative Expert Review: A medical expert's review of synthetic records further supports the realism of medGAN-generated data and its potential usefulness in research.
Privacy Risk Assessment: The paper analyzes the risk of identity and attribute disclosure from medGAN-generated data. The empirical results indicate limited risk of both, suggesting that medGAN can generate useful synthetic data while protecting patient privacy.
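
As a rough illustration of the distribution-statistics comparison, the sketch below computes the per-dimension probability of a binary code in a real and a synthetic record matrix and reports the gap. The toy matrices and the function name are hypothetical and chosen only to show the calculation; they are not data or results from the paper.

```python
import numpy as np

def dimension_wise_probability(records):
    """Fraction of records in which each binary code is present (column means)."""
    return np.asarray(records, dtype=float).mean(axis=0)

# Toy binary matrices: rows are patients, columns are medical codes.
real = np.array([[1, 0, 1],
                 [1, 1, 0],
                 [0, 0, 1],
                 [1, 0, 1]])
synthetic = np.array([[1, 0, 1],
                      [1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 1]])

p_real = dimension_wise_probability(real)
p_syn = dimension_wise_probability(synthetic)

# Per-code absolute gap; values near zero mean the marginals match.
gap = np.abs(p_real - p_syn)
print(p_real, p_syn, gap)
```

Matching marginals only covers each dimension independently. The prediction-based comparison described above is what checks whether relationships between dimensions are preserved.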

Implications and Future Directions

medGAN has implications for both privacy preservation and research utility. By making synthetic data accessible while limiting disclosure risk, it supports data-driven research without compromising individual privacy, which is critical in domains such as healthcare that depend on sensitive data.

Future advancements could explore longitudinal data generation, thereby paving the way for more dynamic and temporal analyses in medical research applications. Incorporating multi-modal data types, such as lab results and textual data, might enhance the robustness and applicability of medGAN.

In conclusion, medGAN represents a valuable tool for advancing medical research while safeguarding patient privacy. Its dual focus on empirical rigor and practical relevance positions it as a notable contribution to the field of synthetic data generation.

Authors (6)

Edward Choi (90 papers)
Siddharth Biswal (10 papers)
Bradley Malin (22 papers)
Jon Duke (4 papers)
Walter F. Stewart (8 papers)
Jimeng Sun (181 papers)

Citations (514)

GitHub

GitHub - mp2893/medgan: Generative adversarial network for generating electronic health records. (276 stars)