Guided Discrete Diffusion for Electronic Health Record Generation (2404.12314v2)
Abstract: Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite their wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using a discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership vulnerability risks. Furthermore, we show that EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.
- Zixiang Chen
- Jun Han
- Yongqian Li
- Yiwen Kou
- Eran Halperin
- Robert E. Tillman
- Quanquan Gu