MissDiff: Training Diffusion Models on Tabular Data with Missing Values (2307.00467v1)
Abstract: Diffusion models have shown remarkable performance in modeling data distributions and synthesizing data. However, the vanilla diffusion model requires complete, fully observed data for training. Incomplete data is a common issue in real-world applications such as healthcare and finance, particularly for tabular datasets. This work presents a unified and principled diffusion-based framework for learning from data with missing values under various missing mechanisms. We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective. We then propose to mask the regression loss of Denoising Score Matching in the training phase. We prove that the proposed method is consistent in learning the score of the data distribution, and that the proposed training objective serves as an upper bound for the negative likelihood in certain cases. The framework is evaluated on multiple tabular datasets using realistic and efficacious metrics, and it outperforms the state-of-the-art diffusion model for tabular data combined with the "impute-then-generate" pipeline by a large margin.
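The idea of masking the regression loss of Denoising Score Matching can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single fixed noise level `sigma` with a Gaussian perturbation kernel, a binary mask where 1 marks an observed entry, and a hypothetical `score_fn(x_t, sigma)` standing in for the learned score network. Missing entries are filled with a zero placeholder, and their residuals are multiplied by zero so they contribute nothing to the loss.

```python
import numpy as np

def masked_dsm_loss(score_fn, x_obs, mask, sigma, rng):
    """Denoising score matching loss restricted to observed entries.

    x_obs : (n, d) array; missing entries may hold NaN or any filler.
    mask  : (n, d) array; 1 = observed, 0 = missing (assumed convention).
    """
    # Replace missing entries with a placeholder so arithmetic stays finite.
    x0 = np.where(mask == 1, x_obs, 0.0)
    # Perturb with Gaussian noise: x_t = x0 + sigma * eps.
    eps = rng.standard_normal(x0.shape)
    x_t = x0 + sigma * eps
    # Score of the Gaussian kernel p(x_t | x0): -(x_t - x0) / sigma^2.
    target = -eps / sigma
    # Mask the regression residual so only observed coordinates count.
    residual = (score_fn(x_t, sigma) - target) * mask
    return np.mean(np.sum(residual ** 2, axis=1))

# Toy usage with a hand-written elementwise "score" (an assumption, not a trained model).
toy_score = lambda x_t, s: -x_t / (1.0 + s ** 2)
x = np.array([[1.0, np.nan], [2.0, 3.0]])
m = np.array([[1, 0], [1, 1]])
loss = masked_dsm_loss(toy_score, x, m, sigma=0.5, rng=np.random.default_rng(0))
```

Because the placeholder is overwritten inside the function and the residual is masked, the value stored in a missing cell never reaches the loss term for that cell. The perturbed placeholder is still passed to `score_fn` as an input, so a non-elementwise network would still see it.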
- Yidong Ouyang
- Liyan Xie
- Chongxuan Li
- Guang Cheng