An improved tabular data generator with VAE-GMM integration (2404.08434v2)
Abstract: The rising use of machine learning in various fields requires robust methods to create synthetic tabular data. Such data should preserve the key characteristics of the original data while addressing data scarcity challenges. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data, which often contain both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing for a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing the use of various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, evaluating the synthetic data in terms of resemblance and utility. This evaluation shows that the model significantly outperforms CTGAN and TVAE, establishing its potential as a valuable tool for generating synthetic tabular data in various domains, particularly in healthcare.
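The central idea of the abstract, replacing the standard Gaussian prior with a Bayesian Gaussian mixture at generation time, can be illustrated with a minimal sketch. It assumes the mixture is fitted to latent codes produced by an already trained encoder, which is one plausible reading of the abstract rather than the authors' confirmed procedure. The `latent_codes` array is simulated here, and `decode` is a hypothetical placeholder for a trained VAE decoder; only scikit-learn's `BayesianGaussianMixture` is a real library API.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Stand-in for encoder outputs (assumption): in a real pipeline these would be
# the latent means a trained VAE encoder produces for the training rows.
# Two well-separated clusters make the latent space clearly non-Gaussian.
latent_codes = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(500, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(500, 2)),
])

# Fit a Bayesian Gaussian mixture to the latent codes. n_components is only an
# upper bound: the Dirichlet-process prior can drive the weights of unneeded
# components toward zero.
bgm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
bgm.fit(latent_codes)

# Generation: draw latent samples from the fitted mixture instead of N(0, I).
z_bgm, component_ids = bgm.sample(n_samples=1000)
z_std = rng.standard_normal(size=(1000, 2))  # what a plain VAE would sample


def decode(z):
    """Hypothetical placeholder for the trained VAE decoder (identity here)."""
    return z


synthetic = decode(z_bgm)
```

In this toy setup the mixture samples stay close to the two latent clusters, whereas the standard normal samples concentrate around the origin, a region the simulated encoder never used. That mismatch is the limitation of a strictly Gaussian latent space that the abstract refers to.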
Authors: Patricia A. Apellániz, Juan Parras, Santiago Zazo