Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space (2310.09656v3)
Abstract: Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging because tabular data comprise intricately varied distributions and a blend of data types. This paper introduces Tabsyn, a method that synthesizes tabular data by training a diffusion model within a latent space crafted by a variational autoencoder (VAE). The key advantages of Tabsyn are (1) Generality: it handles a broad spectrum of data types by converting them into a single unified space and explicitly captures inter-column relations; (2) Quality: it optimizes the distribution of latent embeddings to ease the subsequent training of the diffusion model, which helps generate high-quality synthetic data; (3) Speed: it requires far fewer reverse steps and synthesizes data much faster than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that Tabsyn outperforms existing methods: it reduces error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimation, respectively, compared with the most competitive baselines.
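To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a VAE maps mixed-type rows into one unified latent space, a score network is trained on those latents by denoising score matching, and sampling integrates a few reverse steps before decoding. All module shapes, the noise schedule, and the Euler sampler here are illustrative assumptions for a plain-MLP toy version, not the paper's actual architecture (Tabsyn's VAE and diffusion model differ in design and tuning).

```python
# Hypothetical toy sketch of a latent-space diffusion pipeline for tabular data.
# Assumes rows are already preprocessed into a single float vector per record
# (numerical columns scaled, categorical columns one-hot encoded).
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Encodes mixed-type rows into one unified latent space and decodes back."""
    def __init__(self, in_dim: int, latent_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.decoder(z), mu, logvar

class LatentDenoiser(nn.Module):
    """Predicts the Gaussian noise mixed into a latent at noise level sigma."""
    def __init__(self, latent_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent_dim))

    def forward(self, z_noisy, sigma):
        # Condition on log-sigma by concatenating it as an extra feature.
        return self.net(torch.cat([z_noisy, sigma.log().unsqueeze(-1)], dim=-1))

def diffusion_loss(denoiser, z):
    """Denoising score matching on latents z (assumed uniform sigma schedule)."""
    sigma = torch.rand(z.size(0), device=z.device) * 2.0 + 1e-3
    eps = torch.randn_like(z)
    z_noisy = z + sigma.unsqueeze(-1) * eps
    return (denoiser(z_noisy, sigma) - eps).pow(2).mean()

@torch.no_grad()
def sample(denoiser, vae, n, latent_dim=32, steps=50, sigma_max=2.0):
    """Few-step Euler integration of the reverse ODE, then decode to rows."""
    sigmas = torch.linspace(sigma_max, 1e-3, steps + 1)
    z = torch.randn(n, latent_dim) * sigma_max
    for i in range(steps):
        eps_hat = denoiser(z, sigmas[i].expand(n))
        z = z + (sigmas[i + 1] - sigmas[i]) * eps_hat  # dz = eps_hat * d(sigma)
    return vae.decoder(z)
```

In this sketch the VAE would be trained first (reconstruction plus a KL term), the denoiser trained afterwards on the frozen encoder's latents via `diffusion_loss`, and `sample` would produce latent draws that are decoded and inverse-transformed back into the original column types. The small `steps` budget reflects the speed claim in the abstract: working in a well-shaped latent space permits far fewer reverse steps than diffusing in raw data space.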