
CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis (2304.12654v2)

Published 25 Apr 2023 in cs.LG and cs.AI

Abstract: With the growing attention to tabular data, synthetic tables are being applied to an expanding range of tasks. Owing to recent advances in generative modeling, the data produced by tabular synthesis models has become sophisticated and realistic; however, modeling the discrete variables (columns) of tabular data remains difficult. In this work, we propose to process continuous and discrete variables separately, but conditioned on each other, with two diffusion models. The two diffusion models are co-evolved during training by reading conditions from each other. To bind the diffusion models further, we also introduce a contrastive learning method with negative sampling. In experiments with 11 real-world tabular datasets and 8 baseline methods, we demonstrate the efficacy of the proposed method, called CoDi.
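The co-evolving idea in the abstract — a Gaussian diffusion model for continuous columns and a multinomial-style diffusion model for discrete columns, each conditioned on the other's noisy sample, bound by a contrastive term with negative sampling — can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the authors' implementation: the linear denoisers, noise schedules, and shuffle-based negative sampling are stand-ins for CoDi's actual networks and training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixed-type batch: 2 continuous columns, 1 discrete column (3 categories).
x_cont = rng.normal(size=(8, 2))
x_disc = np.eye(3)[rng.integers(0, 3, size=8)]      # (8, 3) one-hot rows

def noise_continuous(x, alpha_bar):
    """DDPM-style Gaussian forward process for the continuous columns."""
    eps = rng.normal(size=x.shape)
    return np.sqrt(alpha_bar) * x + np.sqrt(1 - alpha_bar) * eps, eps

def noise_discrete(x, beta):
    """Multinomial-style forward process: mix one-hots toward uniform."""
    k = x.shape[1]
    return (1 - beta) * x + beta / k

# Placeholder denoisers: each reads the OTHER modality's noisy sample as its
# condition, which is the co-evolving conditioning described in the abstract.
def denoise_cont(xt_cont, cond_disc, W):
    return np.concatenate([xt_cont, cond_disc], axis=1) @ W      # predicts eps

def denoise_disc(xt_disc, cond_cont, V):
    logits = np.concatenate([xt_disc, cond_cont], axis=1) @ V
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                      # category probs

W = rng.normal(scale=0.1, size=(5, 2))   # (2 cont + 3 disc cond) -> eps
V = rng.normal(scale=0.1, size=(5, 3))   # (3 disc + 2 cont cond) -> probs

xt_cont, eps = noise_continuous(x_cont, alpha_bar=0.5)
xt_disc = noise_discrete(x_disc, beta=0.3)

# Per-model denoising losses, each conditioned on the other modality.
loss_cont = np.mean((denoise_cont(xt_cont, xt_disc, W) - eps) ** 2)
probs = denoise_disc(xt_disc, xt_cont, V)
loss_disc = -np.mean(np.sum(x_disc * np.log(probs + 1e-9), axis=1))

# Contrastive (triplet-style) term with negative sampling: a negative pairs a
# continuous sample with a shuffled, mismatched discrete condition.
neg_disc = xt_disc[rng.permutation(8)]
d_pos = np.mean((denoise_cont(xt_cont, xt_disc, W) - eps) ** 2, axis=1)
d_neg = np.mean((denoise_cont(xt_cont, neg_disc, W) - eps) ** 2, axis=1)
loss_ctr = np.mean(np.maximum(d_pos - d_neg + 1.0, 0.0))         # margin = 1

total = loss_cont + loss_disc + loss_ctr
```

In a real training loop both denoisers would be neural networks updated jointly on these losses, so that improving one model's conditions improves the other's denoising targets.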

Authors (3)
  1. Chaejeong Lee (5 papers)
  2. Jayoung Kim (9 papers)
  3. Noseong Park (78 papers)
Citations (38)