On the Usefulness of Synthetic Tabular Data Generation (2306.15636v1)

Published 27 Jun 2023 in cs.LG

Abstract: Despite recent advances in synthetic data generation, the scientific community still lacks a unified consensus on its usefulness. It is commonly believed that synthetic data can be used for both data exchange and boosting ML training. Privacy-preserving synthetic data generation can accelerate data exchange for downstream tasks, but there is not enough evidence to show how or why synthetic data can boost ML training. In this study, we benchmarked ML performance using synthetic tabular data for four use cases: data sharing, data augmentation, class balancing, and data summarization. We observed marginal improvements for the balancing use case on some datasets. However, we conclude that there is not enough evidence to claim that synthetic tabular data is useful for ML training.
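Among the four use cases, class balancing is the one where the paper reports marginal gains: a generator produces extra minority-class rows so the training set has equal class counts. As a minimal stand-in (assumed for illustration, not the paper's method — it uses random duplication rather than a learned generative model), the balancing step could look like:

```python
import random
from collections import Counter

def oversample_minority(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the
    majority count. A learned generator (GAN, VAE, diffusion) would instead
    synthesize *new* rows here; duplication only mimics the data flow."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())          # majority-class size
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):        # fill the deficit for this class
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy imbalanced dataset: 2 positives vs. 4 negatives.
X = [[0.1], [0.2], [0.9], [1.0], [1.1], [1.2]]
y = [1, 1, 0, 0, 0, 0]
Xb, yb = oversample_minority(X, y)
print(Counter(yb))  # both classes now count 4
```

The paper's finding is that replacing this duplication step with samples from a learned tabular generator yields, at best, marginal improvements on some datasets.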

Authors (2)
  1. Dionysis Manousakas
  2. Sergül Aydöre