
Structured Evaluation of Synthetic Tabular Data (2403.10424v2)

Published 15 Mar 2024 in cs.LG and stat.ML

Abstract: Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data, but we lack an objective, coherent interpretation of them. To address this issue, we propose an evaluation framework with a single mathematical objective: that the synthetic data should be drawn from the same distribution as the observed data. Through various structural decompositions of this objective, the framework lets us reason, for the first time, about the completeness of any set of metrics, and it unifies existing metrics, including those stemming from fidelity considerations, downstream applications, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets.
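The abstract's single objective — that synthetic data should be drawn from the same distribution as the observed data — can be illustrated with a minimal model-free check on one marginal. The sketch below uses a two-sample Kolmogorov–Smirnov statistic as an illustrative stand-in; it is not one of the paper's proposed metrics, and the variable names and data are invented for the example.

```python
import numpy as np

def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the two
    empirical CDFs. 0 means identical samples; values near 1 mean
    the samples occupy disjoint regions."""
    real = np.sort(np.asarray(real, dtype=float))
    synth = np.sort(np.asarray(synth, dtype=float))
    grid = np.concatenate([real, synth])
    # Empirical CDFs of both samples evaluated on the pooled grid.
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_synth = np.searchsorted(synth, grid, side="right") / len(synth)
    return float(np.max(np.abs(cdf_real - cdf_synth)))

rng = np.random.default_rng(0)
observed = rng.normal(0.0, 1.0, 500)   # stand-in for a real column
good_synth = rng.normal(0.0, 1.0, 500) # same distribution
bad_synth = rng.normal(2.0, 1.0, 500)  # shifted distribution

# A synthesizer matching the observed distribution scores lower.
assert ks_statistic(observed, good_synth) < ks_statistic(observed, bad_synth)
```

Marginal checks like this are only one "structural decomposition" of the objective; matching every marginal does not imply matching the joint distribution, which is why a framework that reasons about the completeness of a metric set is useful.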

Authors (5)
  1. Scott Cheng-Hsin Yang (9 papers)
  2. Baxter Eaves (2 papers)
  3. Michael Schmidt (40 papers)
  4. Ken Swanson (1 paper)
  5. Patrick Shafto (28 papers)
Citations (1)