
ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models (2405.17724v2)

Published 28 May 2024 in cs.AI

Abstract: Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce Cluster Latent Variable guided Denoising Diffusion Probabilistic Models (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
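The idea of using clustering labels as intermediaries across a foreign key can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm or how labels reach the parent table, so the naive k-means, the majority vote, and all names here (`kmeans`, `parent_cluster_labels`, the toy tables) are assumptions made purely for illustration. The diffusion models that would then be conditioned on these labels are omitted.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Naive k-means (assumed stand-in for whatever clustering the method uses).
    Centers are initialised deterministically from the first k points."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def parent_cluster_labels(parents, children, k):
    """parents: {parent_id: feature tuple}; children: [(parent_id, feature tuple)].
    Clusters child rows joined with their parent's features, then gives each
    parent the majority label of its children (hypothetical aggregation rule)."""
    joined = [tuple(feat) + tuple(parents[pid]) for pid, feat in children]
    row_labels = kmeans(joined, k)
    votes = {}
    for (pid, _), lab in zip(children, row_labels):
        votes.setdefault(pid, Counter())[lab] += 1
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}

# Toy two-table schema: each child row holds a foreign key to a parent row.
parents = {1: (0.0,), 2: (0.1,), 3: (10.0,), 4: (10.2,)}
children = [
    (1, (1.0,)), (1, (1.1,)), (2, (0.9,)),
    (3, (20.0,)), (3, (21.0,)), (4, (19.5,)),
]
labels = parent_cluster_labels(parents, children, k=2)
print(labels)
```

In this sketch the per-parent label plays the role of the latent variable: a generator for the parent table could emit a label alongside each parent row, and a generator for the child table could be conditioned on that label, so that correlations between child and parent attributes are carried through the foreign key.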
