SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models (2404.03299v1)
Abstract: Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data, so they struggle with real-world scenarios where features are distributed across multiple silos and must remain on premises. We introduce SiloFuse, a novel generative framework for high-quality synthesis from cross-silo tabular data. To preserve privacy, SiloFuse uses a distributed latent tabular diffusion architecture: autoencoders learn latent representations of each client's features, masking their actual values. We employ stacked distributed training to improve communication efficiency, reducing the number of communication rounds to one. Under SiloFuse, we prove the impossibility of data reconstruction for vertically partitioned synthesis and quantify privacy risks through three attacks using our benchmark framework. Experimental results on nine datasets show that SiloFuse is competitive with centralized diffusion-based synthesizers. Notably, SiloFuse scores 43.8 and 29.8 percentage points higher than GANs in resemblance and utility, respectively. Communication experiments show that stacked training incurs a fixed cost, whereas the cost of end-to-end training grows with the number of training iterations. Additionally, SiloFuse is robust to feature permutations and varying numbers of clients.
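To make the described architecture concrete, below is a minimal sketch of the two-stage idea from the abstract: each silo trains an autoencoder locally so that only latent codes, not raw feature values, leave the client, and a coordinator then trains a Gaussian denoising diffusion model over the concatenated latents after a single round of latent exchange (the "stacked" training regime). This is an illustrative sketch, not the authors' implementation; it assumes PyTorch, two clients with purely numerical features, and made-up names (`ClientAutoencoder`, `LatentDDPM`, `latent_dim`, `n_steps`).

```python
# Illustrative sketch (not the SiloFuse codebase) of per-client autoencoders
# plus a latent diffusion model trained on concatenated client latents.
import torch
import torch.nn as nn

class ClientAutoencoder(nn.Module):
    """Each silo trains this locally; only latent codes leave the client."""
    def __init__(self, n_features, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.GELU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.GELU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

class LatentDDPM(nn.Module):
    """Coordinator-side denoiser over the concatenated client latents."""
    def __init__(self, total_latent_dim, n_steps=100):
        super().__init__()
        self.n_steps = n_steps
        self.betas = torch.linspace(1e-4, 0.02, n_steps)
        self.alphas_bar = torch.cumprod(1.0 - self.betas, dim=0)
        self.net = nn.Sequential(nn.Linear(total_latent_dim + 1, 128), nn.GELU(),
                                 nn.Linear(128, total_latent_dim))

    def loss(self, z0):
        t = torch.randint(0, self.n_steps, (z0.shape[0],))
        noise = torch.randn_like(z0)
        ab = self.alphas_bar[t].unsqueeze(1)
        zt = ab.sqrt() * z0 + (1 - ab).sqrt() * noise   # forward (noising) process
        t_in = (t.float() / self.n_steps).unsqueeze(1)  # scalar time conditioning
        return ((self.net(torch.cat([zt, t_in], dim=1)) - noise) ** 2).mean()

# --- Stage 1: each client fits its autoencoder on its own feature partition ---
torch.manual_seed(0)
x_a, x_b = torch.randn(256, 3), torch.randn(256, 5)     # two silos' feature blocks
clients = [ClientAutoencoder(3), ClientAutoencoder(5)]
for ae, x in zip(clients, (x_a, x_b)):
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(200):
        recon, _ = ae(x)
        loss = ((recon - x) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: clients send latents once ("stacked" training, a single round) ---
with torch.no_grad():
    z = torch.cat([ae.encoder(x) for ae, x in zip(clients, (x_a, x_b))], dim=1)
ddpm = LatentDDPM(z.shape[1])
opt = torch.optim.Adam(ddpm.parameters(), lr=1e-3)
for _ in range(200):
    loss = ddpm.loss(z)
    opt.zero_grad(); loss.backward(); opt.step()
print("final diffusion loss:", float(loss))
```

Sampling would run the reverse diffusion process over the latent space and return each client's slice of the generated latents for local decoding; that step, along with categorical-feature handling, is omitted here for brevity.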