
SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models (2404.03299v1)

Published 4 Apr 2024 in cs.LG, cs.CR, cs.DB, and cs.DC

Abstract: Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data. Hence, they struggle with real-world scenarios where features are distributed across multiple silos, necessitating on-premise data storage. We introduce SiloFuse, a novel generative framework for high-quality synthesis from cross-silo tabular data. To ensure privacy, SiloFuse utilizes a distributed latent tabular diffusion architecture. Through autoencoders, latent representations are learned for each client's features, masking their actual values. We employ stacked distributed training to improve communication efficiency, reducing the number of rounds to a single step. Under SiloFuse, we prove the impossibility of data reconstruction for vertically partitioned synthesis and quantify privacy risks through three attacks using our benchmark framework. Experimental results on nine datasets showcase SiloFuse's competence against centralized diffusion-based synthesizers. Notably, SiloFuse achieves 43.8 and 29.8 higher percentage points over GANs in resemblance and utility. Experiments on communication show stacked training's fixed cost compared to the growing costs of end-to-end training as the number of training iterations increases. Additionally, SiloFuse proves robust to feature permutations and varying numbers of clients.


Summary

  • The paper introduces a novel framework that combines local autoencoders with latent diffusion models to synthesize high-quality synthetic tabular data while preserving privacy.
  • It tackles vertical partitioning challenges by encoding mixed data types efficiently and reducing communication overhead through a stacked training approach.
  • Benchmarking reveals significant improvements over GAN-based methods, with gains of 43.8 and 29.8 percentage points in data resemblance and utility, respectively.

SiloFuse: A Novel Approach for Cross-Silo Synthetic Data Generation using Latent Tabular Diffusion Models

Introduction

The proliferation of proprietary datasets across enterprises presents both an opportunity for collaborative knowledge discovery and a challenge, owing to privacy regulations such as the GDPR. Generating high-quality synthetic data that accurately mirrors the statistical properties of real datasets, without compromising privacy, remains a pivotal concern in distributed environments. Addressing this, the paper introduces SiloFuse, a framework that synthesizes cross-silo tabular data through a distributed latent tabular diffusion architecture.

Synthesis Challenge in Vertical Partitioning

Traditional synthesizers struggle with vertically partitioned datasets, where features are distributed across silos and must remain stored on-premise. The main challenges are:

  • Handling mixed data types necessitates innovative encoding strategies for both continuous and categorical variables, avoiding issues like sparsity and poor feature obfuscation inherent in one-hot encoding.
  • Ensuring the synthetic data captures cross-silo feature correlations without centralizing the original datasets, thus respecting privacy constraints.
  • Communicating efficiently during distributed training, avoiding costly data exchanges that grow with the number of training iterations.
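To make the encoding challenge concrete, the sketch below contrasts one-hot encoding of a high-cardinality categorical column with the compact dense latents an autoencoder-style embedding would produce. The column size, cardinality, and embedding width are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy categorical column with many distinct values, as is common in
# enterprise tables (e.g. product codes). Assumed data for illustration.
categories = rng.integers(0, 1000, size=5000)  # 1000 distinct codes

# One-hot encoding: one column per category -> very wide and extremely sparse.
one_hot = np.zeros((categories.size, 1000))
one_hot[np.arange(categories.size), categories] = 1.0
sparsity = 1.0 - one_hot.mean()  # fraction of zero entries

# A learned embedding (as a silo's autoencoder would produce) maps each code
# to a short dense vector instead: 8 dimensions rather than 1000.
embedding_table = rng.standard_normal((1000, 8))
latents = embedding_table[categories]  # (5000, 8), fully dense

print(one_hot.shape, latents.shape)   # (5000, 1000) (5000, 8)
print(round(sparsity, 3))             # 0.999
```

Because each row activates exactly one of 1000 columns, 99.9% of the one-hot matrix is zeros, while the dense latents carry the same information in a fraction of the width.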

Framework Design: SiloFuse

SiloFuse introduces a novel architecture combining autoencoders with latent diffusion models to synthesize data:

  • Local Autoencoders: Silos first encode their data features into continuous latents, addressing data diversity (continuous and categorical features) and reducing sparsity. These encoded latents are then sent to a central coordinator.
  • Latent Diffusion Model: At the coordinator, these latents are synthesized using a backbone generative Gaussian diffusion model, ensuring global feature correlations are learned in the latent space.
  • Stacked Training Paradigm: Autoencoders and the diffusion model undergo separate training phases—local autoencoder training followed by centralized diffusion model training—significantly reducing the communication overhead to a single round of latent exchange.
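The three components above can be sketched end to end. This is a deliberately minimal numpy stand-in, not SiloFuse's actual architecture: PCA plays the role of each silo's autoencoder, and a fitted Gaussian plays the role of the latent diffusion backbone; silo counts, feature widths, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_linear_autoencoder(x, k):
    """PCA as a minimal stand-in for a silo's autoencoder."""
    mu = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mu, full_matrices=False)
    return mu, vt[:k].T  # mean and (features -> k latents) encoder

# Two silos hold disjoint feature sets for the same 1000 rows (toy data).
n = 1000
silo_a = rng.normal(size=(n, 6))
silo_b = rng.normal(size=(n, 4))

# Phase 1: each silo trains its autoencoder locally and encodes its features.
mu_a, enc_a = fit_linear_autoencoder(silo_a, k=3)
mu_b, enc_b = fit_linear_autoencoder(silo_b, k=2)
z_a = (silo_a - mu_a) @ enc_a
z_b = (silo_b - mu_b) @ enc_b

# Single communication round: latents (never raw features) go to the
# coordinator, which concatenates them so cross-silo correlations live in
# one latent table.
z = np.concatenate([z_a, z_b], axis=1)  # (1000, 5)

# Phase 2: the coordinator trains a generative model on z. A fitted Gaussian
# stands in here for SiloFuse's Gaussian diffusion backbone.
mean, cov = z.mean(axis=0), np.cov(z, rowvar=False)
z_synth = rng.multivariate_normal(mean, cov, size=n)

# Each silo decodes only its own slice of the synthetic latents locally.
synth_a = z_synth[:, :3] @ enc_a.T + mu_a
print(z.shape, synth_a.shape)  # (1000, 5) (1000, 6)
```

Note how the protocol needs exactly one latent upload per silo regardless of how many iterations the central generative model trains for, which is the source of the fixed communication cost discussed below.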

The framework is theoretically grounded by a proof that raw-feature reconstruction is impossible under vertically partitioned synthesis, strengthening its privacy guarantees.

Benchmarking and Evaluation

SiloFuse has been rigorously evaluated against centralized methods on nine datasets. The framework:

  • Demonstrates notable gains over GAN-based baselines, achieving 43.8 and 29.8 percentage points improvement in resemblance and utility, respectively.
  • Incurs a fixed communication cost, in contrast to end-to-end training whose cost grows with the number of training iterations, thanks to the stacked training approach.
  • Maintains robustness against feature permutations and varying numbers of clients, indicating a high degree of flexibility and resilience in diverse data distribution scenarios.
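Utility of synthetic tabular data is commonly measured in a train-on-synthetic, test-on-real fashion: fit a model on synthetic data and compare its held-out accuracy on real data against a model fit on real data. The sketch below illustrates that protocol on a toy dataset, with a nearest-centroid classifier standing in for the gradient-boosted models typically used in such benchmarks; it is not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_table(n):
    """Toy labelled table; a stand-in for one of the benchmark datasets."""
    x = rng.normal(size=(n, 5))
    y = (x @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(int)
    return x, y

def fit_centroids(x, y):
    """Nearest-centroid classifier: a minimal stand-in for XGBoost."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, x):
    d = np.linalg.norm(x[:, None, :] - centroids[None], axis=2)
    return d.argmin(axis=1)

x_real, y_real = make_table(2000)
x_test, y_test = make_table(500)
# "Synthetic" data here is just lightly perturbed real data, standing in
# for a synthesizer's output so the comparison runs end to end.
x_synth = x_real + rng.normal(scale=0.1, size=x_real.shape)

acc_real = (predict(fit_centroids(x_real, y_real), x_test) == y_test).mean()
acc_synth = (predict(fit_centroids(x_synth, y_real), x_test) == y_test).mean()
print(round(acc_real, 2), round(acc_synth, 2))
```

The closer the synthetic-trained accuracy tracks the real-trained accuracy, the higher the utility score; resemblance metrics (e.g. per-column divergences and correlation differences) are computed separately on the data distributions themselves.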

Implications and Future Directions

SiloFuse's approach to synthetic data generation not only addresses the pressing need for privacy-preserving data sharing across silos but also opens new avenues for collaborative data analysis without compromising data privacy. Its ability to efficiently manage communication costs and maintain data utility under privacy constraints presents a scalable solution for enterprises looking to leverage shared knowledge. Future work could explore enhancements in the model's ability to handle even more diverse datasets, or investigate novel paradigms for secure, privacy-preserving computation to further enrich collaborative data science endeavors.

Conclusion

SiloFuse represents a significant advancement in the field of synthetic data generation, especially for vertically partitioned, cross-silo scenarios. By marrying the concepts of latent diffusion models with autoencoders within a distributed architecture, it innovatively tackles the dual challenge of privacy preservation and data utility. This framework sets a new bar for future research in distributed synthetic data generation, paving the way for more sophisticated and privacy-compliant data collaboration techniques in the digital age.