
FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation (2401.06263v1)

Published 11 Jan 2024 in cs.LG

Abstract: Realistic synthetic tabular data generation encounters significant challenges in preserving privacy, especially when dealing with sensitive information in domains like finance and healthcare. In this paper, we introduce Federated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original tabular datasets. Leveraging the strengths of Denoising Diffusion Probabilistic Models (DDPMs), our approach addresses the inherent complexities in tabular data, such as mixed attribute types and implicit relationships. More critically, FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality. We extend DDPMs into the federated setting for tabular data generation, which includes a synchronous update scheme and weighted averaging for effective model aggregation. Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.

Authors (3)
  1. Timur Sattarov (11 papers)
  2. Marco Schreyer (14 papers)
  3. Damian Borth (64 papers)
Citations (2)

Summary

Federated Learning and Synthetic Data

In data analytics, privacy-preserving techniques are of utmost importance, particularly when dealing with sensitive information from sectors like finance and healthcare. A promising approach to this problem is synthetic data, which preserves privacy while also facilitating data sharing, regulatory compliance, and deeper analysis without disclosure risk. Synthetic data mirrors the statistical properties of real data, allowing analysts to draw insights while protecting individual records.

Innovations in Federated Learning

A novel framework named Federated Tabular Diffusion, or FedTabDiff, has been introduced for generating mixed-type tabular data, a composite of categorical, numerical, and ordinal attributes. The framework combines Denoising Diffusion Probabilistic Models (DDPMs) with federated learning: DDPMs are well regarded for producing high-quality synthetic samples, while federated learning (FL) avoids consolidating sensitive data in one location, instead allowing multiple parties to contribute to a joint model while keeping their data local. A sketch of the diffusion side appears below.
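To make the combination concrete, here is a minimal sketch of a single DDPM training step on encoded tabular rows. The network `eps_model`, the linear noise schedule, and the encoding of rows as flat vectors are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch: one DDPM noise-prediction training step on tabular rows
# encoded as flat vectors (numeric features plus one-hot categoricals).
# `eps_model` is any network mapping (x_t, t) -> predicted noise (assumed).
import torch
import torch.nn.functional as F

T = 1000                                       # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (assumed)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def ddpm_training_loss(eps_model, x0):
    """x0: (batch, features) tensor of encoded tabular rows."""
    t = torch.randint(0, T, (x0.size(0),))             # random timestep per row
    a = alpha_bar[t].unsqueeze(-1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise     # forward (noising) process
    return F.mse_loss(eps_model(x_t, t), noise)        # learn to predict the noise
```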

Privacy-Preserving Data Generation

FedTabDiff is notable for a generative-modeling approach that respects data privacy. Participants in the federated network, referred to as clients, train local models on their own data. Periodically, these local model parameters are communicated to a central server, which aggregates them into an improved global model and redistributes the updated version to all clients. This cycle ensures that no raw data ever leaves its original repository, maintaining confidentiality and locality.
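The synchronous update scheme with weighted averaging mentioned in the abstract corresponds to a FedAvg-style round. Below is a minimal sketch under that assumption; the `(n_k, train_fn)` client interface and function names are hypothetical.

```python
# Hedged sketch of one synchronous federated round with dataset-size-weighted
# parameter averaging (FedAvg-style), as described above.
import copy
import torch

def federated_round(global_model, clients):
    """clients: list of (n_k, train_fn) pairs; n_k = local dataset size."""
    states, sizes = [], []
    for n_k, train_locally in clients:
        local = copy.deepcopy(global_model)   # client receives the global model
        train_locally(local)                  # local DDPM updates on private data
        states.append(local.state_dict())     # only parameters leave the client
        sizes.append(n_k)
    total = float(sum(sizes))
    averaged = {
        name: sum(s[name] * (n / total) for s, n in zip(states, sizes))
        for name in states[0]
    }
    global_model.load_state_dict(averaged)    # server aggregates and redistributes
    return global_model
```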

Evaluation and Results

FedTabDiff underwent thorough evaluation on real-world financial and healthcare datasets. It synthesized data effectively while complying with privacy requirements, and it consistently outperformed non-federated counterparts across a range of metrics, including fidelity (resemblance to the original data), utility (applicability to downstream tasks), and coverage (diversity of the represented data). Notably, it also scored well on privacy metrics, reducing concerns over data leakage. An illustrative fidelity probe follows.
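As one illustration of how fidelity might be checked, the snippet below computes a two-sample Kolmogorov-Smirnov statistic per numeric column (lower means the synthetic marginal is closer to the real one). Treating fidelity as per-column distributional distance is an assumption for this sketch, not necessarily the paper's exact metric.

```python
# Illustrative fidelity probe (assumed metric): per-column two-sample
# Kolmogorov-Smirnov statistics between real and synthetic numeric columns.
import pandas as pd
from scipy.stats import ks_2samp

def column_fidelity(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.Series:
    numeric_cols = real.select_dtypes("number").columns
    return pd.Series({
        col: ks_2samp(real[col], synthetic[col]).statistic
        for col in numeric_cols
    })
```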

Conclusion and Future Directions

FedTabDiff's ability to produce privacy-compliant, high-fidelity, and useful synthetic tabular data marks a notable advance in federated learning. Its methodology opens the door to more collaborative and responsible use of AI, especially in sensitive fields. Prospective research directions include enhancing federated learning protocols and refining the diffusion process, accelerating the deployment of secure AI systems.
