FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation (2401.06263v1)
Abstract: Realistic synthetic tabular data generation encounters significant challenges in preserving privacy, especially when dealing with sensitive information in domains like finance and healthcare. In this paper, we introduce Federated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original tabular datasets. Leveraging the strengths of Denoising Diffusion Probabilistic Models (DDPMs), our approach addresses the inherent complexities in tabular data, such as mixed attribute types and implicit relationships. More critically, FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality. We extend DDPMs into the federated setting for tabular data generation, which includes a synchronous update scheme and weighted averaging for effective model aggregation. Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
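The weighted averaging mentioned in the abstract can be illustrated with a minimal sketch. It assumes a FedAvg-style rule in which each client's parameters are weighted by its local dataset size, and that one synchronous round ends once every client has reported. The function name `weighted_average`, the dictionary layout, and the toy values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def weighted_average(
    client_params: Sequence[Dict[str, List[float]]],
    client_sizes: Sequence[int],
) -> Dict[str, List[float]]:
    """Aggregate client parameters, weighting each client by its sample count.

    Assumption: every client reports the same parameter names and shapes,
    as would be the case at the end of one synchronous training round.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated: Dict[str, List[float]] = {}
    for name, reference in client_params[0].items():
        aggregated[name] = [
            # Weighted sum over clients for the i-th entry of this parameter.
            sum(size * params[name][i]
                for params, size in zip(client_params, client_sizes)) / total
            for i in range(len(reference))
        ]
    return aggregated


# Toy round: two clients holding 1 and 3 local records respectively.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
sizes = [1, 3]
global_params = weighted_average(clients, sizes)
print(global_params)  # {'w': [2.5, 3.5]}
```

The client with more local records pulls the aggregate toward its own parameters, which is the usual motivation for size-based weighting when data is unevenly distributed across participants.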
Authors: Timur Sattarov, Marco Schreyer, Damian Borth