FLAIM: AIM-based Synthetic Data Generation in the Federated Setting (2310.03447v3)
Abstract: Preserving individual privacy while enabling collaborative data sharing is crucial for organizations. Synthetic data generation is one solution, producing artificial data that mirrors the statistical properties of private data. While numerous techniques have been devised under differential privacy, they predominantly assume data is centralized. However, data is often distributed across multiple clients in a federated manner. In this work, we initiate the study of federated synthetic tabular data generation. Building upon a state-of-the-art central method known as AIM, we present DistAIM and FLAIM. We first show that it is straightforward to distribute AIM by extending a recent approach based on secure multi-party computation; however, this approach incurs additional overhead, making it less suited to federated scenarios. We then demonstrate that naively federating AIM can lead to substantial degradation in utility in the presence of heterogeneity. To mitigate both issues, we propose an augmented FLAIM approach that maintains a private proxy of heterogeneity. We simulate our methods across a range of benchmark datasets under different degrees of heterogeneity and show we can improve utility while reducing overhead.
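To make the heterogeneity issue concrete, the following is a minimal sketch of aggregating a one-way marginal (a histogram of one discrete attribute) across clients and adding Gaussian noise. It is not the DistAIM or FLAIM protocol: the function names, the toy client data, and the choice to add noise centrally after summation are all assumptions made for illustration.

```python
import numpy as np

def local_marginal(rows, attr, domain_size):
    """Histogram of one discrete attribute over a client's local rows."""
    return np.bincount(rows[:, attr], minlength=domain_size).astype(float)

def aggregate_noisy_marginal(client_rows, attr, domain_size, sigma, rng):
    """Sum the client histograms, then add Gaussian noise to the total.

    Assumption: noise is added once, after summation. A real deployment
    would combine this with secure aggregation or a similar mechanism.
    """
    total = sum(local_marginal(r, attr, domain_size) for r in client_rows)
    return total + rng.normal(0.0, sigma, size=domain_size)

rng = np.random.default_rng(0)

# Two hypothetical clients whose values for attribute 0 are fully skewed.
client_a = np.array([[0, 1], [0, 0], [0, 1]])
client_b = np.array([[1, 1], [1, 0], [1, 1]])

# sigma=0.0 switches the noise off so the effect of skew alone is visible.
full = aggregate_noisy_marginal([client_a, client_b], 0, 2, sigma=0.0, rng=rng)
partial = aggregate_noisy_marginal([client_a], 0, 2, sigma=0.0, rng=rng)

print(full)     # marginal over both clients
print(partial)  # marginal when only client_a participates
```

With both clients the marginal is balanced, while a round that only reaches `client_a` sees a marginal concentrated on a single value. This illustrates, in a toy setting, why answers computed from a subset of heterogeneous clients can differ sharply from the global answer.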
Authors: Samuel Maddock, Graham Cormode, Carsten Maple