Synthetic Data: Methods, Use Cases, and Risks (2303.01230v3)
Published 1 Mar 2023 in cs.CR, cs.AI, and cs.CY
Abstract: Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of a sensitive nature, and thus, sharing them can endanger the privacy of users and organizations. A possible alternative gaining momentum in both the research community and industry is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data -- more precisely, having similar statistical properties. In this article, we provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.
Author: Emiliano De Cristofaro