Synthetic Data: Methods, Use Cases, and Risks

Published 1 Mar 2023 in cs.CR, cs.AI, and cs.CY | (2303.01230v3)

Abstract: Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of a sensitive nature, and thus, sharing them can endanger the privacy of users and organizations. A possible alternative gaining momentum in both the research community and industry is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data or, more precisely, that have similar statistical properties. In this article, we provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.
