
Synthetic Data: Methods, Use Cases, and Risks (2303.01230v3)

Published 1 Mar 2023 in cs.CR, cs.AI, and cs.CY

Abstract: Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of a sensitive nature, and thus, sharing them can endanger the privacy of users and organizations. A possible alternative gaining momentum in both the research community and industry is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data -- more precisely, having similar statistical properties. In this article, we provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.
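To make the core idea in the abstract concrete, here is a minimal sketch (not the paper's method) of what "releasing artificially generated data with similar statistical properties" can mean in the simplest case: fit a generative model to the sensitive records and share samples from the model instead of the records themselves. The multivariate Gaussian model, the toy dataset, and all variable names below are illustrative assumptions; real synthetic data pipelines use far richer generators (e.g., GANs or Bayesian networks) and, ideally, differential privacy.

```python
import numpy as np

# Illustrative sketch: "synthetic data" as samples from a model fitted to real data.
# The generative model here is just a multivariate Gaussian; real generators are richer.

rng = np.random.default_rng(seed=0)

# Stand-in for a sensitive dataset: 1,000 records with 3 numeric attributes.
real_data = rng.normal(loc=[50.0, 10.0, 0.0], scale=[5.0, 2.0, 1.0], size=(1000, 3))

# Fit the model: estimate the mean vector and covariance matrix of the real data.
mean = real_data.mean(axis=0)
cov = np.cov(real_data, rowvar=False)

# "Release" synthetic records by sampling from the fitted model
# instead of sharing the real records.
synthetic_data = rng.multivariate_normal(mean, cov, size=1000)

# Check that the synthetic data preserves first- and second-order statistics.
print("real mean:     ", np.round(real_data.mean(axis=0), 2))
print("synthetic mean:", np.round(synthetic_data.mean(axis=0), 2))
print("real std:      ", np.round(real_data.std(axis=0), 2))
print("synthetic std: ", np.round(synthetic_data.std(axis=0), 2))
```

Note that matching aggregate statistics does not by itself protect privacy: as the paper discusses, synthetic records can still leak information about individuals in the training data unless the generator is trained with explicit privacy guarantees.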

Authors (1)
  1. Emiliano De Cristofaro
Citations (8)