The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against "Truly Anonymous" Synthetic Datasets (2312.05114v2)

Published 8 Dec 2023 in cs.CR, cs.AI, and cs.LG

Abstract: Generative models producing synthetic data are meant to provide a privacy-friendly approach to releasing data. However, their privacy guarantees are only considered robust when models satisfy Differential Privacy (DP). Alas, this is not a ubiquitous standard, as many leading companies (and, in fact, research papers) use ad-hoc privacy metrics based on testing the statistical similarity between synthetic and real data. In this paper, we examine the privacy metrics used in real-world synthetic data deployments and demonstrate their unreliability in several ways. First, we provide counter-examples where severe privacy violations occur even if the privacy tests pass and instantiate accurate membership and attribute inference attacks with minimal cost. We then introduce ReconSyn, a reconstruction attack that generates multiple synthetic datasets that are considered private by the metrics but actually leak information unique to individual records. We show that ReconSyn recovers 78-100% of the outliers in the train data with only black-box access to a single fitted generative model and the privacy metrics. In the process, we show that applying DP only to the model does not mitigate this attack, as using privacy metrics breaks the end-to-end DP pipeline.
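
To make the abstract's notion of a similarity-based privacy test concrete, here is a minimal sketch of a distance-to-closest-record (DCR) style acceptance rule of the kind such deployments use: synthetic data "passes" if its records are, on the whole, no closer to the training records than held-out real records are. The function names, the Euclidean metric, and the median-based threshold are illustrative assumptions, not the paper's or any vendor's exact test.

```python
# A toy DCR-style privacy test (illustrative only, not the paper's code).
import numpy as np

def nearest_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """For each row of `a`, Euclidean distance to its nearest row in `b`."""
    diffs = a[:, None, :] - b[None, :, :]          # shape (len(a), len(b), dim)
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def passes_dcr_test(synthetic: np.ndarray, real_train: np.ndarray,
                    real_holdout: np.ndarray) -> bool:
    # Accept if synthetic records are no closer to the train set than
    # held-out real records are (median-based rule, an assumption here).
    dcr_syn = nearest_distances(synthetic, real_train)
    dcr_ref = nearest_distances(real_holdout, real_train)
    return np.median(dcr_syn) >= np.median(dcr_ref)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 5))
holdout = rng.normal(size=(200, 5))
leaky_synth = train[:200] + rng.normal(scale=0.01, size=(200, 5))  # near-copies
print(passes_dcr_test(leaky_synth, train, holdout))  # False: near-copies flagged
```

The paper's core argument is that passing such a test is not a privacy guarantee: near-copies of only a few outliers can slip past aggregate thresholds, and, as ReconSyn demonstrates, the pass/fail signal itself can be queried as an oracle to reconstruct training outliers.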

Authors (2)
  1. Georgi Ganev (15 papers)
  2. Emiliano De Cristofaro (117 papers)
Citations (7)
