
A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data (2301.10053v3)

Published 24 Jan 2023 in cs.LG and cs.CR

Abstract: Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn statistical properties of real data in order to generate "artificial" data that are structurally and statistically similar to sensitive data. However, prior research suggests that inference attacks on synthetic data can undermine privacy, but only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this is the result of individual information leakage (as opposed to population-level inference). We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best method evaluated, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility but at the cost of making attacks far more effective.
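
As a rough illustration of the linear reconstruction idea the abstract refers to, the toy sketch below recovers a secret binary attribute from noisy aggregate counts by solving a least-squares problem and rounding. It is not the paper's attack: the query matrix `A`, the Gaussian noise level, the sizes, and the helper name `linear_reconstruction` are all assumptions made for illustration. In the setting the abstract describes, the counts would instead be computed from released synthetic records.

```python
import numpy as np


def linear_reconstruction(A, answers):
    """Estimate a binary secret vector from noisy linear query answers.

    A[j, i] = 1 if record i is counted by query j (hypothetical setup).
    answers[j] is approximately sum_i A[j, i] * secret[i].
    """
    # Least-squares estimate of the real-valued relaxation of the secret.
    est, *_ = np.linalg.lstsq(A, answers, rcond=None)
    # Clip to the valid range and round to the nearest bit.
    return (np.clip(est, 0.0, 1.0) >= 0.5).astype(int)


rng = np.random.default_rng(0)
n_records, n_queries = 50, 400  # illustrative sizes, not from the paper

# Secret binary attribute, one bit per record.
secret = rng.integers(0, 2, size=n_records)

# Random subset-count queries over the records.
A = rng.integers(0, 2, size=(n_queries, n_records)).astype(float)

# Noisy aggregate answers; the noise stands in for whatever distortion
# the release mechanism introduces (an assumption of this sketch).
answers = A @ secret + rng.normal(0.0, 1.0, size=n_queries)

guess = linear_reconstruction(A, answers)
accuracy = float((guess == secret).mean())
print(f"fraction of secret bits recovered: {accuracy:.2f}")
```

With many more queries than records and modest noise, the rounded least-squares solution tends to match most of the secret bits, which is why accurate aggregate statistics can leak information about arbitrary records rather than only outliers.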
