
Differentially Private Data Generation with Missing Data (2310.11548v2)

Published 17 Oct 2023 in cs.DB and cs.CR

Abstract: Although several existing works succeed in generating synthetic data with differential privacy (DP) guarantees, they fall short of producing high-quality synthetic data when the input data contains missing values. In this work, we formalize the problem of DP synthetic data generation with missing values and propose three effective adaptive strategies that significantly improve the utility of the synthetic data on four real-world datasets with different types and levels of missing data and privacy requirements. We also characterize the relationship between the privacy impact on the complete ground-truth data and on the incomplete data for these DP synthetic data generation algorithms. By modeling the missing-data mechanism as a sampling process, we obtain tighter upper bounds on the privacy guarantees with respect to the ground-truth data. Overall, this study contributes to a better understanding of the challenges and opportunities for using private synthetic data generation algorithms in the presence of missing data.
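The abstract's idea of modeling missingness as a sampling process echoes the standard privacy-amplification-by-subsampling argument: if each record of the ground-truth data is retained independently with probability q, a pure ε-DP mechanism run on the retained records satisfies log(1 + q·(e^ε − 1))-DP with respect to the complete data. The sketch below illustrates that textbook bound only; the paper's actual bounds for its specific missing-data mechanisms may differ.

```python
import math

def amplified_epsilon(eps: float, q: float) -> float:
    """Privacy amplification by Poisson subsampling for a pure eps-DP
    mechanism: if each record is included independently with probability q,
    the composed mechanism is log(1 + q*(e^eps - 1))-DP overall."""
    return math.log(1.0 + q * (math.exp(eps) - 1.0))

# Treating a 30% missing-completely-at-random rate as sampling with
# inclusion probability q = 0.7 tightens the guarantee with respect
# to the complete ground-truth data.
eps, q = 1.0, 0.7
print(amplified_epsilon(eps, q))  # strictly smaller than eps = 1.0
```

Note that q = 1 (no missing data) recovers the original ε, and the bound shrinks monotonically as more records are missing, which matches the intuition that a record absent from the observed data is better protected.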

Authors (4)
  1. Shubhankar Mohapatra (6 papers)
  2. Jianqiao Zong (1 paper)
  3. Florian Kerschbaum (50 papers)
  4. Xi He (57 papers)
Citations (1)
