Does Differentially Private Synthetic Data Lead to Synthetic Discoveries? (2403.13612v2)
Abstract: Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off. Objectives: To investigate the reliability of group differences identified by independent sample tests on DP-synthetic data. The evaluation is conducted in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries. Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70,000), as well as from bivariate and multivariate simulated data. Five DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms. Conclusion: A large portion of the evaluation results exhibited dramatically inflated Type I errors, especially at privacy budget levels of $\epsilon\leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error rates at all privacy levels tested, but it required a large original dataset size and a modest privacy budget ($\epsilon\geq 5$) in order to achieve reasonable Type II error.
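To make the abstract's headline finding concrete, the following is a minimal, self-contained sketch (plain Python, all function names hypothetical, not the paper's experimental code) of the mechanism it warns about: a basic DP histogram release adds Laplace noise to bin counts, and at small $\epsilon$ the two groups' noisy histograms diverge even when the underlying groups are identical, so an independent sample test on the synthetic data rejects far more often than the nominal 5% level. The Mann-Whitney U p-value here uses the normal approximation with midranks for ties (tie correction omitted), and the bin range is taken from the data for brevity, which a genuinely private release would avoid.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_histogram_sample(values, bins, epsilon, n_out, rng):
    """Basic DP histogram release: noisy bin counts, then resampling.

    Sensitivity is 1 (one record moves one bin count by 1), so Laplace
    noise with scale 1/epsilon gives epsilon-DP for the counts.  The bin
    range is taken from the data here for brevity; a genuinely private
    release would fix the bins a priori.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0.0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    noisy = [max(c + laplace_noise(1.0 / epsilon, rng), 0.0) for c in counts]
    if sum(noisy) <= 0:
        noisy = [1.0] * bins  # all counts clipped away: fall back to uniform
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return rng.choices(centers, weights=noisy, k=n_out)

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation.

    Midranks handle ties (frequent here, since synthetic values sit on
    bin centers); the tie correction of the variance is omitted.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(x + y)
    ranks, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + j + 1) / 2.0  # midrank of positions i..j-1
        i = j
    u1 = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    if sigma == 0:
        return 1.0
    z = (u1 - mu) / sigma
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def type1_error_rate(epsilon, n=100, bins=10, reps=200, alpha=0.05, seed=0):
    """Estimate the test's Type I error on DP-synthetic data.

    Both groups come from the SAME distribution, so every rejection is
    a false discovery; a valid test should reject at a rate <= alpha.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        syn_a = dp_histogram_sample(a, bins, epsilon, n, rng)
        syn_b = dp_histogram_sample(b, bins, epsilon, n, rng)
        if mann_whitney_p(syn_a, syn_b) < alpha:
            rejections += 1
    return rejections / reps

if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps:5}: estimated Type I error "
              f"{type1_error_rate(eps):.2f} (nominal 0.05)")
```

Sweeping $\epsilon$ in the last loop illustrates the trade-off the paper studies: as the privacy budget shrinks, the noise increasingly masquerades as a group difference, which is exactly why low p-values on DP-synthetic data should be interpreted with caution.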
Authors: Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala