Quantifying and Mitigating Privacy Risks for Tabular Generative Models (2403.07842v1)

Published 12 Mar 2024 in cs.LG and cs.CR

Abstract: Synthetic data from generative models has emerged as a privacy-preserving data-sharing solution. Such a synthetic data set should resemble the original data without revealing identifiable private information. The backbone technology of tabular synthesizers is rooted in image generative models, ranging from Generative Adversarial Networks (GANs) to recent diffusion models. Prior work sheds light on the utility-privacy tradeoff on tabular data, revealing and quantifying privacy risks in synthetic data. We first conduct an exhaustive empirical analysis of the utility-privacy tradeoff of five state-of-the-art tabular synthesizers against eight privacy attacks, with a special focus on membership inference attacks. Motivated by the observation that tabular diffusion models deliver high data quality but also incur high privacy risk, we propose DP-TLDM, a Differentially Private Tabular Latent Diffusion Model, composed of an autoencoder network that encodes the tabular data and a latent diffusion model that synthesizes the latent tables. Following the emerging f-DP framework, we apply DP-SGD with batch clipping to train the autoencoder, and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM achieves a meaningful theoretical privacy guarantee while significantly enhancing the utility of the synthetic data. Specifically, compared to other DP-protected tabular generative models, DP-TLDM improves synthetic quality by an average of 35% in data resemblance, 15% in downstream-task utility, and 50% in data discriminability, all while preserving a comparable level of privacy risk.
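
The abstract describes a two-stage recipe: an autoencoder trained with DP-SGD under batch clipping (clipping the aggregated batch gradient rather than per-example gradients), followed by a latent diffusion model fit on the encoded tables. The PyTorch sketch below illustrates that training step under stated assumptions; the network shape, `clip_norm`, and `noise_multiplier` are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TabularAutoencoder(nn.Module):
    """Toy autoencoder for pre-processed numeric table rows (illustrative)."""
    def __init__(self, n_features: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def dp_sgd_batch_step(model, optimizer, batch,
                      clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step with *batch* clipping: the whole-batch gradient is
    clipped to `clip_norm`, then Gaussian noise with standard deviation
    `noise_multiplier * clip_norm` is added before the optimizer step."""
    optimizer.zero_grad()
    recon, _ = model(batch)
    loss = nn.functional.mse_loss(recon, batch)
    loss.backward()
    # Clip the aggregated batch gradient (batch clipping, not per-example).
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    # Add isotropic Gaussian noise calibrated to the clipping threshold.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad),
                            alpha=noise_multiplier * clip_norm)
    optimizer.step()
    return loss.item()

# The latent diffusion model would then be trained on the encoder's outputs;
# how the overall f-DP accounting (via the separation value) composes across
# the two stages is detailed in the paper, not in this sketch.
```

A real implementation would also track the cumulative privacy loss over all training steps; the paper does this with the separation value under the f-DP framework, rather than a standard (epsilon, delta) accountant.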

Authors (6)
  1. Chaoyi Zhu (10 papers)
  2. Jiayi Tang (4 papers)
  3. Hans Brouwer (4 papers)
  4. Juan F. Pérez (1 paper)
  5. Marten van Dijk (36 papers)
  6. Lydia Y. Chen (47 papers)
Citations (3)