
Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning (2307.01875v1)

Published 4 Jul 2023 in cs.LG and cs.CR

Abstract: The availability of large amounts of informative data is crucial for successful machine learning. However, in domains with sensitive information, the release of high-utility data which protects the privacy of individuals has proven challenging. Despite progress in differential privacy and generative modeling for privacy-preserving data release in the literature, only a few approaches optimize for machine learning utility: most approaches only take into account statistical metrics on the data itself and fail to explicitly preserve the loss metrics of machine learning models that are to be subsequently trained on the generated data. In this paper, we introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning, while preserving differential privacy. We also describe a specific implementation of this framework that leverages mixture models to approximate, kernel-inducing points to adapt, and Gaussian differential privacy to anonymize a dataset, in order to ensure that the resulting data is both privacy-preserving and high utility. We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets, when evaluated on held-out real data. We also compare our results with several privacy-preserving synthetic data generation models (such as differentially private generative adversarial networks), and report significant increases in classification performance metrics compared to state-of-the-art models. These favorable comparisons show that the presented framework is a promising direction of research, increasing the utility of low-risk synthetic data release for machine learning.
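The pipeline the abstract describes can be illustrated with a minimal, hedged sketch. This is not the authors' implementation: it collapses "Approximate" to one Gaussian per class (a degenerate mixture model), omits the "Adapt" step (kernel-inducing points), and uses the classical Gaussian mechanism for noise calibration rather than the Gaussian differential privacy accounting the paper employs. The function name, the assumed L2 sensitivity, and all parameters are illustrative assumptions.

```python
import numpy as np

def release_3a_sketch(X, y, n_synth=200, epsilon=1.0, delta=1e-5, seed=0):
    """Hedged sketch of a 3A-style release (Approximate -> Anonymize only).

    Approximate: fit one Gaussian per class label (a degenerate mixture model).
    Anonymize: perturb the sampled points with Gaussian noise calibrated via the
    classical Gaussian mechanism, sigma = sqrt(2 ln(1.25/delta)) * s / epsilon.
    The Adapt step (kernel-inducing points) from the paper is omitted here.
    """
    rng = np.random.default_rng(seed)
    sensitivity = 1.0  # assumed L2 sensitivity; in practice this depends on data clipping
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    X_out, y_out = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        mu = Xc.mean(axis=0)
        # Regularize the covariance so sampling stays well-conditioned.
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        samples = rng.multivariate_normal(mu, cov, size=n_synth)
        samples += rng.normal(0.0, sigma, size=samples.shape)  # Gaussian-mechanism noise
        X_out.append(samples)
        y_out.append(np.full(n_synth, label))
    return np.vstack(X_out), np.concatenate(y_out)
```

A downstream classifier would then be trained on the returned `(X_synth, y_synth)` pair and evaluated on held-out real data, mirroring the evaluation protocol the abstract describes.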

Authors (4)
  1. Tamas Madl (4 papers)
  2. Weijie Xu (28 papers)
  3. Olivia Choudhury (5 papers)
  4. Matthew Howard (18 papers)
Citations (5)
