Differentially Private Synthetic Data Using KD-Trees (2306.13211v1)

Published 19 Jun 2023 in cs.CR, cs.LG, and stat.ML

Abstract: Creation of a synthetic dataset that faithfully represents the data distribution and simultaneously preserves privacy is a major research challenge. Many space-partitioning-based approaches have emerged in recent years for answering statistical queries in a differentially private manner. However, for the synthetic data generation problem, recent research has mainly focused on deep generative models. In contrast, we exploit space partitioning techniques together with noise perturbation and thus achieve intuitive and transparent algorithms. We propose both data-independent and data-dependent algorithms for $\epsilon$-differentially private synthetic data generation whose kernel density resembles that of the real dataset. Additionally, we provide theoretical results on the utility-privacy trade-offs and show how our data-dependent approach overcomes the curse of dimensionality and leads to a scalable algorithm. We show empirical utility improvements over prior work and discuss the performance of our algorithm on a downstream classification task on a real dataset.
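To make the general idea of "space partitioning together with noise perturbation" concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes data in the unit cube, a fixed uniform grid as the data-independent partition, add/remove-one neighboring datasets (so the cell-count histogram has L1 sensitivity 1), and Laplace noise of scale 1/ε. The function names `laplace` and `dp_grid_synthetic` are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_grid_synthetic(points, epsilon, bins_per_dim, n_out, seed=0):
    """Illustrative sketch: noisy grid histogram, then resampling.

    Assumes every point lies in [0, 1)^d and that neighboring datasets
    differ by adding or removing one point (count sensitivity 1).
    """
    rng = random.Random(seed)
    d = len(points[0])

    # Data-independent partition: a uniform grid fixed in advance.
    counts = Counter(
        tuple(min(int(x * bins_per_dim), bins_per_dim - 1) for x in p)
        for p in points
    )

    # Perturb every cell, including empty ones, so the set of
    # released cells does not depend on the private data.
    cells = list(product(range(bins_per_dim), repeat=d))
    noisy = [
        max(counts.get(c, 0) + laplace(1.0 / epsilon, rng), 0.0)
        for c in cells
    ]
    if sum(noisy) == 0:
        noisy = [1.0] * len(cells)  # fall back to uniform sampling

    # Post-processing: pick cells in proportion to their noisy counts,
    # then draw each synthetic point uniformly inside its cell.
    chosen = rng.choices(cells, weights=noisy, k=n_out)
    width = 1.0 / bins_per_dim
    return [tuple((i + rng.random()) * width for i in c) for c in chosen]


if __name__ == "__main__":
    real = [(0.10, 0.20), (0.15, 0.22), (0.80, 0.90)]
    for point in dp_grid_synthetic(real, epsilon=1.0, bins_per_dim=4, n_out=5):
        print(point)
```

The grid in this sketch has `bins_per_dim ** d` cells, so it grows exponentially with the dimension; that is the kind of cost the abstract says the data-dependent approach is designed to avoid.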

Authors (5)
  1. Eleonora Kreačić
  2. Navid Nouri
  3. Vamsi K. Potluru
  4. Tucker Balch
  5. Manuela Veloso