
Efficiently Computing Similarities to Private Datasets (2403.08917v1)

Published 13 Mar 2024 in cs.CR, cs.DS, and cs.LG

Abstract: Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data structure which approximates $\sum_{x \in X} f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging `low-dimensional structures' present in the specific functions $f$ that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD based approaches for comparable accuracy.
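To make the problem concrete, here is a minimal baseline sketch, not the paper's data structure: it releases the kernel sums $\sum_{x \in X} e^{-\|x-y\|_2^2/\sigma^2}$ directly via the Laplace mechanism, and uses them for the kind of average-similarity classification the abstract alludes to. The function names, the per-query budget splitting, and the disjoint-classes assumption are all illustrative choices, not the paper's method.

```python
import numpy as np

def dp_kernel_sums(X, queries, sigma=1.0, eps=1.0, rng=None):
    """Release noisy Gaussian-kernel sums sum_{x in X} exp(-||x-y||^2/sigma^2).

    The kernel is bounded in [0, 1], so adding or removing one private point
    changes each sum by at most 1 (sensitivity 1 per query). With k queries we
    split the eps budget by basic composition and add Laplace(k/eps) noise.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    k = len(queries)
    answers = []
    for y in queries:
        sq_dists = np.sum((X - np.asarray(y, dtype=float)) ** 2, axis=1)
        true_sum = np.exp(-sq_dists / sigma**2).sum()
        answers.append(true_sum + rng.laplace(scale=k / eps))
    return np.array(answers)

def classify_by_avg_similarity(class_datasets, y, eps=1.0, rng=None):
    """Toy DP classifier: assign y to the class with the largest noisy
    kernel sum. Assuming the per-class datasets are disjoint, the per-class
    queries enjoy parallel composition, so the classifier is eps-DP."""
    rng = np.random.default_rng(rng)
    sums = [dp_kernel_sums(Xc, [y], eps=eps, rng=rng)[0]
            for Xc in class_datasets]
    return int(np.argmax(sums))
```

Note that answering many queries this way degrades utility linearly in the number of queries; avoiding that blow-up (via dimensionality reduction and function decomposition) is precisely what the paper's data structures target.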
