Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets (2305.17148v3)
Abstract: Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA works without assuming the spectral gap for the covariance matrix.
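To make the private PCA step concrete, here is a minimal sketch of one common way to estimate a principal subspace under differential privacy: perturb the second-moment matrix with symmetric Gaussian noise, then take its top eigenvectors. It is not the paper's algorithm, since the abstract does not say which noise mechanism or privacy parameters the authors use. The assumptions are all ours: the function name `private_pca_basis`, row clipping to the unit ball, the uncentered second-moment matrix in place of the covariance, the replace-one-record neighboring relation, and the classical Gaussian-mechanism calibration (valid for `epsilon < 1`).

```python
import numpy as np


def private_pca_basis(X, k, epsilon, delta, rng):
    """Return a d x k orthonormal basis from a noise-perturbed second-moment matrix.

    Illustrative sketch only (hypothetical helper, not the paper's procedure).
    """
    n, d = X.shape

    # Clip every row to the unit Euclidean ball so that a single record
    # has bounded influence on the second-moment matrix.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_clipped = X / np.maximum(norms, 1.0)

    # Uncentered second-moment matrix (assumption: used instead of the covariance).
    M = X_clipped.T @ X_clipped / n

    # Replacing one clipped record changes M by at most 2/n in Frobenius norm.
    sensitivity = 2.0 / n

    # Classical Gaussian-mechanism noise scale for (epsilon, delta)-DP, epsilon < 1.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    # Symmetric noise: draw i.i.d. Gaussians, keep the upper triangle, mirror it.
    G = rng.normal(0.0, sigma, size=(d, d))
    E = np.triu(G) + np.triu(G, 1).T

    # Eigen-decompose the noisy matrix and keep the k leading eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(M + E)
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top]


# Usage on toy data with three dominant directions (synthetic, for illustration).
rng = np.random.default_rng(0)
scales = np.concatenate([np.full(3, 1.0), np.full(47, 0.05)])
X = rng.normal(size=(2000, 50)) * scales
V = private_pca_basis(X, k=3, epsilon=0.5, delta=1e-5, rng=rng)
```

Only the basis `V` is released privately here. Projecting the original records onto `V` does not by itself yield private data, so a synthetic-data method would still need a private mechanism operating in the low-dimensional space.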