NeuroMixGDP: A Neural Collapse-Inspired Random Mixup for Private Data Release (2202.06467v2)
Abstract: Privacy-preserving data release algorithms have gained increasing attention for their ability to protect user privacy while enabling downstream machine learning tasks. However, the utility of current popular algorithms is not always satisfactory. Mixup of raw data provides a new form of data augmentation that can help improve utility, but its performance deteriorates drastically when differential privacy (DP) noise is added. To address this issue, this paper draws inspiration from the recently observed Neural Collapse (NC) phenomenon, in which the last-layer features of a neural network concentrate at the vertices of a simplex Equiangular Tight Frame (ETF). We propose a scheme that mixes up Neural Collapse features to exploit the ETF simplex structure and releases noisy mixed features to enhance the utility of the released data. Using Gaussian Differential Privacy (GDP), we obtain an asymptotic rate for the optimal mixup degree. To further enhance utility and address the label collapse issue that arises when the mixup degree is large, we propose a hierarchical sampling method that stratifies the mixup samples over a small number of classes. This method remarkably improves utility when the number of classes is large. Extensive experiments demonstrate the effectiveness of the proposed method in protecting against attacks and improving utility. In particular, our approach shows significantly improved utility compared to directly training classification networks with DPSGD on the CIFAR100 and MiniImagenet datasets, highlighting the benefits of privacy-preserving data release. We release reproducible code at https://github.com/Lidonghao1996/NeuroMixGDP.
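The release step described in the abstract (mix up features of several examples, then add Gaussian noise) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name, the unit-norm clipping, and the 1/m sensitivity calibration are illustrative choices, and the paper's GDP accounting and hierarchical sampling are omitted.

```python
import numpy as np

def release_mixed_features(X, y, num_classes, m=4, sigma=1.0, n_out=100, seed=0):
    """Release noisy mixed features and labels (illustrative sketch).

    X: (n, d) feature matrix; rows are clipped to unit L2 norm so that one
       example's contribution to an m-way average is bounded by 1/m.
    y: (n,) integer class labels, mixed as averages of one-hot vectors.
    """
    rng = np.random.default_rng(seed)
    # Clip each feature vector to unit L2 norm (bounds per-example sensitivity).
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    Y = np.eye(num_classes)[y]  # one-hot labels
    mixed_X = np.empty((n_out, X.shape[1]))
    mixed_Y = np.empty((n_out, num_classes))
    for i in range(n_out):
        idx = rng.choice(len(X), size=m, replace=False)  # random mixup group
        mixed_X[i] = X[idx].mean(axis=0)
        mixed_Y[i] = Y[idx].mean(axis=0)
    # Replacing one example shifts each m-way average by at most O(1/m) in L2,
    # so the Gaussian noise scale shrinks with the mixup degree m.
    mixed_X += rng.normal(0.0, sigma / m, size=mixed_X.shape)
    mixed_Y += rng.normal(0.0, sigma / m, size=mixed_Y.shape)
    return mixed_X, mixed_Y
```

The key trade-off the paper analyzes is visible here: a larger mixup degree m reduces the required noise but also averages away per-class signal, which is why an optimal m exists and why label collapse appears when m is large.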
- DPPro: Differentially private high-dimensional data release via random projection. IEEE Transactions on Information Forensics and Security, 12(12):3081–3093, 2017. doi: 10.1109/TIFS.2017.2737966.
- PrivBayes: Private data release via Bayesian networks. ACM Trans. Database Syst., 42(4), oct 2017a. ISSN 0362-5915. doi: 10.1145/3134428. URL https://doi.org/10.1145/3134428.
- Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
- PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1zk9iRqF7.
- P3GM: Private high-dimensional data release via privacy-preserving phased generative model. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 169–180. IEEE, 2021.
- Synthesizing differentially private datasets using random mixing. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 542–546. IEEE, Jul 2019. ISBN 978-1-5386-9291-2. doi: 10.1109/ISIT.2019.8849381. URL https://ieeexplore.ieee.org/document/8849381/.
- Mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017b.
- Towards understanding the data dependency of mixup-style training, 2021.
- Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
- Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- BNP Paribas Cardif. Bnp paribas cardif claims management, 2016. URL https://www.kaggle.com/competitions/bnp-paribas-cardif-claims-management/data.
- Jane Street Group. Jane street market prediction, 2021. URL https://www.kaggle.com/competitions/jane-street-market-prediction/data.
- Ubiquant. Ubiquant market prediction, 2022. URL https://www.kaggle.com/competitions/ubiquant-market-prediction/data.
- American Express. American express - default prediction, 2022. URL https://www.kaggle.com/competitions/amex-default-prediction/data.
- InVitro Cell Research. Icr - identifying age-related conditions, 2023. URL https://www.kaggle.com/competitions/icr-identify-age-related-conditions/data.
- Gaussian differential privacy. Journal of the Royal Statistical Society, Series B, 00:1–35, 2021. (with discussion).
- Differentially private learning needs better features (or much more data). In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YTWGvpFOQD-.
- Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
- Learning multiple layers of features from tiny images. 2009.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- GS-WGAN: A gradient-sanitized approach for learning differentially private generators. Advances in Neural Information Processing Systems, 33:12673–12684, 2020a.
- G-PATE: Scalable differentially private data generator via private aggregation of teacher discriminators. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=_CmrI7UrmCl.
- DataLens: Scalable privacy preserving training via gradient compression and aggregation. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 2146–2168, 2021.
- Don’t generate me: Training differentially private generative models with sinkhorn divergence. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 12480–12492. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/67ed94744426295f96268f4ac1881b46-Paper.pdf.
- DP-MERF: Differentially private mean embeddings with random features for practical privacy-preserving data generation. In International conference on artificial intelligence and statistics, pages 1819–1827. PMLR, 2021.
- PEARL: Data synthesis via private embeddings and adversarial reconstruction learning. In International Conference on Learning Representations, 2021.
- Hermite polynomial features for private data generation. In International Conference on Machine Learning, pages 22300–22324. PMLR, 2022.
- Differentially private data generation needs better features. Transactions on Machine Learning Research, 2023.
- Tempered sigmoid activations for deep learning with differential privacy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9312–9321, 2021.
- Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, pages 148–162, 2019.
- Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018.
- ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models, 2018.
- Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, pages 265–284, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-32732-5.
- Winning the NIST contest: A scalable and general approach to differentially private synthetic data. CoRR, abs/2108.04978, 2021. URL https://arxiv.org/abs/2108.04978.
- Differentially private releasing via deep generative model (technical report). arXiv preprint arXiv:1801.01594, 2018.
- OpenAI. New and improved embedding model, 2022. URL https://openai.com/blog/new-and-improved-embedding-model.
- Neural collapse under MSE loss: Proximity to and dynamics on the central path. arXiv preprint arXiv:2106.02073, 2021.
- On the role of neural collapse in transfer learning. In International Conference on Learning Representations, 2022.
- A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.
- An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations, 2021.
- On the optimization landscape of neural collapse under MSE loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022a.
- Are all losses created equal: A neural collapse perspective. Advances in Neural Information Processing Systems, 35:31697–31710, 2022b.
- Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
- Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning, pages 21478–21505. PMLR, 2022.
- A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv preprint arXiv:2302.09419, 2023.
- Unlocking high-accuracy differentially private image classification through scale, 2022.
- A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, 2010.
- Poisson subsampled Rényi differential privacy. In International Conference on Machine Learning, pages 7634–7642. PMLR, 2019.
- Subsampled Rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.
- Deep learning with Gaussian differential privacy. Harvard Data Science Review, 2020(23), 2020.
- On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, pages 32–33, 2012.
- Ilya Mironov. Rényi differential privacy. 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Aug 2017. doi: 10.1109/csf.2017.11. URL http://dx.doi.org/10.1109/CSF.2017.11.
- Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3):1275–1294, 1993. ISSN 00911798. URL http://www.jstor.org/stable/2244575.
- Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255, 2020b.
- Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2865–2873, 2015.
- Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
- Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, Oct 2016. doi: 10.1145/2976749.2978318. arXiv: 1607.00133.
- Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
- Adam: A method for stochastic optimization. In ICLR (Poster), 2015.