Pre-training Differentially Private Models with Limited Public Data (2402.18752v2)
Abstract: The superior performance of large foundation models relies on massive amounts of high-quality data, which often contain sensitive, private, and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method for gauging the degree of security provided to models, its application is commonly limited to the fine-tuning stage because of the performance degradation incurred when DP is applied during pre-training. Consequently, DP is not yet capable of protecting a substantial portion of the data used in the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement. We make the key observation that the performance degradation of DP optimizers can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10\% of public data, our strategy achieves a DP accuracy of 41.5\% on ImageNet-21k (with $\epsilon=8$), as well as non-DP accuracies of 55.7\% and 60.0\% on the downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models. Our DP pre-trained models are released in the fastDP library (https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1).
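To make the two-phase recipe concrete, below is a minimal PyTorch sketch of DP continual pre-training as the abstract describes it: a standard non-DP warm-up on the limited public subset, followed by DP-SGD (per-sample gradient clipping plus Gaussian noise) on the private data. All names (`model`, `public_loader`, hyperparameters) are illustrative assumptions, not the released fastDP implementation, which computes per-sample gradients far more efficiently.

```python
import torch


def warmup_on_public(model, public_loader, loss_fn, lr=1e-3, epochs=1):
    """Phase 1: ordinary (non-DP) SGD on the limited public data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in public_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()


def dp_sgd_step(model, batch, loss_fn, lr=1e-3, clip_norm=1.0, noise_multiplier=0.5):
    """Phase 2, one step: DP-SGD with per-sample clipping and Gaussian noise.

    Per-sample gradients are obtained with a size-1 microbatch loop for
    clarity; this is slow but makes the clipping explicit.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    xs, ys = batch
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
                 for p in params]
        # Clip the full per-sample gradient to norm at most clip_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, float(clip_norm / (total_norm + 1e-6)))
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * (noise_multiplier * clip_norm)
            p.add_((s + noise) / len(xs), alpha=-lr)  # noisy averaged update
```

In practice, one would first call `warmup_on_public` on the 10\% public subset, then iterate `dp_sgd_step` over batches of the private pre-training data, with the noise multiplier chosen by a privacy accountant to meet the target $\epsilon$.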