
Harnessing large-language models to generate private synthetic text (2306.01684v2)

Published 2 Jun 2023 in cs.LG and cs.CR

Abstract: Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information. An alternative approach, which this paper studies, is to use a sensitive dataset to generate synthetic data that is differentially private with respect to the original data, and then to train a model non-privately on the synthetic data. Doing so has several advantages: synthetic data can be reused for other tasks (including hyperparameter tuning), retained indefinitely, and shared with third parties without sacrificing privacy. However, generating private synthetic data is much harder than training a private model. To improve performance on text data, recent work has leveraged public data by starting with a pre-trained generative language model and privately fine-tuning it on sensitive data. This model can then be used to sample a DP synthetic dataset. While this strategy seems straightforward, executing it has proven problematic. Previous approaches either show significant performance loss or, as we show, have critical design flaws. In this paper we demonstrate that a proper training objective, along with tuning fewer parameters, results in excellent DP synthetic data quality. Our approach is competitive with direct DP-training of downstream classifiers in terms of performance on downstream tasks. Further, we demonstrate that our DP synthetic data is not only useful for downstream classifier training, but also for tuning those same models.
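The privacy mechanism the abstract refers to, DP-SGD, bounds each example's influence by clipping its gradient to a fixed L2 norm and then adding calibrated Gaussian noise before the parameter update. Below is a minimal, illustrative sketch of that clip-and-noise aggregation step in pure Python; the function names and the toy two-parameter gradients are ours, not the paper's implementation:

```python
import math
import random

def clip_grad(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]

def dp_sgd_aggregate(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD aggregation: clip each example's gradient, sum,
    add Gaussian noise with std = noise_multiplier * clip_norm per
    coordinate, then average over the batch."""
    clipped = [clip_grad(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    total = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [t + rng.gauss(0.0, noise_multiplier * clip_norm) for t in total]
    n = len(per_example_grads)
    return [x / n for x in noisy]

# Toy batch of two per-example gradients for a 2-parameter model.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
update = dp_sgd_aggregate(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The clipping bounds the sensitivity of the sum to any single example, which is what lets the Gaussian noise yield a (epsilon, delta)-DP guarantee; the paper's approach applies this same mechanism during fine-tuning of the generative model rather than the downstream classifier.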

Authors (5)
  1. Alexey Kurakin (19 papers)
  2. Natalia Ponomareva (22 papers)
  3. Umar Syed (19 papers)
  4. Liam MacDermed (1 paper)
  5. Andreas Terzis (23 papers)
Citations (24)