Differentially Private Synthetic Data via Foundation Model APIs 2: Text (2403.01749v2)
Abstract: Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. Much of the high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of LLMs on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text with utility competitive with state-of-the-art (SOTA) DP finetuning baselines. This underscores the feasibility of relying solely on API access to LLMs to produce high-quality DP synthetic text, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
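To illustrate the API-only generation loop the abstract describes, below is a minimal sketch of the Private Evolution (PE) framework that Aug-PE builds on (Lin et al., 2024), not the paper's actual implementation. The helper names `random_api`, `variation_api`, and `embed`, their placeholder bodies, and the noise scale `sigma` are assumptions for illustration; in Aug-PE these correspond to LLM prompting for initial drafts and paraphrases, a sentence encoder such as Sentence-BERT, and noise calibrated to the target (epsilon, delta).

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical API wrappers (placeholders, not the paper's actual prompts) ---

def random_api(n):
    # In Aug-PE this would prompt an LLM to draft n candidate texts from scratch.
    return [f"seed text {i}" for i in range(n)]

def variation_api(texts):
    # In Aug-PE this would prompt an LLM to paraphrase/rewrite each candidate.
    return list(texts)

def embed(texts, dim=64):
    # Stand-in for a sentence encoder (e.g., Sentence-BERT): hash character
    # trigrams into a fixed-size vector and L2-normalize.
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for j in range(len(t) - 2):
            vecs[i, hash(t[j:j + 3]) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)

# --- PE-style loop: private data only influences a noisy vote histogram ---

def private_evolution(private_texts, n_synthetic=8, n_iters=3, sigma=1.0):
    synthetic = random_api(n_synthetic)          # API-only initialization
    priv_emb = embed(private_texts)
    for _ in range(n_iters):
        syn_emb = embed(synthetic)
        # Each private example votes for its nearest synthetic candidate.
        dists = np.linalg.norm(priv_emb[:, None, :] - syn_emb[None, :, :], axis=-1)
        votes = np.bincount(dists.argmin(axis=1), minlength=n_synthetic).astype(float)
        # Gaussian noise on the histogram provides the DP guarantee; sigma must
        # be calibrated to the desired (epsilon, delta) via DP accounting.
        noisy = np.clip(votes + rng.normal(0.0, sigma, n_synthetic), 0.0, None)
        probs = noisy / noisy.sum() if noisy.sum() > 0 else np.full(n_synthetic, 1.0 / n_synthetic)
        # Resample promising candidates, then diversify them via the variation API.
        parents = [synthetic[i] for i in rng.choice(n_synthetic, size=n_synthetic, p=probs)]
        synthetic = variation_api(parents)
    return synthetic

print(private_evolution(["a private review", "another private note"]))
```

The sketch only conveys the control flow in which no model is trained and the private texts touch nothing but the noisy nearest-neighbor histogram; prompting details, embedding models, text-length handling, and privacy accounting live in the linked repository.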
- Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318, 2016.
- Falcon-40B: an open large language model with state-of-the-art performance. 2023.
- Large-scale differentially private BERT. arXiv preprint arXiv:2108.01624, 2021.
- Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pp. 394–403. PMLR, 2018.
- Towards private synthetic text generation. In NeurIPS 2019 Machine Learning with Guarantees Workshop, 2019.
- Differentially private bias-term only fine-tuning of foundation models. arXiv preprint arXiv:2210.00036, 2022.
- Extracting training data from large language models. USENIX Security Symposium, 2021.
- Chew, H. S. J. The use of artificial intelligence–based conversational agents (chatbots) for weight loss: scoping review and practical recommendations. JMIR Medical Informatics, 10(4):e32578, 2022.
- Measures of distance between probability distributions. Journal of mathematical analysis and applications, 138(1):280–292, 1989.
- CNN. Microsoft is bringing ChatGPT technology to Word, Excel and Outlook, 2023. URL https://www.cnn.com/2023/03/16/tech/openai-gpt-microsoft-365/index.html.
- Gaussian differential privacy. arXiv preprint arXiv:1905.02383, 2019.
- The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- Exploring the limits of differentially private deep learning with group-wise clipping. arXiv preprint arXiv:2212.01539, 2022.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- Yelp Inc. Yelp dataset, 2023. URL https://www.yelp.com/dataset.
- Harnessing large-language models to generate private synthetic text. arXiv preprint arXiv:2306.01684, 2023.
- Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.
- Illustrating reinforcement learning from human feedback (RLHF). Hugging Face Blog, 2022. https://huggingface.co/blog/rlhf.
- Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2022.
- Differentially private synthetic data via foundation model apis 1: Images. International Conference on Learning Representations (ICLR), 2024.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Deja vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, pp. 22137–22176. PMLR, 2023.
- Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 346–363. IEEE Computer Society, 2023.
- Fine-tuning language models with just forward passes. arXiv preprint arXiv:2305.17333, 2023.
- Differentially private language models for secure data sharing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4860–4873, 2022.
- MistralAI. Mixtral of experts. https://mistral.ai/news/mixtral-of-experts/, 2023.
- OpenAI. ChatGPT. https://chat.openai.com, 2022.
- OpenAI. GPT-3.5 Turbo fine-tuning and API updates, 2023a. URL https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023b.
- OpenAI. What are tokens and how to count them?, 2023c. URL https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them.
- MAUVE: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems, 34:4816–4828, 2021.
- Differentially private conditional text generation for synthetic data production. 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, 2019.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
- Predicting early psychiatric readmission with natural language processing of narrative discharge summaries. Translational psychiatry, 6(10):e921–e921, 2016.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962, 2019.
- Natural language processing: practical applications in medicine and investigation of contextual autocomplete. In Machine Learning in Clinical Neuroscience: Foundations and Applications, pp. 207–214. Springer, 2022.
- DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. Conference on Neural Information Processing Systems (NeurIPS), 2023.
- Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
- Large scale private learning via low-rank reparametrization. In International Conference on Machine Learning, pp. 12208–12218. PMLR, 2021.
- Differentially private fine-tuning of language models. International Conference on Learning Representations, 2022.
- Training private and efficient language models with synthetic data from LLMs. In NeurIPS Workshop on Socially Responsible Language Modelling Research, 2023.
- Synthetic text generation with differential privacy: A simple and practical recipe. ACL, 2023.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
Authors: Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin, Huseyin A. Inan.