Scalable Extraction of Training Data from (Production) Language Models (2311.17035v1)
Abstract: This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source LLMs like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
- Andersson, O. Sequential Good-Turing and the missing species problem.
- PaLM 2 Technical Report, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
- Reconstructing training data with informed adversaries. In IEEE S&P (2022).
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, 2023.
- GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow, 2021.
- What does it mean for a language model to preserve privacy? In ACM FAccT (2022).
- Language models are few-shot learners. In NeurIPS (2020).
- Membership inference attacks from first principles. In IEEE Symposium on Security and Privacy (2022), IEEE.
- Extracting training data from diffusion models. In USENIX Security Symposium (2023).
- Quantifying memorization across neural language models. In ICLR (2023).
- The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium (2019).
- Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447 (2023).
- Extracting training data from large language models. In USENIX Security Symposium (2021).
- Chao, A. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics (1984), 265–270.
- An improved nonparametric lower bound of species richness via a modified good–turing frequency formula. Biometrics 70, 3 (2014), 671–682.
- Label-only membership inference attacks. In International conference on machine learning (2021), PMLR, pp. 1964–1974.
- Deep reinforcement learning from human preferences. NeurIPS (2017).
- Computer, T. RedPajama: An open source recipe to reproduce LLaMA training dataset, 2023.
- Computer, T. Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models, 2023.
- Model inversion attacks that exploit confidence information and basic countermeasures. In ACM Conference on Computer and Communications Security (CCS) (2015).
- Good-Turing frequency estimation without tears. Journal of quantitative linguistics 2, 3 (1995), 217–237.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
- Good, I. J. The population frequencies of species and the estimation of population parameters. Biometrika 40, 3-4 (1953), 237–264.
- Training compute-optimal large language models. In NeurIPS (2022).
- An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems 35 (2022), 30016–30030.
- Ishihara, S. Training data extraction from pre-trained language models: A survey, 2023.
- Mistral 7b, 2023.
- Deduplicating training data mitigates privacy risks in language models. ICML (2022).
- Madlad-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662 (2023).
- Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain, 2023.
- AI and Law: The Next Generation, 2023.
- Deduplicating training data makes language models better. In ACL (2022).
- Scaling data-constrained language models. arXiv preprint arXiv:2305.16264 (2023).
- OpenAI. ChatGPT: Optimizing Language Models for Dialogue, 2022.
- OpenAI. Custom instructions for ChatGPT, 2023.
- OpenAI. GPT-4 System Card. Tech. rep., Mar. 2023.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- Training language models to follow instructions with human feedback. NeurIPS (2022).
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, 2023.
- Project Zero. Vulnerability disclosure policy. https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html, 2021.
- Language Models are Unsupervised Multitask Learners. Tech. rep., OpenAI, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020).
- Multitask prompted training enables zero-shot task generalization. In ICLR (2021).
- Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (2017).
- Soldaini, L. AI2 Dolma: 3 trillion token open corpus for language model pretraining, 2023.
- Diffusion art or digital forgery? Investigating data replication in diffusion models. In CVPR (2023).
- Ecological methods. John Wiley & Sons, 2009.
- LLaMA: Open and Efficient Foundation Language Models, 2023.
- LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- TTI. Introducing Falcon 180b.
- Privacy risk in machine learning: Analyzing the connection to overfitting. In IEEE CSF (2018).
- Zelterman, D. Smooth nonparametric estimation of the quantile function. Journal of statistical planning and inference 26, 3 (1990), 339–352.
- Opt: Open pre-trained transformer language models, 2022.
- Ziegler, A. Github Copilot research recitation, 2021.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).
- Milad Nasr (48 papers)
- Nicholas Carlini (101 papers)
- Jonathan Hayase (20 papers)
- Matthew Jagielski (51 papers)
- A. Feder Cooper (32 papers)
- Daphne Ippolito (47 papers)
- Christopher A. Choquette-Choo (49 papers)
- Eric Wallace (42 papers)
- Florian Tramèr (87 papers)
- Katherine Lee (34 papers)