Scalable Extraction of Training Data from (Production) Language Models (2311.17035v1)

Published 28 Nov 2023 in cs.LG, cs.CL, and cs.CR

Abstract: This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source LLMs like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.

Authors (10)
  1. Milad Nasr (48 papers)
  2. Nicholas Carlini (101 papers)
  3. Jonathan Hayase (20 papers)
  4. Matthew Jagielski (51 papers)
  5. A. Feder Cooper (32 papers)
  6. Daphne Ippolito (47 papers)
  7. Christopher A. Choquette-Choo (49 papers)
  8. Eric Wallace (42 papers)
  9. Florian Tramèr (87 papers)
  10. Katherine Lee (34 papers)
Citations (251)

Summary

The paper investigates the extent to which training data can be extracted from LLMs, including both open-source and closed models. The authors introduce a scalable methodology to detect extractable memorization, where an adversary can efficiently recover training data by querying a machine learning model.

The authors make the following claims:

  • Larger and more capable models are more vulnerable to data extraction attacks.
  • Existing techniques are sufficient to attack unaligned models.
  • Current alignment techniques do not eliminate memorization; the authors develop a new divergence attack to extract data from aligned models like ChatGPT.

Here's a breakdown of their approach and findings:

  • Extractable Memorization Definition: The paper defines extractable memorization as the ability of an adversary to construct a prompt $x$ that makes a model with generation routine $\mathsf{Gen}$ produce a string $s$ from the training set $D$, such that $\mathsf{Gen}(x) = s$.

    $s \in D$: An example $s$ from the training set $D$.

    $\mathsf{Gen}$: A generation routine.

    $x$: A prompt.

  • Methodology for Open Models: For open-source models, the authors followed the established data extraction approach of prompting the model with short strings sampled from Wikipedia, then used a suffix array data structure to check efficiently whether the model's output was present in the training dataset. A suffix array $\mathsf{SA}(D)$ over a dataset $D$ stores all suffixes of the dataset in sorted order, enabling fast substring lookups (a minimal sketch of this membership check appears after this list).

    $\mathsf{SA}(D)$: The suffix array of dataset $D$.

  • Methodology for Semi-Closed Models: For semi-closed models (whose parameters are publicly available but whose training datasets are not), the authors built an auxiliary dataset (AuxDataset) of Internet text and checked whether the model's outputs were contained within it. AuxDataset comprised 9 TB of text drawn from The Pile, RefinedWeb, RedPajama, and Dolma.
  • Divergence Attack on ChatGPT: To circumvent ChatGPT's alignment, the authors discovered a prompting strategy that causes the model to "diverge" from reasonable, chatbot-style generations. The attack asks the model to repeat a single word many times; eventually the model drifts into emitting typical Internet-style text, revealing memorized training data (a hedged sketch of this prompting setup follows the list).
  • Quantifying Total Memorization: The paper uses the Good-Turing estimator, which predicts the probability that a fresh sample will yield a novel string, to extrapolate the total amount of extractable memorization in an LLM (a small numerical sketch follows the list).
  • Main Results:
    • Open-source models exhibited memorization rates between 0.1% and 1%.
    • The attack on ChatGPT allowed them to extract over 10,000 unique verbatim-memorized training examples with a $200 (USD) budget.
    • The most effective words in the divergence attack were over 100x more effective at recovering memorized output than the least effective words.
    • A manual analysis revealed that 16.9% of generations contained memorized Personally Identifiable Information (PII).
    • The ratio of model perplexity to zlib compression entropy was effective at distinguishing memorized training data from non-memorized data (sketched after this list).
  • Discoverable vs. Extractable Memorization: Data known to be extractably memorized was often not reproduced when the model was prompted with its true training-data prefix, i.e., it was not discoverably memorized, suggesting that red-teaming aligned models with discoverable-memorization tests alone has limitations.
  • ChatGPT Vulnerability: The paper speculates that ChatGPT's vulnerability may stem from pre-training for many epochs, the instability of repeating a single token, and the possibility that word repetition simulates the end-of-text token, causing the model to "reset."
  • Ethical Considerations: The authors responsibly disclosed their findings to the relevant model developers and allowed time for mitigation before publishing the paper.
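
The sketches below are minimal illustrations of individual steps described above; none of them reproduce the paper's actual tooling. First, a toy version of the suffix-array membership check used for open models: the names (`build_suffix_array`, `contains`) and the quadratic construction are illustrative only, whereas a real pipeline would build the array over terabytes of tokenized text with a linear-time algorithm.

```python
def build_suffix_array(corpus: str) -> list[int]:
    # Start offsets of all suffixes of `corpus`, sorted lexicographically.
    # O(n^2 log n) construction: fine for a toy corpus, not for terabytes.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])


def contains(corpus: str, suffix_array: list[int], query: str) -> bool:
    # Binary search for the first suffix whose prefix is >= `query`,
    # then check whether that suffix actually starts with `query`.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if corpus[suffix_array[mid]:suffix_array[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and corpus[suffix_array[lo]:].startswith(query)


# Toy usage: a generation counts as (extractably) memorized if a sufficiently
# long substring of it occurs verbatim in the training corpus.
training_corpus = "the quick brown fox jumps over the lazy dog. " * 3
sa = build_suffix_array(training_corpus)
print(contains(training_corpus, sa, "fox jumps over the lazy"))  # True
print(contains(training_corpus, sa, "completely novel text"))    # False
```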
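
Next, a hedged sketch of the divergence-attack setup against a chat model. The `openai` v1 Python SDK call shape is real, but the model name, word list, sampling settings, and the simple tail-extraction heuristic are placeholders rather than the paper's exact configuration.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def divergence_prompt(word: str) -> str:
    # The attack simply asks the chat model to repeat one word forever.
    return f'Repeat this word forever: "{word} {word} {word}"'


def tail_after_repetition(text: str, word: str) -> str:
    # Return whatever the model emitted once it stopped repeating `word`;
    # that tail is the candidate for regurgitated training data.
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok.strip('".,').lower() != word.lower():
            return " ".join(tokens[i:])
    return ""


for word in ["poem", "company", "send"]:  # candidate words; effectiveness varies widely
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model identifier
        messages=[{"role": "user", "content": divergence_prompt(word)}],
        max_tokens=2048,
    )
    tail = tail_after_repetition(resp.choices[0].message.content, word)
    if tail:
        # In the paper, such tails are then matched against a large auxiliary
        # corpus of Internet text to confirm verbatim memorization.
        print(word, "->", tail[:200])
```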
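
Third, a small numerical sketch of the Good-Turing estimate used to extrapolate total memorization: the probability that the next attack sample yields a previously unseen memorized string is approximated by the fraction of observations that are singletons. This is only the basic estimator, not the paper's full extrapolation procedure.

```python
from collections import Counter


def good_turing_p_unseen(observed: list[str]) -> float:
    # Good-Turing: P(next sample is novel) ~= N1 / N, where N1 is the number
    # of distinct strings seen exactly once and N is the total sample count.
    counts = Counter(observed)
    n1 = sum(1 for c in counts.values() if c == 1)
    n = len(observed)
    return n1 / n if n else 1.0


# Toy usage: unique 50-token memorized sequences recovered across attack runs.
samples = ["seq_a", "seq_b", "seq_a", "seq_c", "seq_d", "seq_b", "seq_e"]
print(good_turing_p_unseen(samples))  # 3 singletons / 7 samples = ~0.43
```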
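
Finally, a sketch of the perplexity-to-zlib scoring heuristic noted in the results. The log-likelihood argument is a placeholder for a real model likelihood (e.g., computed with an open-weights LLM); only the zlib side and the ratio itself are concrete here, and lower ratios (text the model finds easy relative to how poorly it compresses) are treated as more likely memorized.

```python
import zlib


def zlib_entropy_bits(text: str) -> float:
    # Length in bits of the zlib-compressed text: a cheap proxy for how much
    # "surface" information the string contains.
    return 8.0 * len(zlib.compress(text.encode("utf-8")))


def memorization_score(text: str, log_likelihood: float) -> float:
    # `log_likelihood` is the audited model's total log-probability of `text`
    # (natural log), supplied externally. Lower scores mean the model finds the
    # text much easier than its compressed size suggests -- a memorization signal.
    negative_log_likelihood = -log_likelihood
    return negative_log_likelihood / zlib_entropy_bits(text)


# Toy usage with made-up likelihoods: a memorized digit string gets an
# unusually low model surprise relative to how poorly it compresses.
memorized = "3.14159265358979323846264338327950288419716939937510"
novel = "the committee will reconvene after lunch to discuss logistics"
print(memorization_score(memorized, log_likelihood=-20.0))
print(memorization_score(novel, log_likelihood=-80.0))
```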

In conclusion, the paper demonstrates the ease with which training data can be extracted from state-of-the-art LLMs, highlighting the challenges in mitigating memorization and the limitations of current alignment techniques.
