Scalable Extraction of Training Data from (Production) Language Models (2311.17035v1)

Published 28 Nov 2023 in cs.LG, cs.CL, and cs.CR

Abstract: This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source LLMs like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.

Authors (10)
  1. Milad Nasr (48 papers)
  2. Nicholas Carlini (101 papers)
  3. Jonathan Hayase (20 papers)
  4. Matthew Jagielski (51 papers)
  5. A. Feder Cooper (32 papers)
  6. Daphne Ippolito (47 papers)
  7. Christopher A. Choquette-Choo (49 papers)
  8. Eric Wallace (42 papers)
  9. Florian Tramèr (87 papers)
  10. Katherine Lee (34 papers)
Citations (251)

Summary

The paper investigates the extent to which training data can be extracted from LLMs, including both open-source and closed models. The authors introduce a scalable methodology to detect extractable memorization, where an adversary can efficiently recover training data by querying a machine learning model.

The authors make the following claims:

  • Larger and more capable models are more vulnerable to data extraction attacks.
  • Existing techniques are sufficient to attack unaligned models.
  • Current alignment techniques do not eliminate memorization; the authors develop a new divergence attack to extract data from aligned models like ChatGPT.

Here's a breakdown of their approach and findings:

  • Extractable Memorization Definition: The paper defines extractable memorization as the ability of an adversary to construct a prompt $x$ that makes a model with generation routine $\mathsf{Gen}$ produce a string $s$ from the training set $D$, such that $\mathsf{Gen}(x) = s$.

    $s \in D$: An example $s$ from the training set $D$.

    $\mathsf{Gen}$: A generation routine.

    $x$: A prompt.

  • Methodology for Open Models: For open-source models, the authors followed the established data extraction approach of prompting the model with short strings sampled from Wikipedia, then used a suffix array data structure to check efficiently whether the model's output was present in the training dataset. A suffix array $\mathsf{SA}(D)$ over a dataset $D$ stores all suffixes of the dataset in sorted order, enabling fast substring lookups (a minimal sketch of this membership check appears after this list).

    $\mathsf{SA}(D)$: The suffix array of dataset $D$.

  • Methodology for Semi-Closed Models: For semi-closed models (whose parameters are publicly available but whose training datasets are not), the authors built an auxiliary dataset (AuxDataset) of Internet text and checked whether the model's outputs were contained within it. AuxDataset comprised 9 TB of text drawn from The Pile, RefinedWeb, RedPajama, and Dolma.
  • Divergence Attack on ChatGPT: To circumvent ChatGPT's alignment, the authors discovered a prompting strategy that causes the model to "diverge" from reasonable, chatbot-style generations. The attack asks the model to repeat a single word many times; eventually the model drifts into emitting typical Internet-style text, revealing memorized training data (a hedged sketch of this prompting setup follows the list).
  • Quantifying Total Memorization: The paper uses the Good-Turing estimator, which predicts the probability that a fresh sample will yield a novel string, to extrapolate the total amount of extractable memorization in an LLM (a small numerical sketch follows the list).
  • Main Results:
    • Open-source models exhibited memorization rates between 0.1% and 1%.
    • The attack on ChatGPT allowed them to extract over 10,000 unique verbatim-memorized training examples with a $200 (USD) budget.
    • The most effective words in the divergence attack were over 100x more effective at recovering memorized output than the least effective words.
    • A manual analysis revealed that 16.9% of generations contained memorized Personally Identifiable Information (PII).
    • The ratio of model perplexity to zlib compression entropy was effective at distinguishing memorized training data from non-memorized data (sketched after this list).
  • Discoverable vs. Extractable Memorization: Data known to be extractably memorized was often not reproduced when the model was prompted with its true training-data prefix, i.e., it was not discoverably memorized, suggesting that red-teaming aligned models with discoverable-memorization tests alone has limitations.
  • ChatGPT Vulnerability: The paper speculates that ChatGPT's vulnerability may stem from pre-training for many epochs, the instability of repeating a single token, and the possibility that word repetition simulates the end-of-text token, causing the model to "reset."
  • Ethical Considerations: The authors responsibly disclosed their findings to the relevant model developers and allowed time for mitigation before publishing the paper.
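
The sketches below are minimal illustrations of individual steps described above; none of them reproduce the paper's actual tooling. First, a toy version of the suffix-array membership check used for open models: the names (`build_suffix_array`, `contains`) and the quadratic construction are illustrative only, whereas a real pipeline would build the array over terabytes of tokenized text with a linear-time algorithm.

```python
def build_suffix_array(corpus: str) -> list[int]:
    # Start offsets of all suffixes of `corpus`, sorted lexicographically.
    # O(n^2 log n) construction: fine for a toy corpus, not for terabytes.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])


def contains(corpus: str, suffix_array: list[int], query: str) -> bool:
    # Binary search for the first suffix whose prefix is >= `query`,
    # then check whether that suffix actually starts with `query`.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if corpus[suffix_array[mid]:suffix_array[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and corpus[suffix_array[lo]:].startswith(query)


# Toy usage: a generation counts as (extractably) memorized if a sufficiently
# long substring of it occurs verbatim in the training corpus.
training_corpus = "the quick brown fox jumps over the lazy dog. " * 3
sa = build_suffix_array(training_corpus)
print(contains(training_corpus, sa, "fox jumps over the lazy"))  # True
print(contains(training_corpus, sa, "completely novel text"))    # False
```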
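
Next, a hedged sketch of the divergence-attack setup against a chat model. The `openai` v1 Python SDK call shape is real, but the model name, word list, sampling settings, and the simple tail-extraction heuristic are placeholders rather than the paper's exact configuration.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def divergence_prompt(word: str) -> str:
    # The attack simply asks the chat model to repeat one word forever.
    return f'Repeat this word forever: "{word} {word} {word}"'


def tail_after_repetition(text: str, word: str) -> str:
    # Return whatever the model emitted once it stopped repeating `word`;
    # that tail is the candidate for regurgitated training data.
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok.strip('".,').lower() != word.lower():
            return " ".join(tokens[i:])
    return ""


for word in ["poem", "company", "send"]:  # candidate words; effectiveness varies widely
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model identifier
        messages=[{"role": "user", "content": divergence_prompt(word)}],
        max_tokens=2048,
    )
    tail = tail_after_repetition(resp.choices[0].message.content, word)
    if tail:
        # In the paper, such tails are then matched against a large auxiliary
        # corpus of Internet text to confirm verbatim memorization.
        print(word, "->", tail[:200])
```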
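
Third, a small numerical sketch of the Good-Turing estimate used to extrapolate total memorization: the probability that the next attack sample yields a previously unseen memorized string is approximated by the fraction of observations that are singletons. This is only the basic estimator, not the paper's full extrapolation procedure.

```python
from collections import Counter


def good_turing_p_unseen(observed: list[str]) -> float:
    # Good-Turing: P(next sample is novel) ~= N1 / N, where N1 is the number
    # of distinct strings seen exactly once and N is the total sample count.
    counts = Counter(observed)
    n1 = sum(1 for c in counts.values() if c == 1)
    n = len(observed)
    return n1 / n if n else 1.0


# Toy usage: unique 50-token memorized sequences recovered across attack runs.
samples = ["seq_a", "seq_b", "seq_a", "seq_c", "seq_d", "seq_b", "seq_e"]
print(good_turing_p_unseen(samples))  # 3 singletons / 7 samples = ~0.43
```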
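
Finally, a sketch of the perplexity-to-zlib scoring heuristic noted in the results. The log-likelihood argument is a placeholder for a real model likelihood (e.g., computed with an open-weights LLM); only the zlib side and the ratio itself are concrete here, and lower ratios (text the model finds easy relative to how poorly it compresses) are treated as more likely memorized.

```python
import zlib


def zlib_entropy_bits(text: str) -> float:
    # Length in bits of the zlib-compressed text: a cheap proxy for how much
    # "surface" information the string contains.
    return 8.0 * len(zlib.compress(text.encode("utf-8")))


def memorization_score(text: str, log_likelihood: float) -> float:
    # `log_likelihood` is the audited model's total log-probability of `text`
    # (natural log), supplied externally. Lower scores mean the model finds the
    # text much easier than its compressed size suggests -- a memorization signal.
    negative_log_likelihood = -log_likelihood
    return negative_log_likelihood / zlib_entropy_bits(text)


# Toy usage with made-up likelihoods: a memorized digit string gets an
# unusually low model surprise relative to how poorly it compresses.
memorized = "3.14159265358979323846264338327950288419716939937510"
novel = "the committee will reconvene after lunch to discuss logistics"
print(memorization_score(memorized, log_likelihood=-20.0))
print(memorization_score(novel, log_likelihood=-80.0))
```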

In conclusion, the paper demonstrates the ease with which training data can be extracted from state-of-the-art LLMs, highlighting the challenges in mitigating memorization and the limitations of current alignment techniques.
