Extracting Training Data from LLMs
In their paper "Extracting Training Data from Large Language Models," Nicholas Carlini et al. investigate the potential for LLMs to memorize and inadvertently disclose training data, and explore the implications for privacy and security. The research shows that LLMs such as OpenAI's GPT-2 can be attacked to extract verbatim training data, including personally identifiable information (PII), snippets from IRC conversations, software code, and other sensitive content, even when such data appears only once in the training set.
Objectives and Approach
The authors aim to understand the memorization behavior of modern large-scale language models, identify what kinds of data are memorized, quantify the extent of memorization, and propose mitigation strategies to safeguard sensitive data. The paper focuses on GPT-2, a Transformer-based language model with 1.5 billion parameters trained on a large corpus of text scraped from the public internet.
The authors introduce an extraction attack framework composed of two primary steps:
- Text Generation: They generate a large number of text samples from the model using several strategies, including top-n (top-k) sampling, sampling with a decaying temperature, and conditioning the model on text scraped from the internet (see the sketch after this list).
- Membership Inference: They rank and filter the generated samples with membership-inference metrics to surface those most likely to contain memorized training data. Metrics include the model's own perplexity, the ratio of that perplexity to the perplexity of a smaller reference model, a comparison against zlib compression entropy, and perplexity on canonicalized (e.g., lowercased) versions of the text.
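To make the two steps concrete, here is a minimal sketch of the pipeline using the Hugging Face `transformers` library. This is not the authors' code: the checkpoints, sample counts, sampling parameters, and the exact normalization of the zlib metric are illustrative assumptions. It samples text from GPT-2 XL with top-n sampling and then ranks the samples by membership-inference scores, treating low ratios as candidates for possible memorization.

```python
# Sketch of the two-step extraction pipeline (illustrative, not the paper's code).
# Step 1: sample candidate text from GPT-2 XL with top-n (top-k) sampling.
# Step 2: rank candidates with membership-inference metrics; low ratios are
#         treated here as "more likely memorized".
import math
import zlib

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
large = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device).eval()   # attacked model
small = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()      # reference model

@torch.no_grad()
def perplexity(model, text):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=256).to(device)
    loss = model(**enc, labels=enc["input_ids"]).loss   # mean per-token negative log-likelihood
    return torch.exp(loss).item()

@torch.no_grad()
def generate_samples(n_samples=10, max_length=256):
    """Step 1: unconditional generation seeded only with the BOS token."""
    start = torch.full((n_samples, 1), tokenizer.bos_token_id, dtype=torch.long, device=device)
    out = large.generate(start, do_sample=True, top_k=40, max_length=max_length,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.batch_decode(out, skip_special_tokens=True)

def score(text):
    """Step 2: membership-inference metrics used to rank candidates."""
    ppl_xl = perplexity(large, text)
    ppl_small = perplexity(small, text)
    zlib_bytes = len(zlib.compress(text.encode("utf-8")))
    return {
        "ppl_xl": ppl_xl,
        "ratio_small": ppl_xl / ppl_small,            # low: XL unusually confident vs. the small model
        "ratio_zlib": math.log(ppl_xl) / zlib_bytes,  # rough zlib-entropy comparison
    }

candidates = sorted(generate_samples(), key=lambda t: score(t)["ratio_small"])
for text in candidates[:3]:   # inspect the most suspicious samples manually
    print(score(text), text[:80])
```

In practice the most suspicious candidates still require manual review (or a search against the training corpus) to confirm that they are genuinely memorized rather than merely fluent.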
Key Findings
By probing the model, the researchers extracted 604 distinct memorized training examples from a pool of 1,800 manually reviewed candidates, including sensitive data. The memorized content spans news headlines, log files, personal information, internet addresses, and code snippets. Notably, even high-entropy sequences such as UUIDs and base64-encoded data were memorized verbatim.
Numerical Insights
Some significant findings from the paper include:
- A strong positive correlation between model size and the extent of memorization: the larger the model, the more data it memorizes.
- For instance, GPT-2 XL (1.5 billion parameters) memorized many more sequences than smaller variants such as GPT-2 Medium (345 million parameters) and GPT-2 Small (124 million parameters).
- Memorized sequences included high-entropy data, such as 87-character passwords that appeared exactly once in the training data yet were still extractable.
- In other cases, only a few repetitions of a sequence within the training data (as few as 33 instances) were sufficient for it to be memorized.
Practical and Theoretical Implications
The practical implications of these findings are considerable. The ability of an attacker to extract sensitive data from LLMs has profound consequences for user-facing applications built on such models, including chatbots and auto-complete tools. Extracted data can lead to privacy breaches, leakage of confidential information, and violations of user trust.
Theoretically, this research challenges the presumption that avoiding overfitting (i.e., maintaining a minimal train-test gap) inherently prevents memorization. The results indicate that even models with minimal overfitting can retain and expose training data under specific prompting conditions.
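As a rough illustration of what "specific prompting conditions" can look like, the probe below is a hypothetical helper, not the paper's evaluation code: it prompts the model with a prefix of a candidate document and checks whether greedy decoding reproduces the true continuation verbatim. The prefix and suffix lengths are arbitrary choices, and a match signals memorization regardless of how small the model's aggregate train-test gap is.

```python
# Hypothetical memorization probe: prompt the model with a document prefix and
# test whether greedy decoding reproduces the true continuation verbatim.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

@torch.no_grad()
def is_memorized(document, prefix_tokens=50, suffix_tokens=50):
    ids = tokenizer(document, return_tensors="pt")["input_ids"][0]
    if len(ids) < prefix_tokens + suffix_tokens:
        return False                                   # too short for this probe
    prefix = ids[:prefix_tokens].unsqueeze(0)
    true_suffix = ids[prefix_tokens:prefix_tokens + suffix_tokens]
    out = model.generate(prefix, do_sample=False,      # greedy decoding
                         max_length=prefix_tokens + suffix_tokens,
                         pad_token_id=tokenizer.eos_token_id)
    return torch.equal(out[0, prefix_tokens:], true_suffix)
```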
Mitigations and Future Directions
The research recommends multiple strategies to mitigate privacy risks:
- Differential Privacy (DP): Adopting DP in the training process (e.g., DP-SGD) can provide formal guarantees against data leakage, but it may also degrade model performance and increase training complexity; a minimal DP-SGD sketch follows this list.
- Enhanced Data Sanitization: Proactively identifying and removing sensitive information prior to training can reduce risk, though it may not fully prevent leakage.
- Robust De-duplication: Applying de-duplication at a finer granularity, for example over substrings or token windows rather than whole documents, to limit repeated content in the training data; a toy de-duplication sketch also follows this list.
- Auditing and Verifiable Models: Regularly auditing trained models for memorization, so that privacy claims can be verified rather than assumed.
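The following is a minimal sketch of DP-SGD-style training, not the paper's setup: per-example gradients are obtained through single-example microbatches, clipped, summed, noised, and averaged before the update. The clipping norm, noise multiplier, learning rate, and the use of the small `gpt2` checkpoint are illustrative assumptions; a real system would use a library such as Opacus together with proper privacy accounting.

```python
# Sketch of DP-SGD for a causal LM (illustrative; hyperparameters are not tuned).
# Per-example gradients are computed with microbatches of size 1, clipped to
# CLIP_NORM, summed, noised, and averaged before the optimizer step.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

CLIP_NORM = 1.0          # L2 bound on each example's gradient (assumed value)
NOISE_MULTIPLIER = 1.0   # noise std relative to CLIP_NORM (assumed value)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

def dp_sgd_step(texts):
    """One DP-SGD step over a logical batch, processed one example at a time."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        model.zero_grad()
        loss = model(**enc, labels=enc["input_ids"]).loss
        loss.backward()
        # Clip this example's full gradient to CLIP_NORM before accumulating it.
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params if p.grad is not None))
        scale = min(1.0, CLIP_NORM / (total_norm.item() + 1e-6))
        for acc, p in zip(summed, params):
            if p.grad is not None:
                acc.add_(p.grad, alpha=scale)
    # Add Gaussian noise to the summed, clipped gradients and average them.
    for acc, p in zip(summed, params):
        noise = torch.normal(0.0, NOISE_MULTIPLIER * CLIP_NORM, size=acc.shape)
        p.grad = (acc + noise) / len(texts)
    optimizer.step()

dp_sgd_step(["first training sentence", "second training sentence"])
```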
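And here is a toy sketch of finer-grained de-duplication; the function names, window size, and whitespace tokenization are assumptions made for brevity. It hashes fixed-size token windows and drops any document that repeats a window already seen earlier in the corpus. Production pipelines typically use suffix arrays or MinHash over real tokenizations and remove only the duplicated spans rather than whole documents.

```python
# Toy window-hash de-duplication: reject a document if any of its 50-token
# windows duplicates a window from an earlier document.
import hashlib

WINDOW = 50  # tokens per window; illustrative granularity

def window_hashes(tokens, window=WINDOW):
    """Yield a hash for every contiguous window of `window` tokens."""
    for i in range(max(1, len(tokens) - window + 1)):
        chunk = " ".join(tokens[i:i + window])
        yield hashlib.sha1(chunk.encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Keep only documents that share no token window with earlier documents."""
    seen, kept = set(), []
    for doc in documents:
        hashes = list(window_hashes(doc.split()))
        if any(h in seen for h in hashes):
            continue  # document repeats earlier content; drop it
        seen.update(hashes)
        kept.append(doc)
    return kept
```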
Conclusion
Carlini et al.'s paper exposes critical vulnerabilities in current LLM architectures and training methodologies, urging a reassessment of privacy assurances in models handling sensitive data. Given the rapid growth in model sizes and capabilities, addressing these privacy risks is essential for the responsible deployment of AI technologies in real-world applications.