- The paper introduces Min-k, a novel reference-free method that detects pretraining data with a higher AUC than traditional membership inference attacks.
- It scores a text by the average log-likelihood of its lowest-probability tokens to distinguish between seen and unseen text.
- Case studies validate its effectiveness in identifying copyrighted material, dataset contamination, and shortcomings in machine unlearning.
This paper addresses the critical issue of opacity surrounding the pretraining data of LLMs. Given the vast scale of this data, it is highly likely to contain problematic content such as copyrighted material, personally identifiable information, and benchmark data, yet its composition is rarely disclosed. The authors formalize the pretraining data detection problem: determining whether a given text was included in an LLM's pretraining data using only black-box access to the model.
Existing Membership Inference Attack (MIA) methods, primarily developed for fine-tuning data, face significant challenges when applied to LLM pretraining:
- Unavailability of Pretraining Data Distribution: These methods often rely on reference models trained on similar data, which is impractical for LLMs due to the proprietary nature and massive scale of pretraining datasets.
- Detection Difficulty: Pretraining involves much larger datasets and often single-epoch exposure, reducing memorization compared to multi-epoch fine-tuning. Theoretical evidence suggests detection is harder with larger datasets, lower learning rates, and less frequent data occurrence.
To facilitate research into this problem, the authors introduce WikiMIA, a dynamic benchmark. WikiMIA leverages the temporal nature of Wikipedia event data, using events added before model training as "member" data and events added after training as "non-member" data. This approach ensures the non-member data is truly unseen. The benchmark supports evaluating detection methods in verbatim (original text) and paraphrase settings, as well as analyzing performance on texts of different lengths (32, 64, 128, 256 tokens). Its automated pipeline allows for continuous updates with recent data, making it suitable for evaluating new LLMs.
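The construction reduces to a temporal split around the target model's training cutoff. Below is a minimal sketch of that idea, assuming each event record carries its text and the date it was added to Wikipedia; the field names and cutoff date are placeholders, not the authors' pipeline.

```python
# Sketch of the WikiMIA-style temporal split: events added before the model's
# training cutoff become "member" candidates, later events "non-member".
from datetime import date

TRAINING_CUTOFF = date(2023, 1, 1)  # placeholder cutoff for the target model

def split_by_cutoff(events):
    """events: iterable of dicts with 'text' and 'added' (datetime.date)."""
    members, non_members = [], []
    for event in events:
        bucket = members if event["added"] < TRAINING_CUTOFF else non_members
        bucket.append(event["text"])
    return members, non_members
```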
The paper proposes a novel, reference-free detection method called Min-k. The core hypothesis is that unseen examples are more likely to contain a few outlier tokens with very low probabilities (high negative log-likelihood), whereas seen examples are less likely to exhibit such low probabilities. Min-k calculates the average log-likelihood of the k% of tokens in a given text that have the minimum probabilities (i.e., highest negative log-likelihoods).
For a sequence $x = x_1, \ldots, x_N$, the Min-k score is calculated as:

$$\text{Min-k}(x) = \frac{1}{E} \sum_{x_i \in \text{Min-K\%}(x)} \log p(x_i \mid x_1, \ldots, x_{i-1})$$
where Min-K%(x) is the set of the k% of tokens with the minimum probabilities (highest negative log-likelihoods), and E is the size of this set. A lower Min-k score (a more negative average log-likelihood over these outlier tokens) indicates that the text is more likely a non-member, and detection is performed by thresholding the score. Min-k requires no access to the pretraining corpus and no reference model. The authors found k=20 performed best on a validation set.
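To make the scoring concrete, here is a minimal sketch of the computation, assuming a Hugging Face causal LM from which token log-probabilities can be read off; the model name in the usage comment is a placeholder, and this is an illustration rather than the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_score(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-likelihood of the k% lowest-probability tokens in `text`."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # Log-probability of each token given its preceding context.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the k% of tokens with the lowest log-probability.
    n_keep = max(1, int(len(token_log_probs) * k))
    lowest = torch.topk(token_log_probs, n_keep, largest=False).values
    return lowest.mean().item()

# Usage (placeholder model): scores below a tuned threshold suggest non-member text.
# model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
# print(min_k_score("Some candidate passage ...", model, tokenizer, k=0.2))
```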
Experiments on WikiMIA using models like LLaMA, GPT-Neo, and Pythia demonstrate that Min-k consistently outperforms several baseline MIA methods, including the simple Perplexity (PPL) based approach, the Neighborhood/DetectGPT method, Zlib compression entropy, Lowercase perplexity, and a Smaller Reference model method. Min-k achieved an average AUC of 0.72, a 7.4% improvement over the best baseline (PPL). Analysis showed that detection performance improves with increasing model size and text length, supporting the intuition that larger models memorize more and longer texts offer more distinguishable information.
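For context, two of the reference-free baselines above can be sketched as follows, assuming per-token log-probabilities have already been obtained from the target model; these are simplified formulations and may differ in detail from the paper's exact setup.

```python
import math
import zlib

def ppl_score(token_log_probs):
    """Perplexity baseline: exp of the negative mean token log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def zlib_score(text, token_log_probs):
    """Zlib baseline: model log-likelihood normalized by zlib compression length."""
    return sum(token_log_probs) / len(zlib.compress(text.encode("utf-8")))
```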
The paper further explores real-world applications through case studies:
- Detecting Copyrighted Books: Min-k was used to detect snippets from the Books3 dataset within GPT-3's training data. Using a validation set of known copyrighted books and newly published books, Min-k achieved an AUC of 0.88. Applying this to 10,000 snippets from 100 Books3 books, the paper found that nearly 90% of these books had a contamination rate exceeding 50%, providing strong evidence that GPT-3 was extensively trained on copyrighted material from Books3.
- Detecting Downstream Dataset Contamination: The authors simulated dataset contamination by adding downstream task examples (BoolQ, IMDB, etc.) to the RedPajama corpus and finetuning LLaMA 7B. Min-k successfully detected these contaminants, again outperforming baselines with an average AUC of 0.86. Ablation studies confirmed the theoretical predictions that detection becomes harder with smaller learning rates and lower data occurrence frequency. Counter-intuitively, however, detecting outlier contaminants (such as downstream task examples inserted into a general corpus) became easier with increasing dataset size, likely because models trained on larger corpora still memorize such tail distributions well. For in-distribution data, detection difficulty increased with dataset size, aligning with theory.
- Privacy Auditing of Machine Unlearning: Min-k was applied to audit a LLaMA2-7B model "unlearned" to remove Harry Potter content (arXiv:2310.02238). By identifying text chunks and questions whose Min-k scores remained similar between the original and unlearned models (see the sketch after this list), the authors found instances where the unlearning failed. In story completion, the unlearned model generated completions highly similar to the original text for suspicious chunks flagged by Min-k. In question answering, the unlearned model correctly answered Harry Potter-related questions flagged by Min-k. These findings highlight that machine unlearning is imperfect and that auditing tools like Min-k are crucial for verifying compliance with privacy regulations.
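As a rough illustration of this auditing idea, the sketch below flags text chunks whose Min-k scores barely change between the original and the unlearned model; successful unlearning should instead push the unlearned model's score down toward non-member levels. It reuses the `min_k_score` helper sketched earlier, and the gap threshold is a placeholder rather than the authors' procedure.

```python
def suspicious_chunks(chunks, original_model, unlearned_model, tokenizer,
                      k=0.2, max_gap=0.1):
    """Return chunks whose Min-k score barely moved after unlearning."""
    flagged = []
    for chunk in chunks:
        s_orig = min_k_score(chunk, original_model, tokenizer, k)
        s_unl = min_k_score(chunk, unlearned_model, tokenizer, k)
        if abs(s_orig - s_unl) < max_gap:  # score barely changed: likely not forgotten
            flagged.append(chunk)
    return flagged
```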
In summary, the paper introduces a practical benchmark WikiMIA and a simple yet effective reference-free method Min-k for detecting pretraining data in LLMs. The case studies demonstrate Min-k's utility in identifying potentially problematic training data, such as copyrighted books and contaminated benchmarks, and in auditing machine unlearning processes.