WikiMIA-32 Benchmark for LLM Membership Inference
- The paper introduces WikiMIA-32, a temporally segmented dataset of Wikipedia event pages that yields reliable membership labels: pages predating model pretraining serve as members, pages created afterward as non-members.
- It pairs the benchmark with the Min–k% Prob method, which averages the model's largest per-token losses to decide whether a text snippet was included in a model's training data.
- Strict temporal splitting provides gold-standard labels by construction, while paraphrase augmentation tests robustness; Min–k% Prob outperforms the best baseline by a 7.4% relative AUC improvement.
WikiMIA-32 is a temporally-stratified, model-agnostic benchmark for evaluating membership inference attacks on LLMs. It was introduced to facilitate systematic and reliable detection of whether specific textual snippets were included in an LLM’s pretraining corpus, a question of central importance for auditing model behaviors in contexts involving copyright, privacy, or benchmark contamination. WikiMIA-32 provides gold-standard labels, rigorous partitioning to avoid contamination, and enables evaluation with only black-box access to models, leveraging a carefully-designed decision rule—Min–k% Prob—that exploits the statistical behavior of outlier token probabilities for robust inference (Shi et al., 2023).
1. Dataset Construction and Partitioning
WikiMIA-32 is derived exclusively from Wikipedia “Event” category pages, with construction designed to provide true positive (“member”) and true negative (“non-member”) examples.
- Temporal Split: Pages created before 2017 are designated as “members” (assumed to be safely included in pretraining for LLMs such as LLaMA, GPT-NeoX, OPT), while those created after January 1, 2023 are designated as “non-members,” guaranteeing absence from any LLM pretraining set.
- Filtering: Pages lacking narrative text—such as “List of …” or “Timeline of …”—are removed to ensure high-quality, content-rich examples.
- Sampling: 394 event pages created after January 1, 2023 were collected as non-members, and 394 pre-2017 pages were randomly sampled as members, yielding a total of 788 distinct pages.
- Length Buckets: For each page, the first $N$ tokens (tokenized using the model’s native tokenizer) are extracted, forming length-specific cohorts named “WikiMIA-$N$.” WikiMIA-32 comprises all 788 snippets truncated to $N = 32$ tokens.
- Paraphrase Augmentation: Each 32-token snippet is also semantically paraphrased using ChatGPT, resulting in 788 original and 788 paraphrased examples per bucket.
- Gold-Truth Labeling: Membership labels are determined automatically by construction—no additional human annotation is necessary.
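The construction steps above can be sketched as a small labeling function. This is an illustrative reconstruction, not the authors' released pipeline; the page records, field names, and filter prefixes here are hypothetical stand-ins.

```python
from datetime import date

# Hypothetical page records: (title, creation_date); values are illustrative only.
pages = [
    ("2016 Summer Olympics", date(2015, 9, 1)),
    ("List of 2016 events", date(2015, 3, 2)),      # non-narrative page, filtered out
    ("2023 Turkey earthquake", date(2023, 2, 6)),
]

MEMBER_CUTOFF = date(2017, 1, 1)      # created before 2017 -> member
NON_MEMBER_CUTOFF = date(2023, 1, 1)  # created on/after Jan 1, 2023 -> non-member

def label(title, created):
    """Return 1 (member), 0 (non-member), or None (drop the page)."""
    if title.startswith(("List of", "Timeline of")):  # filter non-narrative pages
        return None
    if created < MEMBER_CUTOFF:
        return 1   # assumed seen during pretraining
    if created >= NON_MEMBER_CUTOFF:
        return 0   # guaranteed unseen by pre-2023 models
    return None    # ambiguous window between cutoffs is discarded

labeled = [(t, label(t, d)) for t, d in pages if label(t, d) is not None]
```

The key design choice is that labels fall out of the creation dates alone, which is why no human annotation is needed.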
2. Evaluation Protocol and Inference Task
The core evaluation task in WikiMIA-32 is membership inference—determining, given an input snippet $x = x_1 x_2 \cdots x_{32}$ (32 tokens), whether $x$ appeared in a model’s pretraining data, under black-box access to a model that returns per-token probabilities $p(x_i \mid x_1, \ldots, x_{i-1})$.
- Label Assignment: $y \in \{0, 1\}$, with $1$ indicating “in-pretraining” (member) and $0$ “not in pretraining” (non-member).
- Metrics:
  - True Positive Rate (TPR): fraction of members correctly flagged, $\Pr(\hat{y} = 1 \mid y = 1)$
  - False Positive Rate (FPR): fraction of non-members incorrectly flagged, $\Pr(\hat{y} = 1 \mid y = 0)$
  - Area Under the ROC Curve (AUC): area under the TPR-vs-FPR curve traced as the detection threshold is swept
  - TPR@low FPR: TPR measured at a fixed low FPR
- Input Variants: Metrics are reported for both original and paraphrased examples, averaged where appropriate.
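Under the convention that higher scores indicate membership, the two headline metrics can be computed with nothing beyond the score lists. A minimal, dependency-free sketch (real evaluations would typically use a library such as scikit-learn):

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a random member outscores a random
    non-member (ties count as 1/2) — equivalent to the area under the ROC."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def tpr_at_fpr(scores_pos, scores_neg, max_fpr=0.05):
    """Best TPR achievable at FPR <= max_fpr, sweeping the threshold."""
    best = 0.0
    for t in sorted(set(scores_pos) | set(scores_neg)):
        fpr = sum(n >= t for n in scores_neg) / len(scores_neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in scores_pos) / len(scores_pos))
    return best
```

Note that for Min–k% Prob, where *lower* values indicate membership, the negated score would be passed in so that the higher-is-member convention holds.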
3. Min–k% Prob: Reference-Free Statistical Detection
WikiMIA-32 supports, and was instrumental in validating, the Min–k% Prob method for membership inference:
- Negative Log-Likelihoods: For $x = x_1 x_2 \cdots x_N$, compute per-token losses $\ell_i = -\log p(x_i \mid x_1, \ldots, x_{i-1})$.
- Outlier Amplification: For a fixed percentage $k$, calculate the average of the top $k\%$ highest $\ell_i$ values, denoted $\text{Min-}k\%(x) = \frac{1}{|S|} \sum_{i \in S} \ell_i$, where $S$ indexes the $\lceil kN/100 \rceil$ largest losses.
- Decision Rule: Membership is predicted if $\text{Min-}k\%(x) < \epsilon$ (for a threshold $\epsilon$); $\epsilon$ is swept to generate a full ROC curve, so no fixed threshold is required.
- Rationale: Unseen (non-member) snippets frequently contain outlier tokens with anomalously low token probability, inflating $\text{Min-}k\%(x)$ and improving class separation.
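The decision rule above reduces to a few lines once per-token log-probabilities are available from the black-box model. A minimal sketch (the function names are illustrative; obtaining the log-probabilities from a real LM is out of scope here):

```python
import math

def min_k_prob(token_logprobs, k=20):
    """Average negative log-likelihood over the k% least-likely tokens.

    token_logprobs: per-token log p(x_i | x_<i) from any black-box LM.
    A higher score means more surprising outlier tokens, i.e. likely non-member.
    """
    losses = sorted((-lp for lp in token_logprobs), reverse=True)  # NLLs, largest first
    n_top = max(1, math.ceil(len(losses) * k / 100))
    return sum(losses[:n_top]) / n_top

def predict_member(token_logprobs, threshold, k=20):
    # Membership is predicted when the min-k% score falls BELOW the threshold.
    return min_k_prob(token_logprobs, k) < threshold
```

With k = 100 this degenerates to the standard loss (perplexity) attack; smaller k isolates the outlier tokens that carry most of the membership signal.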
4. Baseline Methods and Comparative Performance
Five baselines (four reference-free attacks plus one reference-model attack) are implemented for comparative evaluation on WikiMIA-32 (original and paraphrased text), all using only black-box access to model token probabilities:
| Method | Description | Avg. AUC |
|---|---|---|
| Neighbor | Neighbourhood curvature attack (DetectGPT) | 0.65 |
| PPL (LOSS) | Standard perplexity/loss attack | 0.67 |
| Zlib | Perplexity compared to zlib compression entropy | 0.65 |
| Lowercase | Perplexity difference after input lowercasing | 0.61 |
| SmallerRef | Perplexity under a smaller reference model | 0.66 |
| Min–k% Prob | Top 20% of per-token losses averaged ($k = 20$) | 0.72 |
- Performance: On WikiMIA-32 (original and paraphrased, averaged over 5 LLMs), Min–k% Prob achieves an average AUC of 0.72 versus the best baseline at 0.67, representing a 0.05 absolute and 7.4% relative improvement.
- Interpretation: This suggests that Min–k% Prob’s focus on extremal token losses captures model uncertainty arising from data absence more robustly than mean probability or other heuristics.
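To make the baselines concrete, the zlib attack from the table can be sketched with the standard library: the model's loss on a snippet is normalized by the snippet's zlib-compressed length, a model-free proxy for its intrinsic entropy. This is an illustrative reconstruction of the general technique, not the paper's exact implementation.

```python
import zlib

def zlib_score(text, total_nll):
    """Zlib baseline: ratio of the model's total negative log-likelihood on
    `text` to its zlib-compressed byte length. A lower ratio means the model
    finds the text easier than its raw redundancy explains, hinting at
    membership (memorization)."""
    return total_nll / len(zlib.compress(text.encode("utf-8")))
```

As with the other attacks, the score is swept over a threshold to trace out an ROC curve rather than fixed in advance.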
5. Experimental Details and Models
- Evaluated Models: Pythia-2.8B, GPT-NeoX-20B, LLaMA-30B, LLaMA-65B, OPT-66B, all open-source and representative of contemporary LLM architectures pretrained on Wikipedia.
- Hyperparameter Selection: $k$ was tuned on a held-out LLaMA-65B validation split; $k = 20$ was selected.
- Data Preprocessing:
- Tokenization: Each model’s native tokenizer (BPE or SentencePiece).
- Snippet Extraction: First $N = 32$ tokens from each example for WikiMIA-32.
- Paraphrasing: ChatGPT, fixed “paraphrase this in English” prompt.
- Prompting: No template prompt; left-to-right token likelihoods are queried with the bare snippet.
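The extraction step amounts to encode, truncate, decode under the model-native tokenizer. A toy sketch, with a whitespace split standing in for a real BPE/SentencePiece tokenizer (the function names are illustrative):

```python
def make_snippet(text, encode, decode, n=32):
    """Truncate `text` to its first n tokens under a model-native tokenizer.

    encode/decode are stand-ins for a real tokenizer's methods; pages with
    fewer than n tokens are dropped from the length bucket.
    """
    ids = encode(text)
    if len(ids) < n:
        return None
    return decode(ids[:n])

# Toy whitespace "tokenizer" for illustration only:
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)
```

Because each model tokenizes differently, the same page yields slightly different 32-token snippets per model, which is why truncation is done with the model's own tokenizer rather than once globally.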
6. Design Principles and Update Strategy
- Accuracy: The temporal cutoff guarantees that non-member examples cannot have been present in the pretraining data of any model trained before January 1, 2023.
- Model-Agnosticism: Evaluation is supported for any LLM with Wikipedia-pretraining, with no LLM-specific adaptation required.
- Dynamic Growth: Ongoing, automated data collection via the Wikipedia public API enables continual expansion as new events are added.
- Generalizability: The paradigm is expected to remain valid for future models and datasets, provided strict temporal delineation is maintained.
7. Significance and Broader Implications
WikiMIA-32 defines a new standard for rigorous, scalable, and reliable membership inference benchmarking for LLMs. Its temporally-segmented, updatable construction, along with gold-standard labeling and support for both paraphrased and verbatim detection, establishes a strong empirical foundation for pretraining data auditing. The Min–k method substantiated on WikiMIA-32 demonstrates robust performance improvement, particularly in identifying model uncertainty associated with out-of-distribution tokens, with direct applications to privacy auditing, copyright detection, and integrity assurance for deployed generative models (Shi et al., 2023).