WikiMIA-32 Benchmark for LLM Membership Inference
- The paper introduces WikiMIA-32, a temporally segmented dataset of Wikipedia event pages that yields reliable membership labels: pages predating model pretraining serve as members, pages created afterward as non-members.
- It pairs the benchmark with the Min–k% Prob method, which averages the model's largest per-token losses to decide whether a text snippet was included in a model's training data.
- Strict temporal splitting provides gold-standard labels by construction, while paraphrase augmentation tests robustness; Min–k% Prob outperforms the best baseline by a 7.4% relative AUC improvement.
WikiMIA-32 is a temporally-stratified, model-agnostic benchmark for evaluating membership inference attacks on LLMs. It was introduced to facilitate systematic and reliable detection of whether specific textual snippets were included in an LLM’s pretraining corpus, a question of central importance for auditing model behaviors in contexts involving copyright, privacy, or benchmark contamination. WikiMIA-32 provides gold-standard labels, rigorous partitioning to avoid contamination, and enables evaluation with only black-box access to models, leveraging a carefully-designed decision rule—Min–k% Prob—that exploits the statistical behavior of outlier token probabilities for robust inference (Shi et al., 2023).
1. Dataset Construction and Partitioning
WikiMIA-32 is derived exclusively from Wikipedia “Event” category pages, with construction designed to provide true positive (“member”) and true negative (“non-member”) examples.
- Temporal Split: Pages created before 2017 are designated as “members” (assumed to be safely included in pretraining for LLMs such as LLaMA, GPT-NeoX, OPT), while those created after January 1, 2023 are designated as “non-members,” guaranteeing absence from any LLM pretraining set.
- Filtering: Pages lacking narrative text—such as “List of …” or “Timeline of …”—are removed to ensure high-quality, content-rich examples.
- Sampling: 394 event pages created after January 1, 2023 were collected as non-members, and 394 pre-2017 pages were randomly sampled as members, yielding a total of 788 distinct pages.
- Length Buckets: For each page, the first $N$ tokens (tokenized using the model’s native tokenizer) are extracted, forming length-specific cohorts named “WikiMIA-$N$.” WikiMIA-32 comprises all 788 snippets truncated to $N = 32$ tokens.
- Paraphrase Augmentation: Each 32-token snippet is also semantically paraphrased using ChatGPT, resulting in 788 original and 788 paraphrased examples per bucket.
- Gold-Truth Labeling: Membership labels are determined automatically by construction—no additional human annotation is necessary.
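The construction steps above can be sketched as a small labeling function. This is an illustrative reconstruction, not the authors' released pipeline; the page records, field names, and filter prefixes here are hypothetical stand-ins.

```python
from datetime import date

# Hypothetical page records: (title, creation_date); values are illustrative only.
pages = [
    ("2016 Summer Olympics", date(2015, 9, 1)),
    ("List of 2016 events", date(2015, 3, 2)),      # non-narrative page, filtered out
    ("2023 Turkey earthquake", date(2023, 2, 6)),
]

MEMBER_CUTOFF = date(2017, 1, 1)      # created before 2017 -> member
NON_MEMBER_CUTOFF = date(2023, 1, 1)  # created on/after Jan 1, 2023 -> non-member

def label(title, created):
    """Return 1 (member), 0 (non-member), or None (drop the page)."""
    if title.startswith(("List of", "Timeline of")):  # filter non-narrative pages
        return None
    if created < MEMBER_CUTOFF:
        return 1   # assumed seen during pretraining
    if created >= NON_MEMBER_CUTOFF:
        return 0   # guaranteed unseen by pre-2023 models
    return None    # ambiguous window between cutoffs is discarded

labeled = [(t, label(t, d)) for t, d in pages if label(t, d) is not None]
```

The key design choice is that labels fall out of the creation dates alone, which is why no human annotation is needed.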
2. Evaluation Protocol and Inference Task
The core evaluation task in WikiMIA-32 is membership inference—determining, given an input snippet $x = x_1 x_2 \cdots x_{32}$ (32 tokens), whether $x$ appeared in a model’s pretraining data, under black-box access to a model that returns per-token probabilities $p(x_i \mid x_1, \ldots, x_{i-1})$.
- Label Assignment: $y \in \{0, 1\}$, with $1$ indicating “in-pretraining” (member) and $0$ “not in pretraining” (non-member).
- Metrics:
  - True Positive Rate (TPR): fraction of members correctly flagged, $\Pr(\hat{y} = 1 \mid y = 1)$
  - False Positive Rate (FPR): fraction of non-members incorrectly flagged, $\Pr(\hat{y} = 1 \mid y = 0)$
  - Area Under the ROC Curve (AUC): area under the TPR-vs-FPR curve traced as the detection threshold is swept
  - TPR@low FPR: TPR measured at a fixed low FPR
- Input Variants: Metrics are reported for both original and paraphrased examples, averaged where appropriate.
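Under the convention that higher scores indicate membership, the two headline metrics can be computed with nothing beyond the score lists. A minimal, dependency-free sketch (real evaluations would typically use a library such as scikit-learn):

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a random member outscores a random
    non-member (ties count as 1/2) — equivalent to the area under the ROC."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def tpr_at_fpr(scores_pos, scores_neg, max_fpr=0.05):
    """Best TPR achievable at FPR <= max_fpr, sweeping the threshold."""
    best = 0.0
    for t in sorted(set(scores_pos) | set(scores_neg)):
        fpr = sum(n >= t for n in scores_neg) / len(scores_neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in scores_pos) / len(scores_pos))
    return best
```

Note that for Min–k% Prob, where *lower* values indicate membership, the negated score would be passed in so that the higher-is-member convention holds.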
3. Min–k% Prob: Reference-Free Statistical Detection
WikiMIA-32 supports, and was instrumental in validating, the Min–k% Prob method for membership inference:
- Negative Log-Likelihoods: For $x = x_1 x_2 \cdots x_N$, compute per-token losses $\ell_i = -\log p(x_i \mid x_1, \ldots, x_{i-1})$.
- Outlier Amplification: For a fixed percentage $k$, calculate the average of the top $k\%$ highest $\ell_i$ values, denoted $\text{Min-}k\%(x) = \frac{1}{|S|} \sum_{i \in S} \ell_i$, where $S$ indexes the $\lceil kN/100 \rceil$ largest losses.
- Decision Rule: Membership is predicted if $\text{Min-}k\%(x) < \epsilon$ (for a threshold $\epsilon$); $\epsilon$ is swept to generate a full ROC curve, so no fixed threshold is required.
- Rationale: Unseen (non-member) snippets frequently contain outlier tokens with anomalously low token probability, inflating $\text{Min-}k\%(x)$ and improving class separation.
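The decision rule above reduces to a few lines once per-token log-probabilities are available from the black-box model. A minimal sketch (the function names are illustrative; obtaining the log-probabilities from a real LM is out of scope here):

```python
import math

def min_k_prob(token_logprobs, k=20):
    """Average negative log-likelihood over the k% least-likely tokens.

    token_logprobs: per-token log p(x_i | x_<i) from any black-box LM.
    A higher score means more surprising outlier tokens, i.e. likely non-member.
    """
    losses = sorted((-lp for lp in token_logprobs), reverse=True)  # NLLs, largest first
    n_top = max(1, math.ceil(len(losses) * k / 100))
    return sum(losses[:n_top]) / n_top

def predict_member(token_logprobs, threshold, k=20):
    # Membership is predicted when the min-k% score falls BELOW the threshold.
    return min_k_prob(token_logprobs, k) < threshold
```

With k = 100 this degenerates to the standard loss (perplexity) attack; smaller k isolates the outlier tokens that carry most of the membership signal.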
4. Baseline Methods and Comparative Performance
Five baselines (four reference-free attacks plus one reference-model attack) are implemented for comparative evaluation on WikiMIA-32 (original and paraphrased text), all using only black-box access to model token probabilities:
| Method | Description | Avg. AUC |
|---|---|---|
| Neighbor | Neighbourhood curvature attack (DetectGPT) | 0.65 |
| PPL (LOSS) | Standard perplexity/loss attack | 0.67 |
| Zlib | Perplexity compared to zlib compression entropy | 0.65 |
| Lowercase | Perplexity difference after input lowercasing | 0.61 |
| SmallerRef | Perplexity under a smaller reference model | 0.66 |
| Min–k% Prob | Top 20% of per-token losses averaged ($k = 20$) | 0.72 |
- Performance: On WikiMIA-32 (original and paraphrased, averaged over 5 LLMs), Min–k% Prob achieves an average AUC of 0.72 versus the best baseline at 0.67, representing a 0.05 absolute and 7.4% relative improvement.
- Interpretation: This suggests that Min–k% Prob’s focus on extremal token losses captures model uncertainty arising from data absence more robustly than mean probability or other heuristics.
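To make the baselines concrete, the zlib attack from the table can be sketched with the standard library: the model's loss on a snippet is normalized by the snippet's zlib-compressed length, a model-free proxy for its intrinsic entropy. This is an illustrative reconstruction of the general technique, not the paper's exact implementation.

```python
import zlib

def zlib_score(text, total_nll):
    """Zlib baseline: ratio of the model's total negative log-likelihood on
    `text` to its zlib-compressed byte length. A lower ratio means the model
    finds the text easier than its raw redundancy explains, hinting at
    membership (memorization)."""
    return total_nll / len(zlib.compress(text.encode("utf-8")))
```

As with the other attacks, the score is swept over a threshold to trace out an ROC curve rather than fixed in advance.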
5. Experimental Details and Models
- Evaluated Models: Pythia-2.8B, GPT-NeoX-20B, LLaMA-30B, LLaMA-65B, OPT-66B, all open-source and representative of contemporary LLM architectures pretrained on Wikipedia.
- Hyperparameter Selection: $k$ was tuned on a held-out LLaMA-65B validation split; $k = 20$ was selected.
- Data Preprocessing:
- Tokenization: Each model’s native tokenizer (BPE or SentencePiece).
- Snippet Extraction: First $N = 32$ tokens from each example for WikiMIA-32.
- Paraphrasing: ChatGPT, fixed “paraphrase this in English” prompt.
- Prompting: No template prompt; left-to-right token likelihoods are queried with the bare snippet.
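The extraction step amounts to encode, truncate, decode under the model-native tokenizer. A toy sketch, with a whitespace split standing in for a real BPE/SentencePiece tokenizer (the function names are illustrative):

```python
def make_snippet(text, encode, decode, n=32):
    """Truncate `text` to its first n tokens under a model-native tokenizer.

    encode/decode are stand-ins for a real tokenizer's methods; pages with
    fewer than n tokens are dropped from the length bucket.
    """
    ids = encode(text)
    if len(ids) < n:
        return None
    return decode(ids[:n])

# Toy whitespace "tokenizer" for illustration only:
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)
```

Because each model tokenizes differently, the same page yields slightly different 32-token snippets per model, which is why truncation is done with the model's own tokenizer rather than once globally.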
6. Design Principles and Update Strategy
- Accuracy: The temporal cutoff guarantees that non-member examples cannot have been present in the pretraining data of any model trained before January 1, 2023.
- Model-Agnosticism: Evaluation is supported for any LLM with Wikipedia-pretraining, with no LLM-specific adaptation required.
- Dynamic Growth: Ongoing, automated data collection via the Wikipedia public API enables continual expansion as new events are added.
- Generalizability: The paradigm is expected to remain valid for future models and datasets, provided strict temporal delineation is maintained.
7. Significance and Broader Implications
WikiMIA-32 defines a new standard for rigorous, scalable, and reliable membership inference benchmarking for LLMs. Its temporally-segmented, updatable construction, along with gold-standard labeling and support for both paraphrased and verbatim detection, establishes a strong empirical foundation for pretraining data auditing. The Min–k method substantiated on WikiMIA-32 demonstrates robust performance improvement, particularly in identifying model uncertainty associated with out-of-distribution tokens, with direct applications to privacy auditing, copyright detection, and integrity assurance for deployed generative models (Shi et al., 2023).