
WikiMIA-32 Benchmark for LLM Membership Inference

Updated 2 February 2026
  • The paper introduces WikiMIA-32 as a temporally segmented benchmark built from Wikipedia event pages, using page creation dates to cleanly separate texts that models could have seen during pretraining from texts they could not.
  • It validates the Min–k% Prob method, which scores a snippet by the average loss of its least-likely tokens to detect whether the text appeared in a model’s training data.
  • Strict temporal splitting yields gold-standard labels by construction, and paraphrase augmentation tests robustness to non-verbatim reuse; Min–k% Prob outperforms the best baseline with a 7.4% relative AUC improvement.

WikiMIA-32 is a temporally-stratified, model-agnostic benchmark for evaluating membership inference attacks on LLMs. It was introduced to facilitate systematic and reliable detection of whether specific textual snippets were included in an LLM’s pretraining corpus, a question of central importance for auditing model behaviors in contexts involving copyright, privacy, or benchmark contamination. WikiMIA-32 provides gold-standard labels, rigorous partitioning to avoid contamination, and enables evaluation with only black-box access to models, leveraging a carefully-designed decision rule—Min–k% Prob—that exploits the statistical behavior of outlier token probabilities for robust inference (Shi et al., 2023).

1. Dataset Construction and Partitioning

WikiMIA-32 is derived exclusively from Wikipedia “Event” category pages, with construction designed to provide true positive (“member”) and true negative (“non-member”) examples.

  • Temporal Split: Pages created before 2017 are designated as “members” (assumed to be safely included in pretraining for LLMs such as LLaMA, GPT-NeoX, OPT), while those created after January 1, 2023 are designated as “non-members,” guaranteeing absence from any LLM pretraining set.
  • Filtering: Pages lacking narrative text—such as “List of …” or “Timeline of …”—are removed to ensure high-quality, content-rich examples.
  • Sampling: 394 event pages post-2023 were collected as non-members, and 394 pre-2017 pages were randomly sampled as members, yielding a total of 788 distinct pages.
  • Length Buckets: For each page, the first $N$ tokens (tokenized using the model’s native tokenizer) are extracted, forming length-specific cohorts named “WikiMIA-$N$.” WikiMIA-32 comprises all 788 snippets truncated to 32 tokens ($N = 32$).
  • Paraphrase Augmentation: Each 32-token snippet is also semantically paraphrased using ChatGPT, resulting in 788 original and 788 paraphrased examples per bucket.
  • Gold-Truth Labeling: Membership labels are determined automatically by construction—no additional human annotation is necessary.
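The construction steps above can be sketched as follows. The record format, page titles, and dates below are hypothetical stand-ins; the real pipeline queries the Wikipedia public API and uses each model's native tokenizer rather than whitespace splitting.

```python
from datetime import date

# Hypothetical page records; the real pipeline pulls these from the Wikipedia API.
pages = [
    {"title": "2015 Nepal earthquake", "created": date(2015, 4, 25),
     "text": "On 25 April 2015 a major earthquake struck Nepal near Kathmandu."},
    {"title": "List of events in 2023", "created": date(2023, 2, 1),
     "text": "2023 events include ..."},
    {"title": "2023 Turkey earthquake", "created": date(2023, 2, 6),
     "text": "On 6 February 2023 a major earthquake struck southern Turkey."},
]

def build_wikimia(pages, n_tokens=32):
    """Label pages by creation date and truncate each to its first n_tokens."""
    dataset = []
    for page in pages:
        # Filtering: drop non-narrative pages such as "List of ..." / "Timeline of ..."
        if page["title"].startswith(("List of", "Timeline of")):
            continue
        # Temporal split: pre-2017 -> member (1); 2023 or later -> non-member (0)
        if page["created"] < date(2017, 1, 1):
            label = 1
        elif page["created"] >= date(2023, 1, 1):
            label = 0
        else:
            continue  # ambiguous window: excluded from the benchmark
        # Whitespace split stands in for the model-native tokenizer here.
        snippet = " ".join(page["text"].split()[:n_tokens])
        dataset.append({"snippet": snippet, "label": label})
    return dataset
```

Labels fall out of the temporal split automatically, which is why no human annotation is needed.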

2. Evaluation Protocol and Inference Task

The core evaluation task in WikiMIA-32 is membership inference—determining, given an input snippet $x$ (32 tokens), whether $x$ appeared in a model’s pretraining data, under black-box access to the model $f_\theta$, which returns per-token probabilities $p(x_i \mid x_1, \dots, x_{i-1})$.

  • Label Assignment: $h(x, f_\theta) \in \{0, 1\}$, with $1$ indicating “in-pretraining” (member) and $0$ “not in pretraining” (non-member).
  • Metrics:
    • True Positive Rate (TPR): $\mathrm{TPR} = \frac{TP}{TP + FN}$
    • False Positive Rate (FPR): $\mathrm{FPR} = \frac{FP}{FP + TN}$
    • Area Under the ROC Curve (AUC): $\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}^{-1}(t))\,dt$
    • TPR@5%FPR: TPR measured at a fixed $\mathrm{FPR} = 5\%$
  • Input Variants: Metrics are reported for both original and paraphrased examples, averaged where appropriate.
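The metrics above can be computed from raw membership scores with a short threshold sweep. This is a minimal sketch (higher score = more member-like; for a loss-style statistic such as Min–k% Prob one would negate the score first):

```python
def roc_points(scores, labels):
    """Trace (FPR, TPR) pairs by sweeping a threshold over membership scores.

    Higher score is taken to mean "more likely a member" (label 1).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))

def tpr_at_fpr(fpr, tpr, target=0.05):
    """TPR at a fixed false-positive budget, e.g. TPR@5%FPR."""
    return max(t for f, t in zip(fpr, tpr) if f <= target)
```

A perfectly separating scorer yields AUC 1.0; a chance-level scorer hovers around 0.5.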

3. Min–k% Prob: Reference-Free Statistical Detection

WikiMIA-32 supports, and was instrumental in validating, the Min–k% Prob method for membership inference:

  • Negative Log-Likelihoods: For $x = x_1 \dots x_N$, compute per-token losses $\ell_i = -\log p(x_i \mid x_1, \dots, x_{i-1})$.
  • Outlier Amplification: For a fixed $k\%$, average the $E = \lceil (k/100) \cdot N \rceil$ highest $\ell_i$ values: $\mathrm{Min\text{-}k}(x) = \frac{1}{E} \sum_{i \in S} \ell_i$, where $S$ indexes the $E$ largest losses.
  • Decision Rule: Membership is predicted if $\mathrm{Min\text{-}k}(x) \leq \epsilon$ for a threshold $\epsilon$; $\epsilon$ is swept to generate a full ROC curve, so no fixed threshold is required.
  • Rationale: Unseen (non-member) snippets frequently contain outlier tokens with anomalously low probability, which inflates $\mathrm{Min\text{-}k}(x)$ and improves class separation.
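A minimal sketch of the scoring rule above, operating on per-token log-probabilities obtained from black-box model queries:

```python
import math

def min_k_prob(token_log_probs, k=0.20):
    """Min-k% Prob score: average negative log-likelihood of the fraction k
    of tokens with the highest loss (i.e., the lowest probability)."""
    losses = sorted((-lp for lp in token_log_probs), reverse=True)
    e = max(1, math.ceil(k * len(losses)))  # E = ceil(k * N)
    return sum(losses[:e]) / e
```

A lower score predicts membership; sweeping the threshold over scores traces out the full ROC curve.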

4. Baseline Methods and Comparative Performance

Five baselines are implemented for comparative evaluation on WikiMIA-32 (original and paraphrased text), all requiring only black-box access to model token probabilities:

Method        Description                                        Avg. AUC
Neighbor      Neighbourhood curvature attack (DetectGPT)         0.65
PPL (LOSS)    Standard perplexity/loss attack                    0.67
Zlib          Perplexity compared to zlib compression entropy    0.65
Lowercase     Perplexity difference after input lowercasing      0.61
SmallerRef    Perplexity under a smaller reference model         0.66
Min–k% Prob   Top 20% token losses averaged                      0.72
  • Performance: On WikiMIA-32 (original and paraphrased, averaged over 5 LLMs), Min–k achieves an average AUC of 0.72 versus the best baseline at 0.67, representing a 0.05 absolute and 7.4% relative improvement.
  • Interpretation: This suggests that Min–k’s focus on extremal token losses captures model uncertainty arising from data absence more robustly than mean probability or other heuristics.
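Two of the simpler baselines can be sketched directly. This is an illustrative version only; the exact normalization used in the original attacks may differ:

```python
import math
import zlib

def loss_score(token_log_probs):
    """PPL/LOSS baseline: mean negative log-likelihood (log-perplexity)."""
    return -sum(token_log_probs) / len(token_log_probs)

def zlib_score(text, token_log_probs):
    """Zlib baseline: model loss normalized by the snippet's zlib-compressed
    size, a cheap proxy for the text's intrinsic entropy."""
    return loss_score(token_log_probs) / len(zlib.compress(text.encode("utf-8")))
```

Both statistics summarize the whole sequence; Min–k% Prob differs by discarding all but the extremal token losses before averaging.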

5. Experimental Details and Models

  • Evaluated Models: Pythia-2.8B, GPT-NeoX-20B, LLaMA-30B, LLaMA-65B, OPT-66B, all open-source and representative of contemporary LLM architectures pretrained on Wikipedia.
  • Hyperparameter Selection: $k$ varied in $\{10, 20, 30, 40, 50\}$ (percent), optimized on a held-out LLaMA-65B validation split; $k = 20\%$ selected.
  • Data Preprocessing:
    • Tokenization: each model’s native tokenizer (typically BPE).
    • Snippet Extraction: first $N$ tokens of each example; $N = 32$ for WikiMIA-32.
    • Paraphrasing: ChatGPT, fixed “paraphrase this in English” prompt.
  • Prompting: No template prompt; left-to-right token likelihoods are queried with the bare snippet.
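The bare-snippet querying step can be illustrated with a stand-in model. The toy model and vocabulary below are assumptions for the sketch; a real run would query an actual causal LM (e.g., via HuggingFace Transformers) for its next-token distributions.

```python
import math

def sequence_log_probs(next_token_dist, tokens):
    """Query left-to-right per-token log-likelihoods with the bare snippet,
    no prompt template. `next_token_dist` is a black-box callable mapping a
    token prefix to a {token: probability} distribution."""
    log_probs = []
    for i in range(1, len(tokens)):
        dist = next_token_dist(tokens[:i])  # one black-box model call per position
        log_probs.append(math.log(dist[tokens[i]]))
    return log_probs

def toy_model(prefix):
    """Stand-in model: uniform next-token distribution over a 4-token vocab."""
    return {t: 0.25 for t in ("a", "b", "c", "d")}
```

The first token has no conditioning context, so scoring starts at position 2, yielding $N - 1$ per-token losses for an $N$-token snippet.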

6. Design Principles and Update Strategy

  • Accuracy: The temporal cutoff guarantees that non-member examples postdate the pretraining data of every evaluated model, so they cannot have been seen during training.
  • Model-Agnosticism: Any LLM pretrained on Wikipedia can be evaluated, with no model-specific adaptation required.
  • Dynamic Growth: Ongoing, automated data collection via the Wikipedia public API enables continual expansion as new events are added.
  • Generalizability: The paradigm is expected to remain valid for future models and datasets, provided strict temporal delineation is maintained.

7. Significance and Broader Implications

WikiMIA-32 defines a new standard for rigorous, scalable, and reliable membership inference benchmarking for LLMs. Its temporally segmented, updatable construction, along with gold-standard labeling and support for both paraphrased and verbatim detection, establishes a strong empirical foundation for pretraining data auditing. The Min–k% Prob method validated on WikiMIA-32 demonstrates a robust performance improvement, particularly in detecting the model uncertainty associated with unseen tokens, with direct applications to privacy auditing, copyright detection, and integrity assurance for deployed generative models (Shi et al., 2023).

References

Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., & Zettlemoyer, L. (2023). Detecting Pretraining Data from Large Language Models. arXiv:2310.16789.
