
TOFU & WMDP: LLM Unlearning Benchmarks

Updated 10 February 2026
  • TOFU and WMDP benchmarks are evaluation frameworks that assess machine unlearning by defining clear forget/retain splits using synthetic and real hazardous data.
  • TOFU employs controlled synthetic profiles to measure residual factual knowledge, while WMDP uses domain-specific multiple-choice tasks to probe safety risks.
  • Robust metrics, including utility-accuracy trade-offs and adversarial probing, reveal the challenging balance between effective forgetting and maintaining general model performance.

TOFU and WMDP Benchmarks

TOFU (Task of Fictitious Unlearning) and WMDP (Weapons of Mass Destruction Proxy) are the two principal benchmarks for evaluating machine unlearning in LLMs. They probe complementary aspects of the problem: selective, high-precision factual erasure (TOFU) and the removal of hazardous or sensitive capabilities under realistic, adversarial conditions (WMDP). Both benchmarks have become foundational for robust, reproducible, and nuanced assessment of unlearning algorithms in the current literature.

1. Benchmark Definitions and Design Objectives

TOFU was introduced to rigorously quantify the extent to which fine-tuned LLMs can be made to behave as if they had never learned a target subset of data, with full knowledge of exactly what was injected post-pretraining (Maini et al., 2024). Its dataset comprises 200 synthetic author profiles, each with 20 question–answer pairs, generated such that the presence or absence of any specific fact is unambiguous. “Forget sets” vary: 1%, 5%, or 10% subsets of authors, with all corresponding Q&A pairs; the remainder is the “retain set.” The model is first finetuned on the full dataset, then subjected to unlearning interventions so that it forgets all information about the designated forget set, while retaining performance on the retain set and demonstrating minimal collateral forgetting on two out-of-domain (OOD) evaluation pools: Q&As about real authors, and generic world-facts.
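The split protocol above can be sketched as follows. This is an illustrative helper, not TOFU's actual loader; the key property it encodes is that all Q&A pairs of a forgotten author move into the forget set together.

```python
import random

def tofu_split(num_authors=200, qas_per_author=20, forget_frac=0.05, seed=0):
    """Partition synthetic author profiles into forget/retain sets.

    Mirrors the TOFU protocol: a forget set of 1%, 5%, or 10% of the
    200 authors, with *all* Q&A pairs of a forgotten author moving
    together (a fact is never split across the two sets).
    """
    rng = random.Random(seed)
    authors = list(range(num_authors))
    rng.shuffle(authors)
    n_forget = round(num_authors * forget_frac)
    forget_authors = set(authors[:n_forget])
    forget, retain = [], []
    for a in range(num_authors):
        qa_ids = [(a, q) for q in range(qas_per_author)]
        (forget if a in forget_authors else retain).extend(qa_ids)
    return forget, retain

forget, retain = tofu_split(forget_frac=0.05)
# 5% of 200 authors = 10 authors, i.e. 200 Q&A pairs in the forget set
```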

WMDP, in contrast, uses real hazardous-knowledge corpora—domains such as biosecurity, cybersecurity, and chemical synthesis—to pose a more capability-centric unlearning challenge. The core task comprises thousands of domain-specific multiple-choice questions, each probing knowledge that, if retained, could constitute a regulatory or safety risk. The objective is to reduce accuracy on these “harmful” tasks to chance level (e.g., 0.25 on 4-way MCQ), without sacrificing general reasoning utility as measured on broad benchmarks like MMLU (Pal et al., 14 Apr 2025, Sanyal et al., 1 Feb 2025).

These benchmarks collectively stress the dual desiderata of (i) precise, selective forgetting and (ii) broad utility and robustness retention.

2. Dataset Structure and Task Formats

TOFU’s entire corpus is synthetic and controlled. Each author profile has 20 Q&As, with all author names spelled out to minimize ambiguity, and with facts never appearing in any pretraining set. Task format is free-form: models must generate plausible, faithful answers—enabling fine-grained measurement of residual factual knowledge and generative “refusal,” e.g., outputting “I don’t know.”
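To make the task format concrete, here is a hypothetical record in the spirit of a TOFU profile. The author name, field names, and refusal check below are all illustrative, not the benchmark's actual schema:

```python
# Hypothetical record layout for one synthetic TOFU-style profile.
profile = {
    "author": "Aveline Marquez",   # fictitious name, always spelled in full
    "qa_pairs": [
        {
            "question": "What genre does Aveline Marquez primarily write in?",
            "answer": "Aveline Marquez primarily writes historical fiction.",
        },
        # ... 19 more Q&A pairs about the same fictitious author
    ],
}

def refusal(answer: str) -> bool:
    """Crude check for a generative refusal such as 'I don't know'."""
    return "i don't know" in answer.lower()
```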

WMDP operates on real-world, high-risk content. The forget set is a fixed set (e.g., 600 documents or ~1200 domain MCQs per domain), paired with a complementary utility-retain set for training and evaluation (often drawn from MMLU or unrelated text corpora). All WMDP tasks are multiple-choice, often requiring discrimination between benign and hazardous associations—making it difficult to simply “forget” without utility collapse, given the entanglement of facts (Pal et al., 14 Apr 2025).

Both benchmarks explicitly define forget/retain splits, facilitate randomized or heuristic coreset selection, and stratify test sets to measure short-range (in-domain) and long-range (out-of-domain) collateral forgetting (Pal et al., 14 Apr 2025).

3. Evaluation Metrics and Protocols

TOFU’s framework is multidimensional, with orthogonal axes for Model Utility and Forget Quality (Maini et al., 2024):

  • Model Utility is computed as the harmonic mean of normalized answer probability, ROUGE-L recall, and “truth ratio” (relative ranking of the correct completion over GPT-4-generated paraphrases and plausible but wrong distractors), across the retain, related, and distant test sets.
  • Forget Quality measures the statistical indistinguishability (via a two-sample Kolmogorov–Smirnov test on truth-ratio distributions) between the unlearned model and a “retain-only” model on the forget set; a high p-value (> 0.05), meaning the two distributions cannot be told apart, is the target of perfect unlearning.
  • Additional metrics: extraction strength, keyword accuracy/confidence, and model utility decay are often computed, particularly in recent works employing adversarial probing and privacy attacks (Reisizadeh et al., 7 Nov 2025, Rybak et al., 5 Feb 2026).
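The two headline TOFU quantities can be sketched as below. This assumes all utility inputs are pre-normalized to [0, 1]; in practice the KS p-value for Forget Quality is computed with a library routine such as `scipy.stats.ks_2samp`, so only the KS statistic itself is shown here:

```python
def model_utility(prob, rouge_l, truth_ratio):
    """Harmonic mean of the three normalized scores on a test set,
    following the TOFU aggregation (sketch; inputs in [0, 1])."""
    scores = [prob, rouge_l, truth_ratio]
    if min(scores) == 0:
        return 0.0  # the harmonic mean is pulled to zero by any zero score
    return len(scores) / sum(1.0 / s for s in scores)

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of the two samples. Forget Quality turns this
    into a p-value; p > 0.05 means the unlearned model's truth-ratio
    distribution is indistinguishable from a retain-only model's."""
    grid = sorted(set(xs) | set(ys))
    def ecdf(sample, t):
        return sum(1 for v in sample if v <= t) / len(sample)
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in grid)
```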

WMDP relies on forgetting efficacy as a drop in domain-specific MCQ accuracy, with the ideal being a reduction to random chance, and general utility measured as MMLU accuracy retention. Many works supplement this with membership-inference attacks, document-level memorization, perplexity ratios, and privacy-leak metrics (e.g., GPT-judge privacy score), as well as robust evaluation via adversarial and probabilistic sampling (Sanyal et al., 1 Feb 2025, Anjarlekar et al., 8 Aug 2025, Rybak et al., 5 Feb 2026).
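A minimal sketch of the two headline WMDP numbers follows; the normalization scheme and the example accuracies are illustrative, not the benchmark's official formula:

```python
def wmdp_unlearning_scores(acc_before, acc_after, mmlu_before, mmlu_after,
                           n_choices=4):
    """Normalized forgetting (1.0 = hazardous-MCQ accuracy reduced all
    the way to chance, 0.0 = no forgetting) and general-utility
    retention on MMLU. Normalization here is illustrative only."""
    chance = 1.0 / n_choices
    forgetting = (acc_before - acc_after) / max(acc_before - chance, 1e-9)
    forgetting = max(0.0, min(1.0, forgetting))
    utility_retention = mmlu_after / mmlu_before
    return forgetting, utility_retention

# Hypothetical numbers: hazardous accuracy 0.65 -> 0.25, MMLU 0.60 -> 0.57
f, u = wmdp_unlearning_scores(0.65, 0.25, 0.60, 0.57)
```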

Both benchmarks are now integrated into unified meta-frameworks such as OpenUnlearning (Dorna et al., 14 Jun 2025), which offers over a dozen standardized metrics for direct, reproducible comparison.

4. Unlearning Algorithms and Baselines

TOFU and WMDP have driven the development and benchmarking of a diverse array of unlearning algorithms:

  • Optimization-based methods: Gradient Ascent, Gradient Difference, KL-minimization, Negative Preference Optimization (NPO), Representation Misdirection Unlearning (RMU), and I-don't-know (IDK) preference/loss formulations. These algorithms typically operate by reducing the model’s likelihood or representation similarity with the forget set, sometimes under explicit utility constraints on the retain set (Maini et al., 2024, Pal et al., 14 Apr 2025).
  • Post-hoc and model-agnostic approaches: ALU (Agentic LLM Unlearning) employs a multi-agent, retrain-free system with vanilla, AuditErase, Critic, and Composer agents, using prompt engineering and ensemble scoring to “sanitize” outputs without model weight updates (Sanyal et al., 1 Feb 2025).
  • Parameter-efficient and adaptive adapters: LoRA/FILA/VILA frameworks assign low-rank adapters or filter parameters by Fisher information or gradient ratio, allowing scalable, efficient, and modular updates (Kim et al., 29 Aug 2025, Anjarlekar et al., 8 Aug 2025).
  • Evolutionary and automatic search: EvoMU synthesizes novel unlearning objectives via LLM-driven evolutionary search, demonstrating task-specific gains by optimizing for the unique structure of each benchmark (Batorski et al., 2 Feb 2026).
  • Federated/distributed techniques: FULM demonstrates hierarchical federated learning that supports unlearning under continual, decentralized, and heterogeneous data silos (Zhong et al., 19 Oct 2025).

Baselines in both benchmarks typically include a full retrain (the “gold standard” oracle), optimization baselines (IDK, Gradient Difference, NPO, RMU), and, increasingly, adversarial meta-attacks (discussed further below).
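Of these baselines, Gradient Difference is the simplest to sketch: the combined objective ascends on the forget-set loss while descending on the retain-set loss. The toy scalar example below (quadratic losses, plain SGD) is purely illustrative:

```python
def gd_loss(nll_forget, nll_retain, retain_weight=1.0):
    """Gradient Difference objective (sketch): minimizing this loss
    ascends on the forget-set NLL (making the model worse on the
    forget data) while descending on the retain-set NLL."""
    return -nll_forget + retain_weight * nll_retain

def gd_step(theta, grad_forget, grad_retain, lr=0.1, retain_weight=1.0):
    """One SGD step on the combined objective: move *up* the forget
    gradient and *down* the retain gradient."""
    return theta - lr * (-grad_forget + retain_weight * grad_retain)

# Toy scalar model: forget loss (theta - 1)^2, retain loss theta^2.
theta = 0.5
theta = gd_step(theta, grad_forget=2 * (theta - 1), grad_retain=2 * theta)
# theta moves away from the forget optimum (1.0) toward the retain one (0.0)
```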

5. Empirical Results and Key Insights

The dominant insight from both benchmarks is the stubborn trade-off between effective forgetting and preservation of model utility. For TOFU, no baseline method achieved statistically perfect forgetting (KS-p > 0.05) on even the smallest (1%) forget sets without severely degrading utility, and performance always declined on OOD knowledge, evidencing strong knowledge entanglement (Maini et al., 2024, Dorna et al., 14 Jun 2025).

WMDP results similarly show that methods which drive hazardous-task accuracy to random (e.g., RMU, aggressive KL, or OBLIVIATE’s masking) generally incur utility losses on MMLU or exhibit poor fluency. The coreset effect shows that on WMDP, unlearning is surprisingly data-efficient: randomly selecting 5% of forget documents suffices for nearly the same unlearning effect as the full forget set, a phenomenon attributed to the dominance of high-impact keywords (Pal et al., 14 Apr 2025).
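Such coreset selection reduces to plain random sampling. The sketch below assumes the forget set is just a list of document identifiers:

```python
import random

def sample_coreset(forget_docs, frac=0.05, seed=0):
    """Random coreset selection (sketch): per the coreset effect on
    WMDP, unlearning on a small random fraction of the forget
    documents reportedly approaches the effect of the full set."""
    rng = random.Random(seed)
    k = max(1, round(len(forget_docs) * frac))
    return rng.sample(forget_docs, k)

coreset = sample_coreset(list(range(600)), frac=0.05)  # 30 of 600 documents
```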

Recent works such as ALU demonstrate that multi-agent, non-gradient post-hoc techniques can robustly suppress forbidden knowledge, sustaining low leakage rates even as the number of targets or adversarial prompts is scaled to thousands (Sanyal et al., 1 Feb 2025). This robustness is fragile, however: meta-evaluations using Leak@k and REBEL evolutionary prompts demonstrate that probabilistic or adaptive sampling quickly exposes latent memorization. Even exact unlearning by retraining is vulnerable to “difference attacks” leveraging both pre- and post-unlearning checkpoints (Reisizadeh et al., 7 Nov 2025, Wu et al., 30 May 2025, Rybak et al., 5 Feb 2026).

6. Attack Surfaces and Robustness Evaluation

TOFU and WMDP serve as primary testbeds for adversarial robustness analysis of unlearning mechanisms. Leakage under diverse threat models is now rigorously quantified:

  • Leak@k metrics (Reisizadeh et al., 7 Nov 2025) demonstrate that as the number of sampled model outputs increases, or as sampling parameters (temperature, top-p) are made realistic, even retrained and state-of-the-art unlearned models frequently leak forbidden knowledge.
  • Evolutionary adversarial prompts (as in REBEL (Rybak et al., 5 Feb 2026)) can recover up to 60% of TOFU and 93% of WMDP “forgotten” knowledge, outperforming naive stochastic sampling and static attack baselines.
  • Model-differencing attacks illustrate that exact unlearning is not absolute: an attacker with access to both pre- and post-unlearning networks can recover forgotten continuations at substantially increased rates relative to chance (Wu et al., 30 May 2025).
  • Metric faithfulness work (OpenUnlearning (Dorna et al., 14 Jun 2025)) shows that surface-level metric drops (ROUGE-L, EM, extraction strength) can be gamed, and stress testing (e.g., quantization/relearning) is necessary to assess true erasure.

A plausible implication is that truly robust machine unlearning now requires comprehensive multi-view assessment: deterministic (“greedy”) and probabilistic generations, adversarial evolution, and consideration of potentially compromised checkpoints.
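A Leak@k-style metric can be sketched as below, under the assumption that each prompt comes with a list of sampled generations and a caller-supplied predicate decides what counts as a leak (both are illustrative, not the paper's exact procedure):

```python
def leak_at_k(samples_per_prompt, is_leak, k):
    """Leak@k (sketch): a prompt counts as leaked if *any* of its first
    k sampled generations reveals the forgotten content; the metric is
    the fraction of prompts leaked. Greedy decoding corresponds to
    k = 1; larger k with realistic temperature/top-p sampling exposes
    latent memorization that greedy evaluation misses."""
    leaked = sum(
        any(is_leak(g) for g in gens[:k]) for gens in samples_per_prompt
    )
    return leaked / len(samples_per_prompt)

# Two prompts, two sampled generations each (toy data).
samples = [["benign answer", "SECRET FACT"], ["benign answer", "benign answer"]]
is_secret = lambda g: "SECRET" in g
```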

7. Comparative Significance and Extensions

TOFU and WMDP have catalyzed the standardization and advancement of LLM unlearning evaluation:

  • TOFU excels in diagnosing fine-grained memory and entanglement effects arising from controlled, synthetic injections. WMDP is preferred for probing “real” capability deletion, adversarial defense, and practical safety/utility trade-offs.
  • Unified benchmarking platforms (e.g., OpenUnlearning (Dorna et al., 14 Jun 2025)) allow head-to-head evaluation of algorithms, aggregation of metric suites, and systematic ablation and stress-testing—accelerating progress and reproducibility.
  • The emergence of strong coreset effects (Pal et al., 14 Apr 2025), quantization of utility-forgetting trade-offs, and cross-domain (federated/adaptive/parameter-efficient) methods suggest that the main frontier is now reliable, robust, and scalable deployment of unlearning—especially in the presence of adversarial or “gray-box” threat models.

TOFU and WMDP thus remain essential, complementary pillars for methodology, comparison, and certification in machine unlearning research for LLMs.
