RippleBench-Bio: Evaluating Unlearning Effects
- RippleBench-Bio is a large-scale benchmark that quantifies direct and indirect ripple effects of model-editing (unlearning) interventions in biology-focused large language models.
- It employs a systematic pipeline using semantic extraction, FAISS-indexed Wikipedia neighbors, and interval-based bucketization to generate robust multiple-choice evaluation sets.
- Experimental results highlight tradeoffs among unlearning methods, with variations in immediate accuracy drops and long-distance recovery, informing safe and targeted model interventions.
RippleBench-Bio is a large-scale benchmark designed to quantify the ripple effects of model-editing or unlearning interventions in LLMs, particularly in the biology domain. Building on the RippleBench-Maker toolchain, it expands a biothreat-focused seed set (WMDP-Bio) into a multi-distance, semantically grounded suite of multiple-choice evaluation questions. It enables researchers to empirically measure knowledge loss both at direct targets and across semantically adjacent and distant concepts, thereby providing a rigorous testbed for evaluating the specificity and side-effect profiles of state-of-the-art unlearning methods (Rinberg et al., 3 Dec 2025).
1. Construction Methodology
RippleBench-Bio is generated through an end-to-end pipeline, RippleBench-Maker, which systematically captures and structures the propagation of model edits:
- Topic Extraction: The process begins with a seed dataset (WMDP-Bio: 1,273 dual-use biology questions). Each question is mapped to a canonical Wikipedia topic via an instruction-tuned LLM such as Llama 3-8b-Instruct (e.g., mapping “mechanism of anthrax toxin production” to Bacillus anthracis).
- Semantic-Distance Assignment and Ordering: Topics are expanded using a FAISS-indexed dump of English Wikipedia (April 10, 2025) with BAAI/bge-base embeddings. The WikiRAG API retrieves the top 1,000 nearest articles per seed, ranking them by embedding similarity. The semantic distance of a neighbor $c_i$ is simply its rank $i$ in the WikiRAG result list for the seed topic $s$.
- Bucketization: The 1,000 neighbors are grouped into 200 non-overlapping bins of width 5, forming intervals such as [1–5], [6–10], ..., [996–1000]; a retrieval-and-bucketization sketch follows this list.
- Fact and Question Generation: For each neighbor $c_i$, key factual statements are extracted from its Wikipedia article and transformed by LLM prompting into verifiable, five-choice multiple-choice questions (four distractors per question). Spurious mappings and policy refusals are filtered out (≈1% of cases).
- Model Evaluation and Ripple Calculation: Both the baseline model $M$ and the edited model $M'$ are evaluated on the resulting full question set, with per-question utilities $u_M(q), u_{M'}(q) \in \{0, 1\}$ (1 if correct, 0 if incorrect).
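To make the retrieval-and-bucketization step concrete, below is a minimal sketch assuming a prebuilt FAISS index over BAAI/bge-base embeddings of Wikipedia articles; the function names, the `BAAI/bge-base-en-v1.5` checkpoint identifier, and the `titles` lookup list are illustrative assumptions, not the RippleBench-Maker API.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding checkpoint; the paper specifies BAAI/bge-base embeddings.
encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")

def retrieve_neighbors(seed_topic: str, index: faiss.Index, titles: list, k: int = 1000) -> list:
    """Return the k nearest Wikipedia articles to a seed topic, ordered by embedding similarity."""
    query = encoder.encode([seed_topic], normalize_embeddings=True)
    _, ranks = index.search(np.asarray(query, dtype="float32"), k)
    # The 1-indexed position in this list is the neighbor's semantic distance.
    return [titles[i] for i in ranks[0]]

def bucketize(ranked_neighbors: list, bin_width: int = 5) -> dict:
    """Group ranked neighbors into non-overlapping distance bins: [1-5], [6-10], ..."""
    return {
        (start + 1, start + bin_width): ranked_neighbors[start:start + bin_width]
        for start in range(0, len(ranked_neighbors), bin_width)
    }
```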
2. Semantic Distance and Evaluation Structure
Semantic distance in RippleBench-Bio is precisely defined as $d(c_i, s) = i$:
- where $c_i$ is the $i$-th result in WikiRAG retrieval for the seed topic $s$.
- Bins $B_k = [5(k-1)+1,\; 5k]$, for $k = 1, \dots, 200$, categorize concepts by retrieval rank.
- Each evaluation set $Q_k$ contains questions derived from concepts in bin $B_k$.
This binning stratifies question sets by how semantically proximal or distal they are from the original unlearning target, allowing side-effect quantification at fine semantic granularity.
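As a worked example of this binning (the closed-form rank-to-bin map below is implied by the interval boundaries rather than stated explicitly in the source):

$$
k(d) \;=\; \left\lceil \tfrac{d}{5} \right\rceil, \qquad \text{e.g. } k(37) = \lceil 37/5 \rceil = 8 \;\Rightarrow\; \text{bin } B_8 = [36\text{–}40].
$$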
3. Dataset Composition
RippleBench-Bio is derived from WMDP-Bio, systematically expanded across Wikipedia:
- Unique evaluation concepts: 70,706
- Total multiple-choice questions: 352,961
- Questions per concept: ≈5
- Distance bins: 200 ([1–5], ..., [996–1000])
- Mean questions per bin: ≈1,765
- Topic coverage spans direct biothreats (distance ≤ 50), adjacent virology (50–100), broader biomedical topics (100–500), and unrelated knowledge (> 500).
Filtration during construction excluded only two topics due to model refusal.
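For a quick check of this composition, the dataset can be loaded with the Hugging Face `datasets` library; the repository path comes from the paper's release, while the split and field names used below ("topic", "distance_bin") are assumptions for illustration — consult the dataset card for the actual schema.

```python
from collections import Counter
from datasets import load_dataset

# Split and field names are assumed for illustration only.
ds = load_dataset("RippleBench/ripple-bench", split="test")

print(len(ds))                                     # total multiple-choice questions
print(len(set(ds["topic"])))                       # unique evaluation concepts
print(Counter(ds["distance_bin"]).most_common(5))  # largest distance bins by question count
```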
4. Ripple-Effect Evaluation Protocol
Quantitative evaluation is built on the following formal metrics:
- Knowledge-Delta: $\Delta(q) = u_M(q) - u_{M'}(q)$, where $u_M(q) \in \{0, 1\}$ denotes the correctness of model $M$ on question $q$.
- Ripple-Effect Function: $R(k) = \mathbb{E}_{q \in Q_k}[\Delta(q)]$, the mean knowledge-delta over questions in bin $k$.
- Accuracy Drop per Bin: $\Delta\mathrm{Acc}(k) = \mathrm{Acc}_M(Q_k) - \mathrm{Acc}_{M'}(Q_k)$, where $\mathrm{Acc}_M(Q_k)$ is the fraction of questions in $Q_k$ that model $M$ answers correctly.
- Variability: For each bin $Q_k$, both the mean and standard deviation of $\Delta(q)$ are computed; paired $t$-tests on per-question correctness vectors support statistical comparison.
This protocol enables measurement of direct and indirect impacts of editing, binned by semantic distance.
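The following is a minimal sketch of the per-bin metrics defined above, operating on per-question correctness vectors for the base and unlearned models; function and variable names are illustrative, not the released RippleBench API.

```python
import numpy as np
from scipy import stats

def ripple_profile(correct_base: dict, correct_unlearned: dict) -> dict:
    """correct_* : dict mapping bin index k -> 0/1 correctness array over the questions in Q_k."""
    profile = {}
    for k in sorted(correct_base):
        # Per-question knowledge-delta Δ(q) = u_M(q) - u_M'(q)
        delta = correct_base[k].astype(float) - correct_unlearned[k].astype(float)
        # Paired t-test on the two correctness vectors for statistical comparison
        t_stat, p_value = stats.ttest_rel(correct_base[k], correct_unlearned[k])
        profile[k] = {
            "acc_drop": float(delta.mean()),  # ΔAcc(k): mean accuracy drop in bin k
            "std": float(delta.std(ddof=1)),  # within-bin variability
            "p_value": float(p_value),
        }
    return profile
```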
5. Experimental Results and Side-Effect Profiles
Eight unlearning methods are evaluated on RippleBench-Bio:
| Method | Profiled Ripple Effects (Selected Bins) |
|---|---|
| GradDiff, TAR | 25% drop at distance 1, with persistent 5–10% deficit even at distance 50 |
| RMU, RMU+LAT, RR, PBJ | 15–20% drop at distance 1, gradual recovery but tails to 5–10% at distance 50 for more aggressive methods |
| ELM | Smoothest recovery, accuracy approximates baseline by bin 50–100 |
| RepNoise | Qualitatively similar to RMU and RR |
Key observations include monotonic recovery of accuracy with respect to distance from the unlearned concept, with the smoothest and most complete recovery afforded by ELM. Aggressive local forgetting (e.g., GradDiff, TAR) tends to yield more persistent long-distance deficits. Temporal analysis over eight unlearning checkpoints (e.g., RMU vs. ELM) demonstrates that some methods (ELM) afford partial later recovery, while others (RMU) continue to degrade distant performance.
A plausible implication is that unlearning methods differ not only in immediate specificity but also in second-order semantic propagation, with tradeoffs between aggressiveness at the source and long-distance collateral damage.
6. Usage and Toolchain
RippleBench-Maker and WikiRAG, as well as the dataset, are publicly available (github.com/RoyRin/ripple_bench, github.com/RoyRin/wiki-rag, huggingface.co/datasets/RippleBench/ripple-bench). On-the-fly evaluation is supported via:
- Python API:
```python
from ripple_bench import RippleEvaluator

evaluator = RippleEvaluator(
    model_base="llama3-8b-instruct",
    model_unlearned="llama3-8b-instruct-elm-ckpt8",
    dataset="RippleBench/ripple-bench",
)
df = evaluator.evaluate()  # Pandas DataFrame [bin, acc_base, acc_unlearned, ΔAcc, σ]
```
- Command-line interface:
```bash
ripple-eval \
  --model llama3-8b-instruct \
  --dataset RippleBench-Bio \
  --unlearning-method ELM \
  --output results/elm_results.json
```
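Using the DataFrame `df` returned by the Python API example above, a ripple profile can be visualized with a quick sketch like the following; the column names follow the comment in that snippet, and matplotlib is an assumed extra dependency rather than part of the RippleBench toolchain.

```python
import matplotlib.pyplot as plt

# Plot per-bin accuracy drop with a ±1σ band over semantic-distance bins.
plt.plot(df["bin"], df["ΔAcc"], label="accuracy drop (base − unlearned)")
plt.fill_between(df["bin"], df["ΔAcc"] - df["σ"], df["ΔAcc"] + df["σ"], alpha=0.2)
plt.xlabel("semantic-distance bin")
plt.ylabel("ΔAcc")
plt.legend()
plt.show()
```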
This infrastructure allows rapid, rigorous comparison of methods and checkpoints, facilitating reproducible research on model-editing ripple effects.
7. Context and Applications
RippleBench-Bio equips the research community with a comprehensive, semantically structured platform for measuring and analyzing unintended side-effects in LLM model-editing protocols, with a focus on unlearning in challenging, safety-relevant biomedical topics. It advances empirical rigor in quantifying model-editing specificity and continues to inform debates on safe and controllable model interventions (Rinberg et al., 3 Dec 2025). The framework is extensible to other domains where knowledge localization and ripple quantification are desirable.