- The paper introduces Adaptive RMU for selectively erasing sensitive factual data from LLMs while maintaining overall model performance.
- The paper employs dynamic scaling and layerwise analysis to optimize unlearning efficacy, identifying later transformer layers as key targets.
- The paper demonstrates that targeted unlearning minimizes membership inference risks and supports privacy compliance without significant utility loss.
Adaptive RMU for Unlearning Factual Knowledge from LLMs at SemEval-2025 Task 4
This paper systematically investigates Adaptive Representation Misdirection Unlearning (Adaptive RMU) for selective unlearning in LLMs, applied to the SemEval-2025 Task 4 competition. The paper targets factual knowledge, such as personally identifiable information (PII) and synthetic facts implanted via fine-tuning, a domain of high sensitivity due to privacy and regulatory concerns. The results demonstrate competitive unlearning efficacy alongside maintenance of model utility, with extensive analysis of the underlying layerwise dynamics.
Motivation and Problem Setting
Current LLMs retain information from their training data in a diffuse manner, which complicates targeted unlearning. Conventional unlearning methods often induce catastrophic forgetting, undermining generalization. SemEval-2025 Task 4 provides a controlled testbed, dividing data into "forget" and "retain" sets across subtasks that cover creative texts, synthetic biographies with PII, and segments of real training data. Evaluation spans regurgitation rates, membership inference metrics, and the MMLU benchmark (measuring general knowledge and reasoning).
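To make the regurgitation criterion concrete, here is a minimal, illustrative check (our sketch, not the official Task 4 scorer): prompt the model with a forget-set prefix and test whether its greedy continuation still reproduces the held-out reference. The function name and the 32-character prefix match are our own simplifications; a Hugging Face-style causal LM and tokenizer are assumed.

```python
import torch

def regurgitation_rate(model, tokenizer, pairs, max_new_tokens=64):
    """Fraction of (prompt, reference) forget-set pairs whose greedy continuation
    still begins with the reference text; lower is better after unlearning."""
    model.eval()
    hits = 0
    for prompt, reference in pairs:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        completion = tokenizer.decode(
            out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        hits += completion.strip().startswith(reference.strip()[:32])
    return hits / len(pairs)
```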
Approach: Adaptive RMU
Building on the RMU framework, Adaptive RMU modifies the model's internal representations as follows:
- Forget loss: For the forget set, activations at selected transformer decoder layers are steered toward a random direction, scaled adaptively by the magnitude of the original activations. This dynamic adjustment preserves stability across examples of varying norm.
- Retain loss: On the retain set, a standard L2 penalty enforces similarity between the activations of the adapted and the original frozen model.
- Objective: The total loss is a weighted sum that trades off removal of the targeted factual knowledge against preservation of general capability.
Implementation uses consecutive blocks of three decoder layers. The main hyperparameter is the selection of which layers to update, which the authors sweep systematically; adaptive scaling is managed per instance by the activation norms from the original frozen model.
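In symbols (our notation, not the paper's): writing $h^{(\ell)}_{\theta}(x)$ for the activation at a steered layer $\ell$, $u$ for a fixed random unit vector, $\beta$ for the scaling hyperparameter, and $\alpha$ for the retain weight, the objective optimized by the sketch below is approximately:

$$
\mathcal{L}_{\text{forget}} = \mathbb{E}_{x_f \sim D_{\text{forget}}}\left\lVert\, h^{(\ell)}_{\theta}(x_f) \;-\; \beta\,\bigl\lVert h^{(\ell)}_{\theta_{\text{frozen}}}(x_f)\bigr\rVert\, u \,\right\rVert_2^2,
\qquad
\mathcal{L}_{\text{retain}} = \mathbb{E}_{x_r \sim D_{\text{retain}}}\left\lVert\, h^{(\ell)}_{\theta}(x_r) - h^{(\ell)}_{\theta_{\text{frozen}}}(x_r) \,\right\rVert_2^2,
$$

$$
\mathcal{L} = \mathcal{L}_{\text{forget}} + \alpha\,\mathcal{L}_{\text{retain}}.
$$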
Pseudocode Sketch
import torch
import torch.nn.functional as F

# The random direction u is sampled once and kept fixed for the whole run;
# d is the hidden dimension of the steered layers.
u = torch.randn(d)
u = u / u.norm()

for epoch in range(num_epochs):
    for x_forget, x_retain in dataloader:
        # Forward passes: capture activations at the chosen layers l, l+1, l+2.
        with torch.no_grad():  # the frozen reference model receives no gradients
            a_frozen_forget = frozen_model.get_activations(x_forget, layers)
            a_frozen_retain = frozen_model.get_activations(x_retain, layers)
        a_adapted_forget = adapted_model.get_activations(x_forget, layers)
        a_adapted_retain = adapted_model.get_activations(x_retain, layers)

        # Adaptive scaling: beta is a fixed hyperparameter, but the target norm
        # adapts per example via the frozen model's activation norms.
        forget_target = beta * torch.norm(a_frozen_forget, dim=-1, keepdim=True) * u

        forget_loss = F.mse_loss(a_adapted_forget, forget_target)
        retain_loss = F.mse_loss(a_adapted_retain, a_frozen_retain)
        total_loss = forget_loss + alpha * retain_loss

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
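The `get_activations` call above is not a built-in API. One plausible way to realize it (our sketch, written as a standalone function and assuming an OLMo/GPT-style model whose decoder blocks live in `model.model.layers`) is with PyTorch forward hooks:

```python
import torch

def get_activations(model, input_ids, layer_indices):
    """Run one forward pass and return the hidden states of the requested decoder
    layers, stacked into a tensor of shape (num_layers, batch, seq_len, hidden_dim)."""
    captured = {}
    handles = []

    def make_hook(idx):
        def hook(module, inputs, output):
            # HF decoder blocks typically return a tuple; hidden states come first.
            captured[idx] = output[0] if isinstance(output, tuple) else output
        return hook

    for idx in layer_indices:
        handles.append(model.model.layers[idx].register_forward_hook(make_hook(idx)))
    try:
        model(input_ids=input_ids)
    finally:
        for h in handles:
            h.remove()
    return torch.stack([captured[idx] for idx in layer_indices])
```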
Empirical Findings
The principal results are as follows:
- Effectiveness of Adaptive RMU: The approach achieves strong aggregate scores, outperforming prior baselines (such as gradient-based methods and negative preference optimization) on the 1B- and 7B-parameter leaderboards.
- Layerwise Analysis:
  - Later decoder layers (12–14 for OLMo-1B; 24–26 for OLMo-7B) are optimal for unlearning factual information. This is evident in the final scores and in the robust reduction of membership inference attack (MIA) success, with minor trade-offs on knowledge retention (a simple loss-based MIA probe is sketched after this list).
  - Middle layers yield a favorable balance of task-aggregate and MMLU scores but are less robust against MIAs, implying latent vulnerabilities. Conversely, unlearning at the earliest layers is least effective for removing memorized factual data.
- Computational Constraints: All experiments run on a modest cluster (4×RTX 3090, 24 GB each); the hyperparameter search stays practical because layer selection is the single key knob.
- Preservation of Utility: Minimal impact is observed on the MMLU benchmark, especially when unlearning is targeted at later layers, confirming the separation between factual memorization and general reasoning in these regions of the model.
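As referenced above, a minimal loss-threshold membership-inference probe illustrates what the MIA evaluation measures (our sketch, not the task's official metric): after unlearning, forget-set examples should not look systematically more "familiar" (lower loss) to the model than unseen text.

```python
import torch

@torch.no_grad()
def per_example_losses(model, tokenizer, texts):
    """Per-example language-modeling loss; assumes a HF-style causal LM."""
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        losses.append(model(input_ids=ids, labels=ids).loss.item())
    return losses

def mia_advantage(member_losses, nonmember_losses):
    """Probability that a random forget-set example has lower loss than a random
    unseen example; 0.5 means the attacker can no longer tell them apart."""
    wins = sum(m < n for m in member_losses for n in nonmember_losses)
    return wins / (len(member_losses) * len(nonmember_losses))
```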
Comparative Baseline Analysis
Baseline methods such as gradient ascent, gradient difference, and KL minimization are either ineffective or lead to excessive degradation of the model, falling below MMLU performance thresholds. Adaptive RMU's task aggregate and privacy metrics represent a substantive empirical improvement over these techniques.
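For contrast, a minimal sketch of the gradient-difference baseline as commonly formulated (our illustration, not the paper's code): ascend the language-modeling loss on the forget batch while descending it on the retain batch, assuming a Hugging Face-style causal LM that returns `.loss` when `labels` are supplied.

```python
def gradient_difference_step(model, optimizer, forget_ids, retain_ids):
    """One optimization step of the gradient-difference baseline."""
    forget_loss = model(input_ids=forget_ids, labels=forget_ids).loss
    retain_loss = model(input_ids=retain_ids, labels=retain_ids).loss
    # Negating the forget term turns descent into ascent on the forget set.
    loss = -forget_loss + retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```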
Implications and Future Directions
This paper demonstrates that targeted layerwise unlearning with adaptive scaling can selectively erase factual knowledge, including synthetic PII, without substantial degradation of general capability in LLMs of both moderate and larger scale. Crucially, it highlights that factual knowledge concentrates in later transformer layers, contrasting with previous work on hazardous knowledge (e.g., the WMDP benchmark), where such knowledge appears more distributed or situated in earlier layers.
Key implications include:
- Practical Contribution: The method enables compliance with privacy regulations such as the GDPR's right to erasure without requiring full model retraining or producing marked performance regressions.
- Mechanistic Interpretability: The layerwise findings support and refine current mechanistic hypotheses about the localization of knowledge types within transformer architectures.
- Scalability: Minimal reliance on intensive hyperparameter tuning, and applicability to various LLM architectures, make this method a strong candidate for production-grade privacy and knowledge management workflows.
Future work could extend these findings in several ways:
- Finer-grained layer selection—to further minimize utility loss during unlearning.
- Exploration of broader knowledge domains (e.g., deep conceptual or multilingual knowledge).
- Integration with interpretability tools to identify causal pathways of memorization and retention, synergizing with editing and redaction pipelines.
Conclusion
The results underscore Adaptive RMU as a viable, efficient solution for targeted unlearning in LLMs. The insights into layer dependence inform both practical model editing strategies and a deeper understanding of transformer internal dynamics, marking a clear advancement for privacy, safety, and robust deployment of large-scale language technologies.