R²MU: Reasoning-Aware Representation Unlearning
- The paper introduces R²MU, a targeted machine unlearning paradigm that intervenes in internal belief representations to suppress spurious knowledge and reinforce accurate inferences.
- It employs advanced trace extraction techniques—such as FBBS for textual models and tailored attribution for multimodal architectures—to isolate and classify reasoning paths.
- Empirical evaluations on benchmarks like HotpotQA and MMLU demonstrate substantial improvements in unlearning accuracy and retention, highlighting R²MU’s practical impact.
Reasoning-aware Representation Misdirection for Unlearning (R²MU) is a targeted machine unlearning paradigm for LLMs and large reasoning models (LRMs) that intervenes not only at the surface level of model outputs but directly within the internal representations underlying reasoning and factual recall. By localizing, classifying, and then modifying internal belief- or knowledge-bearing structures, R²MU fundamentally seeks to suppress spurious or undesirable knowledge while preserving or bolstering accurate and desirable inference behaviors. This framework, through variants adapted to both text and multimodal architectures, addresses the limits of earlier approaches that failed to eradicate latent knowledge—often leaving it only superficially suppressed.
1. Conceptual Foundations and Formal Structure
R²MU centers on the idea that model errors or hazardous behaviors originate from internal “beliefs” or structured activations governing inference. For a model with a set of natural-language propositions $\mathcal{P}$ and an internal belief predicate $\mathrm{Bel}(\cdot)$, the belief space is $\mathcal{B} = \{p \in \mathcal{P} : \mathrm{Bel}(p)\}$. For a given prompt $x$ and two candidate answers ($y_{\text{wrong}}$, $y_{\text{gold}}$), the belief subsets $\mathcal{B}_{\text{spur}}$ (spurious) and $\mathcal{B}_{\text{true}}$ (true) comprise those beliefs implicated in the model’s incorrect and correct reasoning paths, respectively. The goal is then to systematically identify, categorize, and rectify these beliefs through targeted parameter and representation updates (Niwa et al., 28 Feb 2025, Wang et al., 15 Jun 2025).
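As a minimal sketch of the partition described above, the snippet below splits extracted beliefs into spurious and true subsets according to the answer each belief was extracted for. The `Belief` class, its field names, and `partition_beliefs` are hypothetical names introduced for illustration; the cited papers may classify beliefs differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    """Hypothetical container for one extracted belief."""
    text: str       # natural-language proposition
    leads_to: str   # answer whose reasoning path the belief was extracted for


def partition_beliefs(beliefs, gold_answer):
    """Split beliefs into (spurious, true) by the answer they support.

    Assumption: a belief is 'spurious' if it was implicated in a path
    to a non-gold answer, and 'true' if it supports the gold answer.
    """
    spurious = [b for b in beliefs if b.leads_to != gold_answer]
    true = [b for b in beliefs if b.leads_to == gold_answer]
    return spurious, true


# Toy usage with placeholder propositions.
beliefs = [Belief("p1", "A"), Belief("p2", "B"), Belief("p3", "A")]
spurious, true = partition_beliefs(beliefs, gold_answer="B")
```

Here `spurious` holds the beliefs tied to answer "A" and `true` holds the one tied to the gold answer "B".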
2. Extraction of Reasoning Traces and Internal Beliefs
Identifying the relevant beliefs or reasoning traces requires dedicated trace extraction strategies. In text-only LLMs, the Forward-Backward Beam Search (FBBS) algorithm is used to prompt for explicit beliefs leading to each answer. FBBS maximizes a joint probability objective, using a staged re-ranking that dynamically balances forward and backward log-probabilities over the belief tokens. This approach extracts beliefs whose activation causally underpins the model’s observed answers, which permits post-hoc classification into spurious and true categories (Niwa et al., 28 Feb 2025).
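The re-ranking idea can be illustrated with a toy sketch. It assumes each candidate belief already carries a forward and a backward log-probability, and that a stage scores candidates by a linear interpolation of the two. The interpolation, the weight schedule, and the function names are assumptions made for illustration, not the published FBBS procedure.

```python
def rerank(candidates, backward_weight, keep):
    """Keep the `keep` best candidates under an interpolated score.

    candidates: list of (belief_text, forward_logprob, backward_logprob).
    backward_weight: 0.0 uses only the forward score, 1.0 only the backward.
    """
    def score(candidate):
        _, fwd, bwd = candidate
        return (1.0 - backward_weight) * fwd + backward_weight * bwd

    return sorted(candidates, key=score, reverse=True)[:keep]


def staged_rerank(candidates, weights, keep_per_stage):
    """Apply several re-ranking stages, each with its own weight and beam size."""
    for weight, keep in zip(weights, keep_per_stage):
        candidates = rerank(candidates, weight, keep)
    return candidates


# Toy candidates with made-up log-probabilities.
candidates = [
    ("a", -1.0, -5.0),
    ("b", -2.0, -1.0),
    ("c", -3.0, -0.5),
    ("d", -6.0, -6.0),
]
# Stage 1 ranks by forward score only; stage 2 balances both directions.
best = staged_rerank(candidates, weights=[0.0, 0.5], keep_per_stage=[3, 1])
# best == [("b", -2.0, -1.0)]
```

Candidate "a" wins on the forward score alone but is displaced once the backward score is taken into account, which is the behavior a forward-backward balance is meant to capture.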
In multimodal architectures, belief identification generalizes to modality-specific influential neuron paths, constructed by maximizing per-layer attribution scores: textual traces are found through Inter-layer Gradient Integration (IGI), while visual activations are traced via Inter-layer Fisher Integration (IFI). These approaches yield paths of high-attribution neurons (one per layer) in each modality, operationalizing the “reasoning trace” at the level of model internals rather than explicit CoT tokens (Li et al., 10 Nov 2025).
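A simplified sketch of path construction: given a per-layer table of attribution scores, pick the highest-scoring neuron in each layer. This greedy version treats layers independently, whereas the inter-layer scores described above presumably couple adjacent layers; the score values and the function name are hypothetical.

```python
def influential_path(attributions):
    """Return one neuron index per layer, chosen by maximal attribution.

    attributions[l][n] is the (precomputed) attribution score of
    neuron n in layer l. How the scores are obtained (e.g. gradient-
    or Fisher-based integration) is outside this sketch.
    """
    return [
        max(range(len(layer)), key=layer.__getitem__)
        for layer in attributions
    ]


# Toy scores for a 3-layer model with 3 neurons per layer.
scores = [
    [0.1, 0.9, 0.3],
    [0.5, 0.2, 0.4],
    [0.0, 0.1, 0.7],
]
path = influential_path(scores)  # [1, 0, 2]
```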
3. Representation Misdirection Objectives
The core innovation of R²MU lies in its gradient-based misdirection of hidden representations. Once the target beliefs or neuron paths are determined, learning objectives are formulated to:
- Suppress spurious beliefs by maximizing the loss for producing erroneous outputs given those beliefs.
- Enhance true beliefs by minimizing the loss for gold outputs given the identified correct beliefs.
Mathematically, this is instantiated as
$$\min_{\theta} \; \mathcal{L}_{\text{true}}(\theta) - \mathcal{L}_{\text{spur}}(\theta),$$
where $\mathcal{L}_{\text{spur}}$ and $\mathcal{L}_{\text{true}}$ are cross-entropy losses over the spurious and true belief sets, respectively. Optimization is typically by gradient ascent on the spurious term, explicitly reweighting the model such that spurious inferences become less likely, and true inferences more likely, when conditioned on their activation (Niwa et al., 28 Feb 2025).
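To make the two cross-entropy terms concrete, the sketch below evaluates a combined objective on toy answer distributions: the loss on gold answers minus a weighted loss on erroneous answers. The `spur_weight` parameter and all function names are assumptions for illustration, and the toy reuses one answer distribution for both terms, whereas in the described setting each term would be conditioned on its own belief set.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability the model assigns to `target`."""
    return -math.log(probs[target])


def mean_loss(cases):
    """Average cross-entropy over (answer_distribution, target) pairs."""
    return sum(cross_entropy(p, t) for p, t in cases) / len(cases)


def belief_objective(spurious_cases, true_cases, spur_weight=1.0):
    """Quantity to minimize: low loss on gold answers, high loss on errors."""
    return mean_loss(true_cases) - spur_weight * mean_loss(spurious_cases)


# Toy distributions over two answers before and after an update.
before = {"fish": 0.8, "human": 0.2}   # model prefers the wrong answer
after = {"fish": 0.2, "human": 0.8}    # model prefers the gold answer

obj_before = belief_objective([(before, "fish")], [(before, "human")])
obj_after = belief_objective([(after, "fish")], [(after, "human")])
# obj_after < obj_before: shifting mass to the gold answer lowers the objective.
```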
In the context of factual or sensitive knowledge unlearning, the loss decomposes further: hidden representations for forget-set inputs are steered toward random vectors (decoys), while representations for retain-set inputs are preserved close to their original values. This dual objective is commonly formalized as
$$\mathcal{L} = \mathcal{L}_{\text{forget}} + \lambda\,\mathcal{L}_{\text{retain}},$$
where the forget loss $\mathcal{L}_{\text{forget}}$ enforces deviation from the original representation (magnitude-matched to ensure stable geometry), the retain loss $\mathcal{L}_{\text{retain}}$ anchors retention of general capabilities, and $\lambda$ weights the two terms (Wang et al., 15 Jun 2025, Li et al., 10 Nov 2025).
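A minimal sketch of such a forget/retain objective on plain Python vectors follows. It assumes mean squared distances for both terms and reads "magnitude-matched" as rescaling the random decoy to the norm of the original representation; both choices, along with `retain_weight` and the function names, are assumptions rather than the exact formulation in the cited work.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def mean_sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def random_decoy(reference, rng):
    """Random direction rescaled to the norm of `reference`."""
    direction = [rng.gauss(0.0, 1.0) for _ in reference]
    scale = norm(reference) / norm(direction)
    return [scale * x for x in direction]


def misdirection_loss(h_forget, decoy, h_retain, h_retain_orig,
                      retain_weight=1.0):
    """Forget term pulls h_forget to the decoy; retain term anchors h_retain."""
    forget = mean_sq_dist(h_forget, decoy)
    retain = mean_sq_dist(h_retain, h_retain_orig)
    return forget + retain_weight * retain


rng = random.Random(0)
h_forget_orig = [3.0, 4.0]                 # toy forget-set representation
decoy = random_decoy(h_forget_orig, rng)   # same norm, random direction
loss = misdirection_loss(h_forget_orig, decoy, [1.0, 2.0], [1.0, 2.0])
```

The loss reaches zero only when the forget-set representation coincides with the decoy and the retain-set representation is unchanged, which matches the two goals stated above.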
4. Algorithmic Procedures and Implementation
The concrete instantiation of R²MU varies with the modality and model architecture:
- Textual LLMs: After extracting the spurious and true belief sets via FBBS, batch-wise gradients are computed for the losses defined in the previous section. Parameters are updated so that the model’s reliance on spurious beliefs is reduced and its reliance on true beliefs is amplified. This representation-level update “misdirects” the latent features toward or away from particular inference traces (Niwa et al., 28 Feb 2025).
- Multimodal LLMs (MLLMs): The MIP-Editor variant identifies the influential neuron path for each modality, then zeroes activations on these paths (blocking the information flow) before finetuning only the implicated neurons. During finetuning, the R²MU loss pushes the forget-set representations into random subspaces (sampled on each update) and retains non-target reasoning by aligning retain-set paths to their pre-edit values. Only the pruned path neurons are updated, maximizing both unlearning selectivity and utility preservation (Li et al., 10 Nov 2025).
- Sensitive Reasoning Trace Unlearning: For reasoning models producing multi-step chains-of-thought (CoT), R²MU extends the loss to both the final output and all intermediary reasoning steps. Each CoT segment is prompted and misdirected independently. A “reasoning-preservation” term, evaluated on clean CoT data, anchors the model’s reasoning skills during unlearning, preventing catastrophic forgetting (Wang et al., 15 Jun 2025).
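The path-restricted editing described for the multimodal variant (block the path, then update only the path neurons) can be sketched as below. Each neuron is reduced to a single scalar parameter for brevity, and `block_path` and `masked_step` are hypothetical helper names; a real implementation would mask rows of weight matrices in a deep-learning framework.

```python
def block_path(activations, path):
    """Return a copy of the activations with each path neuron zeroed."""
    blocked = [list(layer) for layer in activations]
    for layer_idx, neuron_idx in enumerate(path):
        blocked[layer_idx][neuron_idx] = 0.0
    return blocked


def masked_step(params, grads, path, lr):
    """Gradient step applied only to the path neuron of each layer."""
    updated = [list(layer) for layer in params]
    for layer_idx, neuron_idx in enumerate(path):
        updated[layer_idx][neuron_idx] -= lr * grads[layer_idx][neuron_idx]
    return updated


path = [0, 1]                                  # one neuron per layer
activations = [[0.7, 0.2], [0.4, 0.9]]
params = [[1.0, 1.0], [1.0, 1.0]]
grads = [[0.5, 0.5], [0.5, 0.5]]

blocked = block_path(activations, path)        # [[0.0, 0.2], [0.4, 0.0]]
new_params = masked_step(params, grads, path, lr=0.1)
# Only params[0][0] and params[1][1] change; all other entries stay at 1.0.
```

Restricting the update to path neurons is what keeps the edit selective: parameters off the path cannot drift, regardless of the gradient they receive.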
5. Evaluation Protocols and Empirical Performance
R²MU-based systems are compared to baselines that suppress only answer tokens, suppress entire data instances without reasoning trace localization, or edit only surface-level outputs. Quantitative evaluation employs metrics such as:
- Final-Answer Unlearning Accuracy (FA-UA)
- Reasoning-Trace Unlearning Accuracy (RT-UA)
- Overall utility retention (e.g., on MMLU, AIME, or domain QA)
- Forgetting rate and retention rate in multimodal settings
Empirical results demonstrate:
- On HotpotQA, SciQ, and OpenBookQA, belief-level unlearning via R²MU produces substantial gains (2–6.4 percentage points) in corrected error rates and generalization, outperforming suppression-at-output or raw-data baselines (Niwa et al., 28 Feb 2025).
- In multimodal settings, MIP-Editor achieves a forgetting rate of up to 87.75%, with retention performance rising by as much as 54.26% relative to naïve full-model or point-wise neuron editing baselines (Li et al., 10 Nov 2025).
- For reasoning models, R²MU achieves near-complete suppression of reasoning-trace leakage, driving RT-UA to ~1% while retaining >80% pre-unlearning accuracy on math and safety domains. By contrast, classical representation misdirection or negative-preference algorithms either leave traces or degrade general reasoning (Wang et al., 15 Jun 2025).
- Representation analysis reveals that most spurious beliefs are not memorized verbatim from pretraining data, highlighting the necessity of reasoning-aware (rather than instance-centric) unlearning.
| Model/Task | Pre-unlearning | R²MU Unlearned | Retention Baseline | Forgetting Baseline |
|---|---|---|---|---|
| HotpotQA (train gain, pp) | – | +6.4 | <+2.4 | <+4.0 |
| FVQA (forget %, Qwen-VL) | – | 87.75 | 0–12 | 30–60 |
| RT-UA, LRM (%) | 72.49 | 1.02 | 19.71 | – |
| Retention (MMLU, LRM) | 53.00 | 46.36 | 46.00 | – |
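The two LRM rows of the table can be checked with simple arithmetic. The snippet below uses only the table's own numbers to compute the absolute drop in RT-UA and the fraction of pre-unlearning MMLU accuracy that is retained; the variable names are illustrative.

```python
# Values copied from the LRM rows of the table above.
rt_ua_before, rt_ua_after = 72.49, 1.02
mmlu_before, mmlu_after = 53.00, 46.36

# Absolute reduction in reasoning-trace leakage, in percentage points.
rt_ua_drop_pp = rt_ua_before - rt_ua_after        # about 71.47

# Share of pre-unlearning MMLU accuracy that survives unlearning.
relative_retention = mmlu_after / mmlu_before     # about 0.8747

print(f"RT-UA drop: {rt_ua_drop_pp:.2f} pp")
print(f"MMLU retained: {relative_retention:.1%}")
```

On these numbers, MMLU accuracy after unlearning is roughly 87.5% of its pre-unlearning value.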
6. Generalization, Limitations, and Current Directions
The reasoning- and representation-centric philosophy of R²MU enables robust out-of-domain and evaluation-set generalization. Suppression of entire reasoning subspaces, as opposed to outputs alone, mitigates overfitting to training artifacts and produces more globally consistent model corrections. In multimodal and reasoning-prior architectures, trade-offs arise between depth/rank of suppression and utility retention, with higher-capacity models exhibiting sharper signature localization and therefore supporting stronger, stabler erasure (Mahmood et al., 15 Jan 2026).
Limitations include the substantial computational overhead of search-based belief extraction (FBBS), the need for fine-grained hyperparameter tuning, the lack of certified guarantees on total erasure, and the absence of human-in-the-loop assessment for extracted beliefs or traces. Theoretical understanding of the representational "move" versus full retraining remains an open topic. Adversarial relearning and continual adaptation of R²MU-edited models have not been systematically studied (Niwa et al., 28 Feb 2025, Wang et al., 15 Jun 2025, Mahmood et al., 15 Jan 2026).
7. Future Directions and Broader Implications
Future research directions highlighted in the literature include:
- Algorithmic acceleration or neural-approximate variants of FBBS and path attribution, reducing extractive cost (Niwa et al., 28 Feb 2025).
- Extending reasoning-aware unlearning to open-ended CoT generation, multi-hop reasoning, and further multimodal and multi-step architectures.
- Development of formal bounds for the erasure of intermediate reasoning traces and extension of dual-metric evaluation protocols distinguishing surface suppression from persistent latent traces (Mahmood et al., 15 Jan 2026).
- Exploration of curriculum-based or multi-stage unlearning strategies to progressively restructure belief and knowledge hierarchies with minimal parameter drift or utility loss.
- Integration of human or automated fact-validation in the belief extraction and classification pipeline.
A plausible implication is that as models become more entangled and capable, truly durable unlearning will demand increasingly sophisticated, reasoning-localized interventions—echoing a broader trend away from surface behavioral control and toward mechanistic alteration of internal representational flows.