Representation Misdirection Unlearning (RMU)
- RMU is a class of techniques that steer latent representations to selectively forget undesired information while maintaining core functionality.
- Techniques such as standard, adaptive, and feature-selective RMU modulate intermediate activations by targeting specific geometrical directions.
- Empirical studies reveal effective forgetting with minimal impact on utility, though challenges remain in ensuring irreversible and robust unlearning.
Representation Misdirection Unlearning (RMU) is a class of machine unlearning techniques in deep neural networks—especially LLMs and multimodal models—that achieve selective forgetting by purposefully steering the internal representations of forget-set inputs toward specific target directions in latent space. Rather than attempting to erase knowledge solely by adjusting weights or manipulating model outputs, RMU directly operates on the geometry of intermediate activations, seeking to disrupt, ablate, or repurpose the latent manifold encoding unwanted information, while stabilizing features on the retain set. Recent advances in RMU have elucidated its mechanistic underpinnings, generalization properties, and implications for both robust unlearning and controllable behavior modulation in foundation models (Dang et al., 29 Jan 2026).
1. Theoretical Foundations and Core Mechanisms
RMU is grounded in the premise that the representations of key concepts in LLMs and related models can be organized geometrically within high-dimensional hidden spaces. The foundational RMU objective consists of two terms: a forget loss that drives the representation of a forget-set input toward a “misdirection” target (typically a fixed random vector, a structured “concept direction,” or a manifold anchor), and a retain loss that regularizes the representations of retain-set inputs to remain close to those of a reference (usually the frozen, pretrained model):
$$\mathcal{L}_{\mathrm{RMU}} = \mathbb{E}_{x_f \in \mathcal{D}_f}\,\lVert h_\theta(x_f) - t \rVert_2^2 \;+\; \alpha\,\mathbb{E}_{x_r \in \mathcal{D}_r}\,\lVert h_\theta(x_r) - h_{\theta_0}(x_r) \rVert_2^2$$
where $h_\theta(x)$ and $h_{\theta_0}(x)$ are the activations of the updated and frozen models at the intervention layer, $\alpha$ weights the retain term, and $t$ is the chosen misdirection target, often $t = c\,u$ with $u$ a random unit vector and $c$ a norm-matching scale (Dang et al., 29 Jan 2026, Huu-Tien et al., 2024). By systematically perturbing only the forget-representations, RMU suppresses a model’s ability to utilize, reconstruct, or recall unwanted knowledge with minimal disruption to unrelated capabilities.
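The two-term objective described above can be sketched numerically. This is a minimal illustration on precomputed activation matrices, not the authors' training code: the function name `rmu_losses`, the retain weight `alpha`, the scale `c`, and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def rmu_losses(h_forget, h_retain, h_retain_frozen, target, alpha=1.0):
    """Return (forget_loss, retain_loss, total) for one batch of activations.

    h_forget:        (n_f, d) updated-model activations on forget inputs
    h_retain:        (n_r, d) updated-model activations on retain inputs
    h_retain_frozen: (n_r, d) frozen-model activations on the same retain inputs
    target:          (d,) misdirection target, e.g. c * u
    alpha:           retain weight (hypothetical default, not from the source)
    """
    # Forget term: squared distance of each forget activation to the target.
    forget_loss = np.mean(np.sum((h_forget - target) ** 2, axis=1))
    # Retain term: squared drift of retain activations from the frozen model.
    retain_loss = np.mean(np.sum((h_retain - h_retain_frozen) ** 2, axis=1))
    return forget_loss, retain_loss, forget_loss + alpha * retain_loss

rng = np.random.default_rng(0)
d = 16                                   # toy hidden size (assumption)
u = rng.normal(size=d)
u /= np.linalg.norm(u)                   # random unit vector
c = 5.0                                  # illustrative scale (assumption)
target = c * u

h_retain_frozen = rng.normal(size=(4, d))
h_forget = rng.normal(size=(3, d))
# Before any update, retain activations equal the frozen ones, so only the
# forget term contributes to the total.
forget_loss, retain_loss, total = rmu_losses(
    h_forget, h_retain_frozen.copy(), h_retain_frozen, target
)
```

In an actual training loop the activations would come from a forward hook on the chosen layer, and only the updated model's parameters would receive gradients.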
A key formal advance is the linear representation hypothesis, under which high-level concepts (truthfulness, sentiment, refusal, context cues) correspond to one-dimensional directions in hidden space. RMU generalizes to both additive interventions (steering along a concept vector $v$) and ablative interventions (removing the $v$ component), allowing for structured behavioral controls beyond simple forgetting.
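The additive and ablative interventions mentioned above reduce to simple vector operations once a concept direction is fixed. The sketch below assumes a single direction and toy activations; the function names `steer` and `ablate` and the `strength` parameter are illustrative, not taken from the cited work.

```python
import numpy as np

def steer(h, v, strength):
    """Additive intervention: shift every activation along the unit concept direction."""
    v_hat = v / np.linalg.norm(v)
    return h + strength * v_hat

def ablate(h, v):
    """Ablative intervention: remove each activation's component along the direction."""
    v_hat = v / np.linalg.norm(v)
    # (h @ v_hat) is the per-row projection coefficient; subtract it back out.
    return h - np.outer(h @ v_hat, v_hat)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))   # toy activations: 5 tokens, hidden size 8
v = rng.normal(size=8)        # stand-in for a learned concept direction

h_steered = steer(h, v, strength=2.0)
h_ablated = ablate(h, v)
```

After ablation the activations are orthogonal to the concept direction, while steering changes the projection onto it by exactly `strength`.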
2. Algorithmic Variants and Practical Implementations
Modern RMU encompasses a range of algorithmic designs:
a) Standard and Adaptive RMU
In basic RMU, the misdirection strength $c$ (the scale applied to the target direction) is fixed. Adaptive RMU refines this by matching the steering coefficient to the norm of the frozen representation: $c = \beta\,\lVert h_{\theta_0}(x_f) \rVert$, where $h_{\theta_0}(x_f)$ is the frozen model's activation on forget input $x_f$ and $\beta$ is a scaling factor. This enables effective unlearning even in deeper layers where hidden norms are large (Huu-Tien et al., 2024, Dosajh et al., 19 Jun 2025). Algorithms select a few layers (often early-to-mid) as intervention points due to the locality of semantic encoding and observed effects on downstream outputs.
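A minimal sketch of the adaptive coefficient follows, assuming the norm is taken per forget sample and that `beta` is a user-chosen scaling factor (both are assumptions of this example, as is the function name `adaptive_targets`).

```python
import numpy as np

def adaptive_targets(h_frozen_forget, u, beta=5.0):
    """Per-sample targets beta * ||h_frozen(x_f)|| * u.

    h_frozen_forget: (n, d) frozen-model activations on forget inputs
    u:               (d,) unit-norm misdirection direction
    beta:            scaling factor (hypothetical default)
    """
    norms = np.linalg.norm(h_frozen_forget, axis=1, keepdims=True)  # (n, 1)
    return beta * norms * u                                         # (n, d)

rng = np.random.default_rng(0)
u = rng.normal(size=8)
u /= np.linalg.norm(u)

# Activations with very different norms, as might occur at different depths.
h_frozen = rng.normal(size=(4, 8)) * np.array([[1.0], [10.0], [50.0], [200.0]])
targets = adaptive_targets(h_frozen, u, beta=5.0)
```

Each target's norm scales with the corresponding activation norm, so the forget loss keeps a comparable relative magnitude regardless of how large the hidden states are at the chosen layer.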
b) Feature-Selective and Directional RMU (SRMU)
SRMU introduces an activation importance mask $m$ and a polarity-controlled vector $s$, so that only critical feature dimensions are misdirected for forget samples: the misdirection target is applied on the dimensions selected by $m$, with the sign of each dimension set by $s$. The mask $m$ is computed as a function of activation contrasts between forget and retain sets, sharply localizing the perturbation and enabling robust unlearning under high entanglement (Chen et al., 18 Dec 2025).
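The idea of restricting misdirection to a few contrast-selected dimensions can be illustrated as follows. This is a hypothetical construction, not the SRMU formulation: the top-`k` rule on mean activation differences, the use of the contrast sign as polarity, and the names `contrast_mask` and `masked_forget_loss` are all assumptions made for the sketch.

```python
import numpy as np

def contrast_mask(h_forget, h_retain, k):
    """Pick the k dimensions whose mean activation differs most between sets.

    Returns a 0/1 mask and a per-dimension sign (polarity) vector.
    """
    contrast = h_forget.mean(axis=0) - h_retain.mean(axis=0)
    top = np.argsort(np.abs(contrast))[-k:]
    mask = np.zeros_like(contrast)
    mask[top] = 1.0
    polarity = np.sign(contrast)
    return mask, polarity

def masked_forget_loss(h_forget, mask, polarity, c):
    """Squared error to a signed target, counted only on masked dimensions."""
    target = c * mask * polarity
    return np.mean(np.sum(mask * (h_forget - target) ** 2, axis=1))

rng = np.random.default_rng(0)
h_retain = rng.normal(size=(20, 10))
h_forget = rng.normal(size=(20, 10))
h_forget[:, 2] += 4.0   # dimensions 2 and 7 carry the forget-specific signal
h_forget[:, 7] -= 4.0

mask, polarity = contrast_mask(h_forget, h_retain, k=2)
loss = masked_forget_loss(h_forget, mask, polarity, c=5.0)
```

Because the loss is zero outside the mask, gradients from the forget term would leave the unselected dimensions untouched, which is the localization property the text describes.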
c) Concept-Vector and Meta-Learned RMU
Instead of random directions, concept-vector RMU identifies a linear probe aligned with a conceptual attribute, such as truth or refusal. Projections along this direction achieve not only forgetting but also behavioral modulation (e.g., promoting refusal or boosting in-context learning) (Dang et al., 29 Jan 2026). Meta-learning, e.g., via REINFORCE-style updates of tradeoff weights, removes the need for manual tuning of forget/retain coefficients (Atil et al., 26 May 2026). Cosine-distance losses focus the intervention on angular disentanglement rather than Euclidean displacement.
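A cosine-distance forget loss against a concept direction might look like the sketch below. The direction is estimated here with a difference of class means, which is a simple stand-in for the linear probe mentioned above rather than the cited method; all names and toy data are illustrative.

```python
import numpy as np

def difference_of_means_direction(h_pos, h_neg):
    """Unit direction from mean activations on negative to positive examples."""
    v = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return v / np.linalg.norm(v)

def cosine_distance_loss(h, v):
    """Mean of 1 - cos(h_i, v): depends on angle only, not on activation norm."""
    cos = (h @ v) / (np.linalg.norm(h, axis=1) * np.linalg.norm(v))
    return np.mean(1.0 - cos)

rng = np.random.default_rng(0)
h_pos = rng.normal(size=(30, 12)) + 1.0   # toy activations with the attribute
h_neg = rng.normal(size=(30, 12)) - 1.0   # toy activations without it
v = difference_of_means_direction(h_pos, h_neg)

h_forget = rng.normal(size=(6, 12))
loss = cosine_distance_loss(h_forget, v)
```

Rescaling the activations leaves this loss unchanged, which is what distinguishes angular objectives from the Euclidean forget term.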
d) Multi-layer and Cross-Task Extensions
Variants such as Erase at the Core (EC) and reasoning-aware RMU (R²MU) extend RMU to deep supervision across multiple layers and, in large reasoning models (LRMs), apply trace misdirection at every step of a chain-of-thought trajectory (Wang et al., 15 Jun 2025, Lee et al., 5 Feb 2026).
3. Empirical Results and Benchmarks
Across diverse tasks and architectures, RMU and its variants consistently demonstrate:
- Rapid and effective forgetting: On WMDP hazardous-knowledge QA, accuracy drops from ≈64% to random chance (≈25%); on biology/PII benchmarks, accuracy on forget sets after RMU typically falls below 20% (Dang et al., 29 Jan 2026, Doshi et al., 2024).
- Minimal utility loss: Retain-set performance (e.g., MMLU, MT-Bench, COCO CLIP similarity) generally remains within a few points of the original model (Dang et al., 29 Jan 2026, Lee et al., 23 Feb 2026).
- Robustness to naive attacks: RMU provides defense against many prompt-based and gradient-based (jailbreak) attacks due to decoupling of input tokens and post-misdirection representations (Huu-Tien et al., 2024).
SRMU achieves state-of-the-art tradeoffs even in high-entanglement settings and outperforms prior global/noise-based unlearning approaches.
| RMU Variant | Forget QA % ↓ | Retain Utility % ↑ | Robustness Notes |
|---|---|---|---|
| Vanilla RMU | 10–15 | 57–58 | Robust to black-box prompts/jailbreak |
| Adaptive RMU | 10–13 | 57–58 | Effective at all layers |
| Feature-SR (SRMU) | 25–38 | 52–57 | Best for entangled forget/retain |
| Concept-Vector | <25 | ≈58 | Adds controllability/capability |
Empirical results confirm that RMU enables precise behavioral modulation, such as shifting sentiment or refusal properties or amplifying in-context learning (Dang et al., 29 Jan 2026).
4. Risks, Limitations, and Bypasses
Despite the success of RMU-class methods in benchmark settings, several studies have revealed significant vulnerabilities:
- Recoverability: Fine-tuning the unlearned model on a small number (10–50) of unrelated retain-set examples or even on unrelated language data can completely recover pre-unlearning capabilities on the forget-set (Łucki et al., 2024, Doshi et al., 2024). This indicates that RMU frequently “hides” rather than irrevocably erases the targeted knowledge.
- Prompt engineering attacks: Five-shot prompting, simple rephrasings, and representation orthogonalization at inference can restore the original behavior, highlighting the shallow nature of many RMU-induced changes.
- Superficial forgetting: Internal representations often remain highly discriminative for the forgotten classes or knowledge (as measured by linear probing, CKA, or k-NN accuracy), with only classifier-level misalignment preventing output-level recall (“feature-classifier misalignment”) (Gao et al., 9 Apr 2026).
- Feature-level attacks: Adaptive adversaries using activation subtraction or pruning can re-enable forgotten behaviors with minimal effort.
Consequently, current RMU instantiations do not satisfy strong irreversibility guarantees, and output-only evaluations are insufficient (Łucki et al., 2024, Gao et al., 9 Apr 2026).
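One representation-level check mentioned above is CKA. The sketch below implements the standard linear CKA formula on two activation matrices; how it would be wired into an unlearning evaluation (which layer, which inputs) is left as an assumption, and the toy data is only for illustration.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between activations X (n, d1) and Y (n, d2).

    Rows must correspond to the same inputs. Values near 1 indicate that the
    two representations share most of their structure.
    """
    X = X - X.mean(axis=0)   # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
before = rng.normal(size=(50, 8))            # stand-in: activations before unlearning
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
rotated = 2.0 * before @ Q                    # same geometry, rotated and rescaled
unrelated = rng.normal(size=(50, 8))

cka_same = linear_cka(before, rotated)
cka_diff = linear_cka(before, unrelated)
```

A high CKA between forget-set activations before and after unlearning would be consistent with the superficial-forgetting finding, since rotation or rescaling alone does not lower the score.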
5. Extensions: Federated, Multimodal, and Security Applications
RMU frameworks have been adapted well beyond single-model, single-task settings:
- Vertical Federated Unlearning: REMISVFU collapses the encoder outputs of a forgetting party in a split VFL system to a constant random anchor, incorporates coordinated gradient projection for retain/forget losses, and achieves near-optimal suppression of membership inference and backdoor attack success at minimal utility cost (Wu et al., 11 Dec 2025).
- Multimodal/AR-LMM Security: Fisher-weighted RMU (F-RMU) within the UNSEEN defense stack uses integrated Fisher scores to localize parameter updates, sparsely injects misdirection into sensitive profile features, and demonstrates a >60% reduction in measured AR-based social engineering vulnerability, while preserving benign utility (Yu et al., 25 Apr 2026).
- Diffusion and T2I Models: High-Level Representation Misdirection (HiRM) enables concept erasure in text-to-image models by localizing updates to early self-attention layers of the text encoder and steering target-concept representations toward random or semantic vectors. This allows removal of visual concepts (e.g. nudity, style, object) with minimal collateral effect and synergizes with denoising-based erasure (Lee et al., 23 Feb 2026).
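The gradient projection mentioned in the vertical federated bullet above can be illustrated with a generic conflict-projection rule. This is an assumed, PCGrad-style stand-in and not the specific REMISVFU procedure: when the forget and retain gradients point in opposing directions, the forget gradient's conflicting component is removed.

```python
import numpy as np

def project_conflict(g_forget, g_retain):
    """Drop the part of g_forget that opposes g_retain (no-op if they agree)."""
    dot = g_forget @ g_retain
    if dot >= 0:
        return g_forget
    # Subtract the projection of g_forget onto g_retain.
    return g_forget - (dot / (g_retain @ g_retain)) * g_retain

g_retain = np.array([0.0, 1.0])
g_conflict = np.array([1.0, -1.0])   # negative dot product with g_retain
g_agree = np.array([1.0, 1.0])       # positive dot product with g_retain

projected = project_conflict(g_conflict, g_retain)
unchanged = project_conflict(g_agree, g_retain)
```

After projection the forget update is orthogonal to the retain gradient, so to first order it no longer increases the retain loss.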
6. Applications, Behavioral Control, and Theoretical Insights
RMU is notable not only for selective unlearning, but also for fine-grained control of high-level behaviors:
- Additive or ablative RMU along concept vectors can modulate truthfulness (BLEU, ROUGE, MC accuracy), sentiment (SST-2), refusal behavior, and even boost in-context learning accuracy from near zero to >70% (Dang et al., 29 Jan 2026).
- Behavioral control is modular: targeted steering amplifies desired capabilities or instills safety-oriented refusal; the same mechanism can, if misused, implement hidden backdoors via linear concept directions (Dang et al., 29 Jan 2026).
- Theoretical analyses reveal that RMU reduces token confidence by randomizing pre-final activations, thereby lowering the model’s ability to generate correct outputs for targeted knowledge and increasing robustness against gradient-based attacks (Huu-Tien et al., 2024).
7. Open Problems and Prospects
Several active research challenges persist:
- Irreversibility and Certification: Existing RMU strategies lack formal guarantees of inaccessibility after unlearning. Research is ongoing toward adversarially robust, weight-level erasure and certified unlearning—especially under adaptive attacks (Łucki et al., 2024, Gao et al., 9 Apr 2026).
- Representation-level erasure: Most RMU methods change mainly output-level behavior, while representation-level measures such as CKA, mutual information (MI), and linear-probe accuracy remain high for the forgotten content. Approaches incorporating multi-layer contrastive, supervised, or class-mean features are under investigation (Lee et al., 5 Feb 2026, Gao et al., 9 Apr 2026).
- Generalization to complex tasks: Extensions to chain-of-thought reasoning (R²MU), multimodal T2I erasure, and continual/groupwise unlearning represent open avenues (Wang et al., 15 Jun 2025, Lee et al., 23 Feb 2026, Yu et al., 25 Apr 2026).
- Parameter-efficient and federated protocols: Gradient alignment, sparse updates, and federated variants (e.g., projection and anchor methods) aim to minimize retraining time and disruption to distributed or edge systems (Wu et al., 11 Dec 2025, Yu et al., 25 Apr 2026).
RMU and its derivatives have established a new paradigm for targeted, modular, and semantically structured unlearning in foundation models. Their dual nature—as both a tool for rapid behavior shaping and a locus of new security risks—underscores the need for fine-grained, adversarially robust, and theoretically grounded approaches as the field evolves (Dang et al., 29 Jan 2026).