Representation Misdirection (RM)
- Representation misdirection (RM) is a technique that intentionally shifts internal representations away from natural meanings to control semantic, behavioral, and epistemic outputs.
- RM is applied in neural models for adversarial attacks, selective unlearning, and privacy enhancement, while in logic it models deceptive belief and observation updates.
- Targeted interventions in RM, using additive and ablative methods, achieve high success rates in controlled semantic shifts but also introduce vulnerabilities to recovery and backdoor exploits.
Representation misdirection (RM) is a class of techniques and formal apparatuses that operate by intentionally steering the internal representations—whether in artificial systems or epistemic agents—away from their natural meaning, such that downstream semantic, behavioral, or epistemic properties are controllably and selectively manipulated. In contemporary neural models, RM encompasses both adversarial attacks that hijack semantic trajectories in model layers and principled interventions for selective forgetting or knowledge suppression. In logic and multi-agent systems, RM is analyzed as an action or protocol yielding systematic misrepresentations in an agent’s beliefs or perceptions.
1. Formal Definitions and Core Principles
The concept of representation misdirection (RM) arises in both dynamic-epistemic logic and machine learning. In the formal epistemic framework, RM is defined as the intentional action of causing an agent or group of agents to form an inaccurate representation—of belief or observation—via verbal (e.g., lying) or visual (e.g., sleight of hand) means. The formal language introduces observation atoms, observation-based actions, and action models specifying how misdirection updates belief states and observability (Icard et al., 2024).
In neural machine learning, RM comprises methods that alter the latent representation of an input at a chosen hidden layer, typically to achieve one of the following: (1) induce the model to interpret a benign or alternative input as a disallowed/harmful one, bypassing surface-level safeguards; (2) erase knowledge or capabilities by steering forget-related activations toward or away from selected targets; (3) sever data-party associations in federated or split learning (Yona et al., 3 Dec 2025, Dang et al., 29 Jan 2026, Chen et al., 18 Dec 2025, Wu et al., 11 Dec 2025).
2. Mathematical and Algorithmic Frameworks
In modern LLMs, RM operates by directly perturbing or collapsing the hidden activations:
- In-Context RM (Doublespeak attack): Repeated in-context substitution of a harmful token with a benign surrogate causes the internal contextual vector of the benign token to morph toward that of the harmful token in deeper layers. Similarity is quantified by the layer-wise cosine similarity between the two hidden states $h_\ell^{\text{benign}}$ and $h_\ell^{\text{harmful}}$:

$$\cos\big(h_\ell^{\text{benign}}, h_\ell^{\text{harmful}}\big) = \frac{\langle h_\ell^{\text{benign}}, h_\ell^{\text{harmful}}\rangle}{\lVert h_\ell^{\text{benign}}\rVert\,\lVert h_\ell^{\text{harmful}}\rVert}$$
Under attack, this metric rises from ≈0.1 in early layers to >0.8 in deep layers, so that blocked semantics are reintroduced through innocent tokens (Yona et al., 3 Dec 2025).
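The layer-wise convergence described above can be illustrated with a minimal numpy sketch; the per-layer hidden states and the linear mixing schedule are synthetic stand-ins for real transformer activations, not the actual attack pipeline.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
n_layers, d = 24, 64
# Synthetic per-layer hidden states for the benign surrogate token and
# the harmful concept it has been aliased to (shape: [n_layers, d]).
h_benign = rng.normal(size=(n_layers, d))
h_harmful = rng.normal(size=(n_layers, d))

# Mimic the attack's effect: deeper layers drift toward the harmful vector.
for layer in range(n_layers):
    mix = layer / (n_layers - 1)      # 0 at layer 0, 1 at the last layer
    h_benign[layer] = (1 - mix) * h_benign[layer] + mix * h_harmful[layer]

sims = [cosine(h_benign[l], h_harmful[l]) for l in range(n_layers)]
# Similarity is near 0 in early layers and approaches 1 in deep layers,
# matching the reported rise from ~0.1 to >0.8 under attack.
```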
- Target-Vector-Based Unlearning: Here, RM selects a one-dimensional direction or concept vector $v$, derived by fitting a logistic regression in the latent space separating forget- and retain-set activations. Interventions are either additive ($h \leftarrow h + \alpha v$) or ablative ($h \leftarrow h - (v^\top h)\,v$ for unit-norm $v$). This linear geometry is supported by the linear representation hypothesis, which asserts that logit odds for high-level concepts are linear in hidden activations (Dang et al., 29 Jan 2026).
- Federated/Vertical RM: In split architectures, RM collapses all features of a forget-party onto a single randomly chosen, fixed unit-norm anchor vector, decoupling the statistical link between private features and joint outputs (Wu et al., 11 Dec 2025).
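The collapse can be sketched as replacing the forget-party's encoder with a constant map; the anchor, dimensions, and encoder interface here are illustrative, not the protocol's actual API:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32
# Fixed random unit-norm anchor, chosen once at unlearning time.
anchor = rng.normal(size=d)
anchor /= np.linalg.norm(anchor)

def collapsed_encoder(x):
    """Forget-party encoder after RM: every input maps to the same anchor,
    severing the statistical link between its features and joint outputs."""
    return anchor

x1, x2 = rng.normal(size=d), rng.normal(size=d)
z1, z2 = collapsed_encoder(x1), collapsed_encoder(x2)
# Identical outputs regardless of input: the embedding carries no
# information about the party's private features.
```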
- Selective RM in Entangled Representations: For settings where forget and retain distributions overlap, RM is made feature-selective via an importance map $m$ and direction vector $v$, so that only the dimensions most implicated in the forget task are perturbed, e.g., $h \leftarrow h + \alpha\,(m \odot v)$ (Chen et al., 18 Dec 2025).
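A hedged numpy sketch of a feature-selective update, assuming a hard top-k mask built from the importance scores (the actual SRMU masking rule may differ):

```python
import numpy as np

def selective_rm(h, importance, v, alpha=4.0, top_k=4):
    """Perturb only the top_k most forget-relevant dimensions.
    `importance` scores each dimension; `v` is the steering direction."""
    mask = np.zeros_like(h)
    mask[np.argsort(importance)[-top_k:]] = 1.0  # keep top_k dimensions
    return h + alpha * mask * v

rng = np.random.default_rng(3)
d = 12
h = rng.normal(size=d)               # an entangled activation
importance = rng.random(size=d)      # stand-in for a dynamic importance map
v = rng.normal(size=d)               # stand-in for the steering direction

h_new = selective_rm(h, importance, v)
# All dimensions outside the top-4 importance set are left untouched,
# localizing the perturbation to forget-relevant features.
```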
3. Applications: Attacks, Unlearning, and Behavioral Steering
Neural Adversarial Attacks
The Doublespeak attack demonstrates RM as an attack on LLMs’ in-context alignment defenses. By systematically replacing a harmful string with a benign token across in-context examples, semantic convergence is induced at the representation level, fooling refusal and safety filtering systems that only monitor surface tokens or early-layer activations. Attack success rates (ASR) exceed 70% on modern LLMs even with minimal context (Yona et al., 3 Dec 2025).
Machine Unlearning and Privacy
RM forms a foundation for efficient machine unlearning by acting in representation space rather than parameter space. By steering representations of forget-set examples toward (additive) or away from (ablative) high-level concept vectors, models robustly suppress performance on prescribed inputs while minimally degrading retain-set utility. Recent works empirically validate controllable side effects—truthfulness boosting, sentiment control, refusal adjustment—arising from such interventions (Dang et al., 29 Jan 2026).
In federated/split learning, RM underlies protocols for client-level unlearning: collapsing a party's encoder outputs onto a random constant anchor preserves privacy, suppresses backdoor attack success, and limits utility loss to ≈2.5 percentage points, outperforming retraining-based and gradient-ascent baselines (Wu et al., 11 Dec 2025).
Selective Knowledge Removal
Under entangled conditions, only targeted activation dimensions are perturbed, as in SRMU: dynamic importance maps localize forgetting to features critical to the forget-set while preserving capabilities on retain-set domains. This enables robust safety-driven knowledge removal while preserving baseline accuracy at forget–retain entanglement levels of up to 30% (Chen et al., 18 Dec 2025).
4. Formal Logic of Representation Misdirection
In dynamic epistemic logic, RM is axiomatically formalized as event/action models updating an agent’s belief and observation structure. The action language introduces:
- Verbal RM: Lie-type actions update an agent’s belief without changing ground truth.
- Visual RM: Actions modify observation atoms, producing false observations (e.g., the classic “French Drop” sleight-of-hand coin trick).
- Mixed/Compound RM: Sequences of verbal and visual steps systematically update an agent’s knowledge base, supporting both simple and higher-order misdirection (representing another’s beliefs about observations) (Icard et al., 2024).
A complete axiom system underpins these models, supporting reductions, compositional reasoning, and extensions to rich multi-agent, first-order, or action-planning settings.
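As a toy illustration (not the actual Icard et al. formalism), verbal RM can be modeled as a belief update that leaves ground truth untouched; the atom name and dictionary encoding are purely illustrative:

```python
def lie(beliefs, atom, claimed_value):
    """An agent announces `atom = claimed_value`; a trusting listener
    updates its belief state, while the world itself is unchanged."""
    updated = dict(beliefs)
    updated[atom] = claimed_value
    return updated

world = {"coin_in_left_hand": True}       # ground truth
listener = {"coin_in_left_hand": True}    # initially accurate belief

# Verbal misdirection: the magician claims the coin has moved.
listener = lie(listener, "coin_in_left_hand", False)
# The world is unchanged, but the listener's representation is now false —
# the defining signature of verbal RM.
```

Visual RM would analogously update observation atoms rather than belief atoms, and compound RM chains both kinds of action.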
5. Evaluation, Empirical Results, and Benchmarks
Empirical evaluation of RM techniques spans attack, forgetting, and privacy settings:
- Attack Benchmarks: In-context RM (Doublespeak) yields attack success rates of 74–92% (open-source LLMs) and 16–31% (closed-source APIs), measured via refusal and harmful-response judges (StrongReject) (Yona et al., 3 Dec 2025).
- Unlearning Benchmarks: Linear and selective RM methods achieve targeted WMDP forgetting (WMDP-Avg accuracy reduced to 27.2%) while retaining state-of-the-art MMLU generality (57.1%) (SRMU and related baselines). Under entanglement, SRMU consistently outperforms logit-level or distillation baselines (Chen et al., 18 Dec 2025).
- Federated Settings: RM-based federated unlearning (ReMisVFU) approaches retrain-level backdoor suppression and clean accuracy, outperforming standard and knowledge-distillation baselines by significant margins at reduced runtime (Wu et al., 11 Dec 2025).
- Behavioral Control: RM enables not just forgetting but controllable model steering (truthfulness, sentiment, refusal, in-context learning capacity) (Dang et al., 29 Jan 2026).
6. Limitations, Risks, and Prospective Defenses
- Adversarial Risks: RM exposes vulnerabilities to hidden backdoors, stealthy jailbreaks, and the possibility that suppressed knowledge can be almost fully recovered via finetuning or representation manipulation.
- Limits of Defenses: Current LLM alignment methods, focused on surface-level tokens or static refusal axes, are ineffective against representation hijacking; time-of-check vs. time-of-use issues abound when semantic shifts take place only in late layers (Yona et al., 3 Dec 2025).
- Mitigation Strategies: Defense recommendations include multi-layer representation-level monitoring, contrastive anchoring to immunize against semantic aliasing, and dynamic, depth-wise refusal axes.
- Mechanism Versatility: RM enables transparent, certifiable behavior control and targeted capability amplification, but exposes systems to new classes of epistemic and behavioral side effects (Chen et al., 18 Dec 2025, Yona et al., 3 Dec 2025, Dang et al., 29 Jan 2026).
7. Scope Across Domains and Generalizations
RM’s conceptual foundations are agnostic to substrate: it can steer artificial semantic trajectories, modify epistemic states of agents in multi-agent systems, and even model biological or visual misdirection (as in military camouflage or magic trick analysis) (Icard et al., 2024).
In logic, RM provides a unifying account for both verbal and non-verbal deception, with extensible axiomatics and expressive event models. In machine learning, the approach is being generalized to multi-layer, attention-based, and adversarially robust unlearning or alignment procedures, with applications in privacy, safety, and controllable NLP model governance (Wu et al., 11 Dec 2025, Chen et al., 18 Dec 2025, Dang et al., 29 Jan 2026).
Key References:
- In-context representation hijacking and Doublespeak: (Yona et al., 3 Dec 2025)
- Linear geometry and side-effects in RM/unlearning: (Dang et al., 29 Jan 2026)
- Feature-selective RM for entangled forgetting: (Chen et al., 18 Dec 2025)
- Federated RM and client-level unlearning: (Wu et al., 11 Dec 2025)
- Dynamic logic of RM and epistemic modeling: (Icard et al., 2024)