In-Context Representation Hijacking
- In-context representation hijacking is an attack method that manipulates deep, latent model representations, overriding user or system intents.
- It employs techniques like token substitution, adversarial prefix/suffix injection, and gradient-based prompt manipulation to bypass standard defenses.
- Defensive strategies include layer-wise semantic monitoring, privileged token tagging, and adversarial training to secure model safety and alignment.
In-context representation hijacking refers to a family of attacks and systemic vulnerabilities in LLMs, vision-LLMs (VLMs), and multi-modal systems, in which untrusted sequences, prefix segments, or adversarially crafted context elements—inserted at inference—induce the model to form internal representations that reflect attacker-specified meanings or instructions, in preference to the intended (user- or system-driven) behavior. This attack surface is notably distinct from standard jailbreaks, as it operates not by overt surface-level prompts, but by manipulating the latent representations constructed throughout the model’s depth, often bypassing input-layer or keyword-based defenses and impacting core model alignment and safety mechanisms.
1. Formal Definitions and Mechanisms
In the context of LLM systems, an in-context representation hijack occurs when an attacker constructs a composite input (context) such that the model’s hidden representations—across network layers—are biased, overwritten, or steered to encode the attacker’s semantic intent, even when explicit tokens corresponding to this intent are absent. Formally, for a model , input tokens , and hidden states at layer and token index , a successful hijack ensures
where is the desired (malicious) representation, even though does not contain the target tokens. The attack may work by substitution, adversarial suffixes, prefix injection, or structural context manipulations, as shown in “Doublespeak” (Yona et al., 3 Dec 2025), adversarial ICL (Zhou et al., 2023), and backdoor attacks (Zhao et al., 11 Jan 2024).
In multi-level instruction hierarchies, as formalized in AIR (Kariyappa et al., 25 May 2025), tokens are annotated by discrete privilege levels, and an injected sequence of malicious privilege can compromise the in-context computation such that the model’s output aligns with the attacker’s instruction rather than the user’s instruction .
In vision and multi-modal systems, hijacks can operate at the semantic level (image tokens, context images/captions, or visual-linguistic prefixes) (Bailey et al., 2023, Jeong, 2023). Adversarial inputs may overwrite the model’s contextual fusion mechanisms and force the system into attacker-specified behaviors (e.g., outputting disinformation, exfiltrating content, or overriding task labels).
2. Attack Paradigms and Technical Strategies
Multiple technical pathways for in-context representation hijacking have been demonstrated:
Token Substitution and Semantic Drift: The “Doublespeak” attack operates by systematically substituting a harmful token with a benign token across in-context examples, causing the deep-layer embedding to converge toward (Yona et al., 3 Dec 2025). This semantic drift remains undetected by input-layer checks and only manifests after several transformer layers.
Adversarial In-Context Suffixes/Prefixes: Attacks on ICL frameworks (Zhou et al., 2023, Zhao et al., 11 Jan 2024) append optimization-derived, page-innocuous token sequences (suffixes) to demonstration examples. The model’s attention mechanisms are subverted so cross-attention overwhelmingly focuses on these “hijacker” tokens, redirecting the output to the attacker's target label . The “ICLAttack” formalism accommodates both demonstration-content poisoning and prompt-template layer corruption.
Gradient-Based Prompt Injection & Content Injection: In contextual document workflows (Lian et al., 25 Aug 2025), the attacker injects a phrase into benign data, ensuring that , with downstream self-attention treating as a high-trust instruction. In web agents and memory-oriented LLM agents, this extends to “plan injection”—modification of external memory or stored plans, leading to task-representation corruption and hijack (Patlan et al., 18 Jun 2025).
Multi-Modal and Visual Triggers: In VLMs and ViTs, a minuscule amount of poisoned contextual data or visual trigger (e.g., a pixel patch) conditions the model to execute the adversary’s intended mapping. Techniques such as image-based behaviour matching and trigger-based activation upon environmental cues demonstrate high attack success rates for label-flip, information leakage, or denial-of-service behaviors (Bailey et al., 2023, Abad et al., 6 Sep 2024, Liu et al., 6 Aug 2024).
Semantic Graph Transformation: Structured graph-based attacks encode attacker intent in AMR/RDF or JSON graph forms, bypassing explicit content filtering by leveraging the model’s inability to surface-inspect such representations. Conversion to code-generation tasks shows further elevation of attack success (He et al., 17 Apr 2025).
3. Empirical Characterization and Quantitative Results
Table: Selected Experimental Results on In-Context Representation Hijacking
| Setting & Attack | Model(s) | Notable ASR / Degradation |
|---|---|---|
| Doublespeak (sem. overwrite) | Llama-3.3-70B-Instruct | 75% (k=1); up to 92% in some cases |
| Adversarial ICL (gradient) | GPT2-XL, LLaMA-7B | ASR = 100% (2-8 shot, SST-2/AG News) |
| ICLAttack (clean-label backdoor) | OPT 1.3–66B, Falcon 180B | Mean ASR ≈ 95%; Clean Acc drop <1.5% |
| Prompt-in-content injection | Grok 3, DeepSeek R1, Kimi | Unblocked: All 4 variants, 100% |
| AIR (Augmented IH Signal) | Llama-3.2-3B, Qwen | ASR drop: 38% → 4.1% (9.2× cut) |
| ViT Backdoor (visual trigger) | ViT-L, 6-task MIM | Up to 13× drop, 89.9% on target |
| Agent Plan Injection (memory) | Agent-E, Browser-use | Privacy exfil ASR = 53.3% (context-chained), 46% (plan inj.) |
Interpretability and ablation studies consistently reveal that hijacking tokens or triggers dominate late-layer representations, with attention/embedding norms concentrating on the attacker’s injected content and clean-label examples producing minimal collateral utility degradation (Zhou et al., 2023, Yona et al., 3 Dec 2025, Kariyappa et al., 25 May 2025).
4. Defenses, Mitigation Strategies, and Limitations
Representation-Level Defenses: Input-layer filtering or delimiter-based privilege tagging is insufficient. Stronger defenses augment token representations at each transformer layer with privilege signals (e.g., AIR, using ) to maintain privilege separation through the model’s depth. AIR achieves 9.2× lower ASR than input-only mechanisms (Kariyappa et al., 25 May 2025).
Adversarial Training: Robustness to adversarial demonstration/trigger attacks can be enhanced by fine-tuning or pre-training with adversarially perturbed inputs, minimizing worst-case loss under constrained perturbations. This approach yields a pronounced reduction in targeted attack error with minimal compromise of clean performance (Anwar et al., 7 Nov 2024).
Prompt Provenance and Segment Embeddings: Contextual provenance tagging (distinct embeddings per provenance class: system, user, data) and structured APIs for prompt composition are central to preventing prompt-in-content hijacks (Lian et al., 25 Aug 2025).
Semantic Backflow and Embedding Consistency Checks: Proposed but not widely implemented are monitors for latent semantic drift (e.g., cosine exceeding a threshold at any layer), and pattern detectors on AMR/RDF graphs to catch graph-based semantic attacks (He et al., 17 Apr 2025, Yona et al., 3 Dec 2025).
Limitations: Robustness claims are model- and attack-specific; in some cases, transferability is poor across architectures, seeds, or even related problem domains (Anwar et al., 7 Nov 2024). Many proposed defenses may only be effective for single-turn or non-agentic deployments; multi-turn, agentic, or code-generating settings (where memory and execution plans are mutable context) remain open challenges (Patlan et al., 18 Jun 2025).
5. Broader Implications and Open Problems
Cross-Modality and Agentic Pipelines: Representation hijacks are observable beyond text—vision, speech, and code modalities all display analogous vulnerabilities. Multi-modal and agentic systems, with external memories and procedural plans, add attack surfaces via context manipulation outside the model’s immediate surface input (Jeong, 2023, Abad et al., 6 Sep 2024, Liu et al., 6 Aug 2024, Patlan et al., 18 Jun 2025).
Security and Safety Alignment: These results highlight a mismatch between shallow (surface-level) safety checks and deep (semantic) model alignment. Attacks persistently evade static blacklists, regular expression filters, and refusal triggers implemented at or near the input layer. This architecture-principled vulnerability demands the development of representation-aware safety systems that monitor all computed layers during the forward pass, enforce invariants across semantic transformations, and correlate context with expected provenance and intent.
Pathways Forward: Robust in-context learning demands (1) layer-wise semantic monitoring; (2) per-token, per-layer privilege enforcement; (3) adversarial and distributionally robust training; and (4) architectural audits of context-handling, especially in agentic and multi-modal models. Formal guarantees against in-context hijacking—e.g., via certified bounds or probabilistic risk metrics on representation shifts—are an active and as-yet unresolved research frontier.
6. Comparative Summary Across Modalities and Defenses
| Attack Structure | Modality | Mechanism | Key Defense |
|---|---|---|---|
| Prefix/Suffix Hijack | Text/ICL | Demo suffix, substitution | AIR, adv. training |
| Plan/Context Injection | Web, agents | Memory or plan tampering | Memory integrity, contextify |
| Visual Trigger/Reprogram | Vision/VLM | Patch/embedding, context ctrl | MIM-robus., prompt filtering |
| Graph-based Semantic | Code, LLMs | AMR/RDF/JSON transform | Cross-repr. consistency |
| Attention Hijackers | LVLMs | Instruction-driven attention | Salience masking (AID) |
The persistent theme is that in-context representation hijacking constitutes a genus of attacks targeting the model’s core ability to form, manipulate, and act on rich, multi-layer contextual representations. This alignment-agnostic channel—manifest in text, images, plans, or code graphs—necessitates a fundamental reexamination of “trust boundaries” and semantic provenance in large, autonomous learning systems.