Malicious Token Injection: Methods & Defenses
- Malicious Token Injection (MTI) is a set of adversarial techniques that covertly insert crafted tokens into systems to alter behavior, posing significant security risks.
- MTI exploits include invisible Unicode manipulations in source code, adversarial tokenization in LLMs, and targeted cache-side attacks to subvert system outputs.
- Countermeasures range from compiler defenses and layered security frameworks to probabilistic tokenization and cache robustness techniques, emphasizing integrated mitigation.
Malicious Token Injection (MTI) refers to a broad class of adversarial techniques in which an attacker strategically introduces specially crafted tokens into data, code, model contexts, or internal representations in order to subvert, hijack, or corrupt system behavior at the level of token processing. MTI has been demonstrated across domains including source code via invisible Unicode manipulations, prompt- and input-space attacks on LLMs, network protocol exploits, adversarial subword boundary manipulations, targeted cache corruption within transformers, and injection during data/model adaptation processes. Although techniques vary, the unifying feature of MTI is the covert introduction of tokens—sometimes visually indistinguishable or syntactically benign—that fundamentally alter model logic, data flow, execution results, or security posture.
1. Encoding Ambiguities and Source-/Input-Space Token Injection
A canonical manifestation of MTI is the "Trojan Source" attack (Boucher et al., 2021), wherein adversaries exploit Unicode bidirectional (Bidi) control characters—such as Left-to-Right Embedding (LRE, U+202A), Right-to-Left Override (RLO, U+202E), isolates (LRI, RLI), and terminators (PDI)—to reorder source code tokens logically while preserving visually innocuous structure. The attacker encodes these characters at carefully chosen places (within comments, string literals, or identifiers), so that human reviewers see innocuous code while compilers or interpreters parse a divergent token order. Examples include:
- Embedding an early
returnstatement within a Python docstring using Bidi controls, causing premature function exit invisible to reviewers. - Reordering multiline comments closure in C/C++ so conditionals are "commented out" visually but remain active.
- Injecting invisible or unbalanced Unicode tokens in string literals, thus causing string comparisons to fail and granting unauthorized access in JavaScript and related languages.
This attack generalizes to any context where encoding, display, and parsing are misaligned, establishing a foundational linkage between encoding-level ambiguities and MTI. These exploits are orthogonal to standard syntactic or semantic code review and undermine the assumption that the code as rendered to humans matches the input to compilers or run-time engines.
2. Tokenization Manipulation and Adversarial Token Boundary Attacks
MTI in modern neural models—including LLMs—includes attacks which leverage the non-uniqueness and multiplicity of valid tokenizations for a given input. Adversarial tokenization (Geh et al., 4 Mar 2025) demonstrates that for a given string (e.g., a harmful prompt), there exist exponentially many possible segmentations into subword tokens—each leading to potentially distinct model behaviors. Since LLM pipelines typically rely on a single, canonical (greedy) tokenization, post-training safety systems and alignment models are not robust to alternative tokenizations outside their observed distribution.
The AdvTok algorithm in (Geh et al., 4 Mar 2025) selects alternative tokenizations from the neighborhood of the canonical tokenization , optimizing for increased probability of generating an adversarial (harmful) response. Because surface string content is left unchanged, but token boundaries are shifted, this technique defeats pre-tokenization content filtering and increases the likelihood of bypassing safety classifiers and alignment shields.
Similar vulnerabilities are exploited in deployment scenarios where defenses use classification models tied to deterministic left-to-right tokenizers. As detailed in the TokenBreak attack (Schulz et al., 9 Jun 2025), prepending a single character to high-impact words causes token boundary fragmentation, thus attenuating detection confidence without altering the semantic content as interpreted by downstream LLMs or recipients. This attack is particularly effective against BPE and WordPiece tokenization stacks, whereas Unigram (SentencePiece) tokenizers demonstrate greater robustness.
3. Internal Model Manipulation: Cache-Side and Feature-Space Injection
Recent work conceptualizes MTI as an attack directly on internal transformer representations. (Hossain et al., 20 Oct 2025) presents a systematic framework for injecting malicious perturbations into the key–value (KV) cache used by transformer models to accelerate inference. Unlike prior approaches modifying inputs or model weights, this variant alters the transient attention memory by
- Adding Gaussian noise: , with
- Zeroing keys:
- Orthogonal rotations:
These corruptions can be precisely targeted by layer, timestep, and token position. Theoretical analysis relates cache perturbations to changes in unnormalized attention logits () and the induced shift in output distributions via softmax Lipschitz continuity (). Empirically, such attacks significantly increase the divergence between intended and actual model output distributions and undermine both standard generation and complex retrieval-augmented or agentic LLM pipelines.
4. Detection and Defense Mechanisms
A range of defense strategies against MTI have been developed:
- Compiler and Toolchain Defenses (Boucher et al., 2021): Disallowing or requiring balanced use of Unicode Bidi controls at compiler level; adding editor visualizations for invisible characters; scanning with regex or linters; build pipeline validations.
- Layered and Model-Integrated Defenses (Kokkula et al., 28 Oct 2024, Zou et al., 15 Oct 2025): Palisade and PIShield frameworks combine surface-level, ML-based, and intrinsic model feature-space detectors. PIShield leverages the residual stream vector at an “injection-critical” transformer layer to train a lightweight classifier (e.g., logistic regression), allowing for fast and highly effective inline MTI detection with near-zero false negatives and low computational overhead.
- Tokenization Robustness (Geh et al., 4 Mar 2025, Schulz et al., 9 Jun 2025): Employing probabilistic or non-deterministic tokenizers (Unigram/SentencePiece), translating Unigram tokens to model’s original vocabulary, or marginalizing safety scores over multiple tokenizations to reduce attack surface.
- Test-Time Defenses via Special Tokens (Chen et al., 10 Jul 2025): Deploying a small number of gradient-optimized DefensiveTokens prepended at inference, offering plug-and-play “switchable” security against prompt/token injection without permanent loss of model utility.
- Contrastive Learning and Influence Regularization (Zhang et al., 7 Apr 2025): SINCon regularizes node (token/message) influence in graph-based prediction problems, making targeted injection attacks less effective by uniformizing predictive leverage throughout the structure.
- Cache Robustness Mechanisms (Hossain et al., 20 Oct 2025): Cache Reset, Dropout Mask Randomization, and Attention Smoothing have been suggested as preliminary steps for mitigating cache-based MTI, although comprehensive coverage remains an open challenge.
5. Adversarial and Backdoor Code Injection
MTI includes backdoor data poisoning attacks, especially in code generation systems (Wu et al., 19 Aug 2024). Here, adversaries implant triggers (e.g., cue phrases indicating low user proficiency or specific flags) during pretraining/fine-tuning. When these triggers are present, the model adaptively inserts malicious code snippets or vulnerabilities—dynamically scaling the magnitude and visibility of injected code according to user characteristics. This type of MTI is validated via game-theoretic threat modeling, modified objective functions integrating both clean and backdoor data, and large-scale empirical testing (e.g., ASR 100% when trigger conditions are met).
6. Cross-Modal and Multimodal MTI
In agents integrating visual, textual, and external data streams, MTI generalizes to coordinated cross-modal prompt injection (Wang et al., 19 Apr 2025). Attacks like CrossInject optimize adversarial perturbations in both the visual embedding space (visual latent alignment) and the textual command space (textual guidance enhancement). By aligning the latent features of benign images with those representing malicious instructions, and using LLMs to craft guiding malicious commands, attackers increase the likelihood of agentic misbehavior (e.g., +26.4% attack success rate on autonomous agent tasks). The underlying principle holds: injecting “tokens”—whether visual feature perturbations or crafted textual directives—at vulnerable combination points in the agent pipeline, enables attack transfer across modalities.
7. Implications, Limitations, and Future Directions
MTI fundamentally undermines the assumption that faithfully rendered, tokenized, or cached representations are benign. Specific implications include:
- Supply Chain and Tooling Risks: Trojan Source demonstrates that invisible vulnerabilities can evade both human and automated code review, emphasizing the need for universal adoption of encoding-level and build/scan mitigations.
- Semantic and Syntactic Robustness: Adversarial tokenization and tokenizer-manipulation attacks show that securing only the model or only the tokenizer is insufficient; cross-component robustness is required.
- Efficiency and Detection Trade-offs: PIShield (Zou et al., 15 Oct 2025) shows low overhead and high detection rates are possible with model-internal, feature-based detectors, but questions remain about generalizing across architectures, languages, and adaptively optimized attacks.
- Emergent Attack Surfaces: Cache-based MTI (Hossain et al., 20 Oct 2025) brings attention to runtime, memory, and system-level vectors that have previously been unaddressed in LLM deployment strategies.
- Multi-Modal Integration: In agentic and RAG contexts, defenders must consider not only conventional input sanitization but also the cumulative effect of token injection through visual, textual, and latent data paths.
A plausible implication is that future robustness frameworks will need to integrate tokenization-invariant embeddings, dynamic runtime anomaly detection across all layers and modalities, and continuous model adaptation or defensive token insertion at inference. Standardization in encoding, validation, and protocol specification (e.g., as argued for DNS in (Jeitner et al., 2022)) is equally urgent. The ongoing expansion of transformer applications widens the attack surface for MTI, making systematic, layered, and cross-disciplinary defensive engineering a research priority.
This survey captures the multidimensional threat posed by malicious token injection—spanning encoding manipulation, adversarial tokenization, token-space and cache-level attacks, as well as their detection and mitigation. The landscape is evolving with increasing sophistication, motivating continued advances in secure model architectures, pipeline-level defenses, and rigorous empirical/theoretical analysis.