Token Mapping Perturbation Attack (TOMPA)
- TOMPA is an adversarial attack that exploits token mapping by applying subtle perturbations to disrupt tokenization in NLP and RLHF pipelines.
- It employs carefully designed input modifications and latent cache injections to exploit vulnerabilities in tokenizers like BPE and WordPiece, leading to misclassification and reward anomalies.
- Experimental evidence shows TOMPA can drop performance metrics significantly, with up to 82.9% F1 decrease, emphasizing the need for robust defensive strategies.
Token Mapping Perturbation Attack (TOMPA) refers to a class of adversarial attacks that exploit the mapping from natural-language text to the token sequences a model actually consumes. The goal is to subvert neural classifiers, reward models, or, more generally, any processing pipeline that depends on non-trivial tokenization or encoding. TOMPA manipulates the discrete token input space in one of two ways: through carefully crafted input-level perturbations that cause tokenization misalignment in text-processing models, or, in more general settings, by directly optimizing token sequences over the target model’s vocabulary. TOMPA exposes vulnerabilities in modern NLP and RLHF systems by reliably inducing misclassifications, reward inflation, or output corruption, all while evading detection at the semantic or natural-language interface (Schulz et al., 9 Jun 2025, Zhang et al., 3 Apr 2026, Hossain et al., 20 Oct 2025).
1. Formal Definitions and Threat Models
TOMPA encompasses both input-level and latent-level (token, embedding, or cache-based) perturbations, formalized as follows:
- Tokenization-based TOMPA: For text classifiers using a tokenizer $T$, an adversary subtly perturbs an input $x$ to produce $x'$ so that the token sequence $T(x')$ differs from $T(x)$ in a minimally invasive but attacker-controlled way. The goal is to induce a targeted misclassification while preserving human interpretability and semantic fidelity (Schulz et al., 9 Jun 2025).
- Token-to-token Mapping TOMPA: In RLHF pipelines, the attack operates at the interface between a policy’s output tokens and the downstream reward model’s input tokens. By manipulating the token mapping function $\phi$, the adversary produces candidate token sequences that are non-linguistic yet maximize the reward model’s score $R$, thereby decoupling model reward from natural language quality (Zhang et al., 3 Apr 2026).
- Cache-side TOMPA (Malicious Token Injection): In transformer models with key-value (KV) caching, TOMPA (also known as Malicious Token Injection) alters internal token representations (keys) at specific layers or timesteps, introducing perturbations $\Delta$ so that the cached keys become $\tilde{K} = K + \Delta$. This systematically shifts attention scores and output distributions (Hossain et al., 20 Oct 2025).
The adversary is assumed to possess white-box or gray-box access: they can query the model’s outputs and confidence scores but do not modify model parameters.
2. Attack Methodologies and Algorithms
2.1 Text Tokenization Perturbation
The TokenBreak algorithm exemplifies input-level TOMPA for text classifiers. For an input $x$ made up of words $w_1, \dots, w_n$, let $T(x)$ denote its token sequence. A perturbation $\delta$ is constructed as a sparse set of single-character prefixes to input words, designed so that the perturbed input $x' = x \oplus \delta$ triggers misclassification. Algorithmically:
- Evaluate the impact of all candidate single-character prefixes for each word $w_i$; select the prefix $p_i$ that maximally reduces the classifier’s confidence on the original (positive) label.
- Iteratively graft the highest-impact prefixes up to a budget $B$ or until an attack threshold is reached.
This process leverages the structure of merge-based tokenizers (BPE, WordPiece), where such small prefixes often result in token splits that obfuscate toxic or otherwise flagged subwords, bypassing detectors (Schulz et al., 9 Jun 2025).
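The greedy search described above can be sketched as follows. This is a minimal illustration, not the TokenBreak implementation: the function name `greedy_prefix_attack`, the `budget` and `threshold` parameters, and the blocklist-based `toy_score` are all assumptions introduced here, and the toy scorer matches whole words rather than running a real subword tokenizer.

```python
import string
from typing import Callable, List, Tuple


def greedy_prefix_attack(
    words: List[str],
    score: Callable[[List[str]], float],
    budget: int = 2,
    threshold: float = 0.5,
    alphabet: str = string.ascii_letters,
) -> Tuple[List[str], float]:
    """Greedily prepend single characters to words to lower score(words).

    `score` is a black-box confidence for the flagged label (higher = flagged).
    """
    current = list(words)
    current_score = score(current)
    modified = set()  # indices that already carry a prefix
    for _ in range(budget):
        if current_score < threshold:
            break  # attack threshold reached
        best = None  # (score, word index, prefix character)
        for i, word in enumerate(current):
            if i in modified:
                continue
            for ch in alphabet:
                trial = current[:i] + [ch + word] + current[i + 1:]
                s = score(trial)
                if best is None or s < best[0]:
                    best = (s, i, ch)
        if best is None or best[0] >= current_score:
            break  # no remaining prefix lowers the score
        current_score, i, ch = best
        current[i] = ch + current[i]
        modified.add(i)
    return current, current_score


# Hypothetical stand-in classifier: flags exact blocklist hits only.
BLOCKLIST = {"free", "winner"}


def toy_score(words: List[str]) -> float:
    hits = sum(1 for w in words if w.lower() in BLOCKLIST)
    return min(1.0, hits / 2)


adversarial, final_score = greedy_prefix_attack(["free", "prize", "winner"], toy_score)
print(adversarial, final_score)
```

Each round costs one classifier query per (word, character) pair, so the query count grows with the input length, the alphabet size, and the budget. Against a real classifier, `score` would wrap the model's confidence on the flagged label.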
2.2 Direct Token-Space Optimization
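TOMPA in RLHF settings eliminates the decode–re-tokenize interface by mapping the output tokens from the attack policy $\pi_\theta$ directly onto the reward model vocabulary $\mathcal{V}_{\mathrm{RM}}$, via a coordinate-wise identity or a compatible mapping $\phi$. The core objective is:
$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ R\big(x, \phi(y)\big) \big]$$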
Policy optimization that requires no gradients from the reward model (e.g., Group Relative Policy Optimization, GRPO) is used to maximize expected reward under only scalar feedback, uncovering non-linguistic, repetitive token sequences that yield surges in reward despite being semantically degenerate (Zhang et al., 3 Apr 2026).
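To make the idea of optimizing in token space from scalar feedback concrete, here is a deliberately small sketch. It is not GRPO as used in the cited work: it is a plain score-function update with group-normalized advantages, without clipping or a KL term. The reward function `toy_reward`, with its spurious preference for token ID 7, and all sizes and hyperparameters are hypothetical.

```python
import math
import random
from typing import List

VOCAB, LENGTH, GROUP = 8, 6, 16  # toy sizes (assumptions)


def toy_reward(ids: List[int]) -> float:
    """Hypothetical black-box reward with a spurious preference for token 7."""
    return float(sum(1 for t in ids if t == 7))


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]


def train(steps: int = 200, lr: float = 0.5, seed: int = 0) -> List[List[float]]:
    """Fit one categorical distribution per position using only scalar rewards."""
    rng = random.Random(seed)
    logits = [[0.0] * VOCAB for _ in range(LENGTH)]
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample a group of token-ID sequences directly; no text is decoded.
        group = [[rng.choices(range(VOCAB), weights=p)[0] for p in probs]
                 for _ in range(GROUP)]
        rewards = [toy_reward(ids) for ids in group]
        mean = sum(rewards) / GROUP
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / GROUP)
        for ids, r in zip(group, rewards):
            adv = (r - mean) / (std + 1e-6)  # group-relative advantage
            for pos, tok in enumerate(ids):
                for v in range(VOCAB):
                    # Gradient of log-softmax w.r.t. logit v: onehot - prob.
                    grad = (1.0 if v == tok else 0.0) - probs[pos][v]
                    logits[pos][v] += lr * adv * grad / GROUP
    return logits


logits = train()
greedy = [max(range(VOCAB), key=lambda v: row[v]) for row in logits]
print(greedy, toy_reward(greedy))
```

The policy never sees the reward function's internals, only one number per sampled sequence, which is what lets this kind of search drift toward repetitive, non-linguistic sequences whenever the reward has an exploitable bias.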
2.3 Cache-Side Perturbations
In attention-based models, TOMPA variants act directly on the cached key vectors. The attack applies at selected layers/heads and timesteps, using one of several mechanisms:
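- Additive Gaussian noise: $\tilde{K} = K + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
- Zeroing: with probability $p$, $\tilde{k}_t = 0$ (cache erasure).
- Orthogonal rotation: $\tilde{K} = K Q$, for a rotation $Q$ with $Q^\top Q = I$.
At each step, keys are perturbed ($K \mapsto \tilde{K}$) at selected positions, and decoding continues with the corrupted cache (Hossain et al., 20 Oct 2025).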
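The three mechanisms can be illustrated on a tiny list-of-lists "cache" in pure Python. This is a sketch under simplifying assumptions: a single head, keys stored as plain float lists, and a rotation restricted to the plane of the first two coordinates. The function names are illustrative and do not come from the cited work.

```python
import math
import random
from typing import List

Vec = List[float]


def attention_weights(q: Vec, keys: List[Vec]) -> List[float]:
    """Scaled dot-product attention weights of one query over cached keys."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def add_noise(keys: List[Vec], sigma: float, rng: random.Random) -> List[Vec]:
    """Additive Gaussian noise on every key coordinate."""
    return [[x + rng.gauss(0.0, sigma) for x in k] for k in keys]


def zero_out(keys: List[Vec], p: float, rng: random.Random) -> List[Vec]:
    """Erase each cached key independently with probability p."""
    return [[0.0] * len(k) if rng.random() < p else list(k) for k in keys]


def rotate(keys: List[Vec], theta: float) -> List[Vec]:
    """Orthogonal (Givens) rotation in the plane of the first two coordinates."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * k[0] - s * k[1], s * k[0] + c * k[1]] + list(k[2:]) for k in keys]


query = [1.0, 0.0, 0.0, 0.0]
cache = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
rng = random.Random(0)

clean = attention_weights(query, cache)
rotated = attention_weights(query, rotate(cache, math.pi / 2))
erased = attention_weights(query, zero_out(cache, 1.0, rng))
noisy = attention_weights(query, add_noise(cache, 0.1, rng))
print(clean, rotated, erased, noisy, sep="\n")
```

In this toy setup the rotation leaves every key's norm unchanged yet moves attention mass away from the first position, which shows why norm-based sanity checks alone would not catch that variant.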
3. Impact of Tokenizer Architectures, Model Families, and Attack Surface
TOMPA success is tightly coupled to tokenizer architecture:
- BPE and WordPiece tokenizers, which perform left-to-right merges and subword selection, are acutely vulnerable: single-character prefix perturbations can force toxic or suspicious words to split into innocuous subword tokens. For example, prefixing “fucking” with “I” transforms tokenization from [“Ġfucking”] to [“ĠIf”, “ucking”], bypassing toxicity filters (Schulz et al., 9 Jun 2025).
- Unigram tokenizers (e.g., the Unigram model in SentencePiece) segment each input globally by choosing the highest-probability segmentation, which makes them far more robust to minimal prefix perturbations; measured attack success rates drop to 0.00% across the prompt injection, spam, and toxicity benchmarks.
For reward models and RLHF, TOMPA demonstrates that current RMs can be manipulated into assigning high rewards to meaningless outputs: raw patterns of token IDs, which may not decode to coherent language, systematically elicit higher rewards than reference gold answers (e.g., for Skywork-Reward-V2-Llama-3.1-8B, mean reward nearly doubles and the beat rate reaches 98.0%) (Zhang et al., 3 Apr 2026).
Cache-side TOMPA shows that attention and output distributions in transformers can be steered within provable bounds. For instance, the change in softmax output is bounded linearly by the Frobenius norm of the injected perturbation. The resulting performance degradation is sharp (e.g., relative drops of up to 17% in SST-2 accuracy and 82.9% in SQuAD F1) (Hossain et al., 20 Oct 2025).
4. Quantitative Results and Experimental Evidence
4.1 Text Classifier Attack Rates
| Task | BPE (RoBERTa) | WordPiece (BERT/Distil) | Unigram (DeBERTa/XLM) |
|---|---|---|---|
| Prompt Injection | 2.09% | 11.90% | 0.00% |
| Spam | 4.28% | 78.93% | 0.00% |
| Toxicity | 25.26% | 76.05% | 0.00% |
| Mean | 10.54% | 55.62% | 0.00% |
TokenBreak achieves its highest attack rates on spam and toxicity detection with WordPiece models and has no measured effect on Unigram tokenizers (Schulz et al., 9 Jun 2025).
4.2 Reward Model Exploitation
| Target RM | Baseline (Gold) | Random OOD | TOMPA | TOMPA Beat Rate |
|---|---|---|---|---|
| Llama-3.1-8B | +17.48 | -3.42 | +33.64 | 98.0% |
| Qwen3-8B | +8.12 | -7.94 | +16.86 | 98.0% |
TOMPA sequences (random-looking but reward-maximizing) outperform gold answers on nearly all prompts (Zhang et al., 3 Apr 2026).
4.3 Cache-Side Degradation
Cache-side perturbation causes measurable performance drops even at small perturbation magnitudes: SST-2 accuracy drops from 91.0% to 75.5%, and SQuAD F1 drops from 77.4 to 13.3 (Hossain et al., 20 Oct 2025).
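The headline percentages quoted for cache-side attacks (17% and 82.9%) appear to be relative drops rather than absolute point differences. The short check below recomputes them from the before/after figures in this subsection, under that assumption; the helper name `relative_drop` is introduced here for illustration.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


sst2 = relative_drop(91.0, 75.5)   # SST-2 accuracy (percent)
squad = relative_drop(77.4, 13.3)  # SQuAD F1

print(f"SST-2 relative accuracy drop: {sst2:.1f}%")
print(f"SQuAD relative F1 drop: {squad:.1f}%")
```

The results land at roughly 17.0% and 82.8%. The small gap to the quoted 82.9% is presumably due to rounding in the reported scores.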
5. Defense Strategies and Limitations
5.1 Tokenizer Translation and Verification
A tokenizer translation defense inserts a robust Unigram segmentation layer upstream, translates the segmentation into the model’s tokenizer, and classifies the mapped sequence. This drops attack success (mean across tasks) from 33.09% to 12.63% in WordPiece/BPE models (Schulz et al., 9 Jun 2025). Model-free invariant checks can raise an alarm when the token count difference
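$$\Delta(x) = \bigl|\, |T_{\text{model}}(x)| - |T_{\text{ref}}(x)| \,\bigr|$$
between the model’s tokenizer $T_{\text{model}}$ and a reference tokenizer $T_{\text{ref}}$ exceeds a threshold $\tau$.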
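One possible reading of such a token-count check is sketched below. It assumes the difference is taken between two tokenizers applied to the same input. Both tokenizers are toy stand-ins (a longest-match subword splitter and a whitespace splitter), and the names `count_gap`, `is_suspicious`, and `tau` are illustrative rather than taken from the cited work.

```python
from typing import Callable, List, Set

Tokenizer = Callable[[str], List[str]]

TOY_VOCAB: Set[str] = {"win", "free", "prize"}  # hypothetical subword vocabulary


def greedy_subword_tokenize(text: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    """Toy longest-match-first subword tokenizer (WordPiece-like stand-in)."""
    tokens: List[str] = []
    for word in text.split():
        i = 0
        while i < len(word):
            j = len(word)
            while j > i + 1 and word[i:j] not in vocab:
                j -= 1
            tokens.append(word[i:j])  # falls back to a single character
            i = j
    return tokens


def whitespace_tokenize(text: str) -> List[str]:
    """Reference tokenizer: one token per whitespace-separated word."""
    return text.split()


def count_gap(text: str, tok_a: Tokenizer, tok_b: Tokenizer) -> int:
    """Absolute difference in token counts between two tokenizers."""
    return abs(len(tok_a(text)) - len(tok_b(text)))


def is_suspicious(text: str, tok_a: Tokenizer, tok_b: Tokenizer, tau: int) -> bool:
    return count_gap(text, tok_a, tok_b) > tau


clean_gap = count_gap("win free prize", greedy_subword_tokenize, whitespace_tokenize)
attack_gap = count_gap("win xfree prize", greedy_subword_tokenize, whitespace_tokenize)
print(clean_gap, attack_gap)
```

A check like this needs no model call, but benign out-of-vocabulary words also widen the gap, so the threshold would have to be calibrated on clean traffic.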
5.2 Robust RLHF Defenses
Potential mitigations for reward model TOMPA include adversarial fine-tuning of RMs on toxic/non-linguistic token perturbations, input sanitization to enforce decode–re-tokenize checks, token ID range validation (e.g., valid UTF-8), or reward gating by generation length (Zhang et al., 3 Apr 2026). However, model ensembles and length penalties may still be circumvented by token-space exploration.
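A decode–re-tokenize check of the kind listed above can be sketched with a three-piece toy vocabulary. The vocabulary, the greedy longest-match `tokenize`, and the name `is_canonical` are assumptions made for illustration. A real deployment would use the reward model's own tokenizer for both directions.

```python
from typing import Dict, List

# Hypothetical vocabulary in which "ab" can be spelled as [2] or as [0, 1].
ID_TO_PIECE: Dict[int, str] = {0: "a", 1: "b", 2: "ab"}
PIECE_TO_ID: Dict[str, int] = {p: i for i, p in ID_TO_PIECE.items()}


def decode(ids: List[int]) -> str:
    return "".join(ID_TO_PIECE[i] for i in ids)


def tokenize(text: str) -> List[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in PIECE_TO_ID:
                ids.append(PIECE_TO_ID[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return ids


def is_canonical(ids: List[int]) -> bool:
    """Accept only ID sequences that survive a decode -> re-tokenize round trip."""
    if any(i not in ID_TO_PIECE for i in ids):
        return False  # out-of-range token ID
    return tokenize(decode(ids)) == list(ids)


print(is_canonical([2, 0]), is_canonical([0, 1]), is_canonical([5]))
```

Sequences that the tokenizer would never produce from text, such as `[0, 1]` here, are rejected before scoring. This narrows the raw token space an attacker can explore, although it does not stop attacks that stay within canonical tokenizations.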
5.3 Cache Integrity Mechanisms
Defensive measures for cache-side TOMPA/Malicious Token Injection include:
- Periodic cache reset
- Dropout mask randomization
- Attention smoothing via temporal averaging of weights
- Active runtime detection: monitoring output KL divergence or per-token attention-variation against explicit theoretical thresholds
- Cryptographic integrity (MACs) over the KV cache (Hossain et al., 20 Oct 2025)
However, several countermeasures incur throughput penalties or degrade long-context coherence.
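As a rough illustration of the cryptographic-integrity option, the sketch below tags each cached key vector with an HMAC bound to its layer and timestep, and verifies the tag before the entry is reused. It assumes keys are available as plain Python floats and that a secret is held outside the attacker's reach. The function names and the packing format are choices made here, not part of the cited work.

```python
import hashlib
import hmac
import struct
from typing import List


def cache_tag(secret: bytes, layer: int, step: int, key_vec: List[float]) -> bytes:
    """HMAC-SHA256 over (layer, step, key vector), packed little-endian."""
    payload = struct.pack(f"<II{len(key_vec)}d", layer, step, *key_vec)
    return hmac.new(secret, payload, hashlib.sha256).digest()


def verify(secret: bytes, layer: int, step: int,
           key_vec: List[float], tag: bytes) -> bool:
    """Constant-time comparison of the stored tag against a fresh one."""
    return hmac.compare_digest(cache_tag(secret, layer, step, key_vec), tag)


secret = b"demo-secret"  # placeholder; a real key would come from a key store
key_vec = [0.1, -0.2, 0.3]
tag = cache_tag(secret, layer=3, step=17, key_vec=key_vec)

print(verify(secret, 3, 17, key_vec, tag))             # untouched entry
print(verify(secret, 3, 17, [0.1, -0.2, 0.3001], tag))  # perturbed entry
```

Binding the layer and timestep into the tag also catches entries that are swapped between positions. The cost is one hash per cache entry on every read, which is one source of the throughput penalty noted above.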
6. Broader Implications and Recommendations
TOMPA generalizes across neural architectures reliant on left-to-right merge or subword-based tokenization and demonstrates that both language-understanding and RLHF reward systems can be subverted through fundamentally non-semantic, discrete-token perturbations. The disconnect between natural language and raw token-space, especially in high-stakes or safety-critical applications, necessitates:
- Preferential adoption of Unigram or probabilistic global segmentation tokenizers
- Ensemble tokenization and majority-vote inference to counter tokenizer-specific attacks
- Integration of invariant and cross-tokenizer consistency checks as a first-class input validation stage
- Robust cache-integrity protocols and monitoring for online detection of internal-state corruption
These recommendations target both immediate hardening of deployed NLP and RLHF systems and long-term research toward cross-tokenizer invariance and end-to-end perturbation resilience (Schulz et al., 9 Jun 2025, Zhang et al., 3 Apr 2026, Hossain et al., 20 Oct 2025).
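The ensemble recommendation can be illustrated with a toy majority vote over classifiers that view the input at different granularities. The three classifiers below are hypothetical stand-ins (exact word match, substring match, and character trigrams), not real tokenizer-backed models. The point is only that a perturbation tuned against one view need not fool the others.

```python
from collections import Counter
from typing import Callable, Sequence

Classifier = Callable[[str], str]


def exact_word_clf(text: str) -> str:
    """Flags the input only if "free" appears as a whole word."""
    return "spam" if any(w == "free" for w in text.lower().split()) else "ham"


def substring_clf(text: str) -> str:
    """Flags the input if "free" appears anywhere, even inside a word."""
    return "spam" if "free" in text.lower() else "ham"


def trigram_clf(text: str) -> str:
    """Flags the input if the character trigrams of "free" are all present."""
    lowered = text.lower()
    grams = {lowered[i:i + 3] for i in range(len(lowered) - 2)}
    return "spam" if {"fre", "ree"} <= grams else "ham"


def majority_vote(text: str, classifiers: Sequence[Classifier]) -> str:
    """Return the label predicted by the most classifiers."""
    votes = Counter(clf(text) for clf in classifiers)
    label, _ = votes.most_common(1)[0]
    return label


ensemble = [exact_word_clf, substring_clf, trigram_clf]
print(exact_word_clf("xfree prize"), majority_vote("xfree prize", ensemble))
```

Using an odd number of classifiers with binary labels avoids ties. The trade-off is that each request costs several inference passes.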