
Token Mapping Perturbation Attack (TOMPA)

Updated 9 May 2026
  • TOMPA is an adversarial attack that exploits token mapping by applying subtle perturbations to disrupt tokenization in NLP and RLHF pipelines.
  • It employs carefully designed input modifications and latent cache injections to exploit vulnerabilities in tokenizers like BPE and WordPiece, leading to misclassification and reward anomalies.
  • Reported experiments show that TOMPA can degrade performance sharply, with up to an 82.9% relative F1 decrease on SQuAD, underscoring the need for robust defenses.

Token Mapping Perturbation Attack (TOMPA) refers to a class of adversarial attacks that exploit the mapping from natural-language text to model-understood token sequences to subvert the behaviors of neural classification systems, reward models, or, more generally, any processing pipeline that depends on non-trivial tokenization or encoding. TOMPA uniquely manipulates the discrete token input space—either by carefully crafted input-level perturbations that cause misalignment in tokenization for text-processing models or, in more general settings, by directly optimizing token sequences under the target model’s token vocabulary. TOMPA exposes vulnerabilities in modern NLP and RLHF systems by reliably inducing misclassifications, reward inflation, or output corruption, all while evading detection at the semantic or natural language interface (Schulz et al., 9 Jun 2025, Zhang et al., 3 Apr 2026, Hossain et al., 20 Oct 2025).

1. Formal Definitions and Threat Models

TOMPA encompasses both input-level and latent-level (token, embedding, or cache-based) perturbations, formalized as follows:

  • Tokenization-based TOMPA: For text classifiers using a tokenizer $T$, an adversary subtly perturbs an input $x$ to produce $x'$ so that the token sequence $T(x')$ differs from $T(x)$ in a minimally invasive but attacker-controlled way. The goal is to induce a targeted misclassification $f(T(x')) = y_{\text{target}}$ while preserving human interpretability and semantic fidelity (Schulz et al., 9 Jun 2025).
  • Token-to-token Mapping TOMPA: In RLHF pipelines, the attack operates at the interface between a policy’s output tokens and the downstream reward model’s input tokens. By manipulating the token mapping function $\Phi: V_\pi \to V_R$, the adversary produces candidate token sequences $o$ that are non-linguistic yet maximize $R_\phi(x, \Phi(o))$, thereby decoupling model reward from natural language quality (Zhang et al., 3 Apr 2026).
  • Cache-side TOMPA (Malicious Token Injection): In transformer models with key-value (KV) caching, the attack alters internal token representations (keys) at specific layers or timesteps, introducing perturbations $\delta_j$ so that each targeted cached key $k_j$ is replaced by $\tilde{k}_j = k_j + \delta_j$. This systematically shifts attention scores and output distributions (Hossain et al., 20 Oct 2025).

The adversary is assumed to possess white-box or gray-box access: they can query the model’s outputs and confidence scores but do not modify model parameters.

2. Attack Methodologies and Algorithms

2.1 Text Tokenization Perturbation

The TokenBreak algorithm exemplifies input-level TOMPA for text classifiers. For an input $x$ made up of words $w_1, \dots, w_n$, the perturbation is constructed as a sparse set of single-character prefixes to input words, designed so that the perturbed input $x'$ triggers misclassification. Algorithmically:

  • Evaluate the impact of all candidate single-character prefixes for each word $w_i$; select the prefix that maximally reduces the classifier’s confidence on the original (positive) label.
  • Iteratively graft the highest-impact prefixes until a perturbation budget is exhausted or an attack threshold is reached.

This process leverages the structure of merge-based tokenizers (BPE, WordPiece), where such small prefixes often result in token splits that obfuscate toxic or otherwise flagged subwords, bypassing detectors (Schulz et al., 9 Jun 2025).
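The greedy prefix search described above can be sketched as follows. This is a minimal illustration, not the TokenBreak implementation: `toy_tokenize`, `toy_score`, the tiny vocabulary, and the `budget` and `threshold` parameters are all hypothetical stand-ins for a real subword tokenizer and classifier.

```python
import string

# Hypothetical toy components (assumptions, not the real tokenizer or classifier).
VOCAB = {"free", "money", "now", "claim", "your"}
FLAGGED = {"free", "money"}

def toy_tokenize(text):
    """Known words stay whole; unknown words fall apart into characters."""
    tokens = []
    for word in text.lower().split():
        if word in VOCAB:
            tokens.append(word)
        else:
            tokens.extend(word)  # crude stand-in for a subword split
    return tokens

def toy_score(text):
    """Confidence of the 'flagged' label: flagged tokens per input word."""
    hits = sum(1 for t in toy_tokenize(text) if t in FLAGGED)
    return hits / max(1, len(text.split()))

def greedy_prefix_attack(text, budget=2, threshold=0.0):
    """Greedily add one single-character prefix per round, at most `budget` rounds."""
    words = text.split()
    used = set()  # each word receives at most one prefix
    for _ in range(budget):
        base = toy_score(" ".join(words))
        if base <= threshold:
            break  # attack threshold reached
        best = None  # (score, word index, prefix character)
        for i, w in enumerate(words):
            if i in used:
                continue
            for c in string.ascii_lowercase:
                trial = words[:i] + [c + w] + words[i + 1:]
                s = toy_score(" ".join(trial))
                if best is None or s < best[0]:
                    best = (s, i, c)
        if best is None or best[0] >= base:
            break  # no prefix lowers the confidence any further
        _, i, c = best
        words[i] = c + words[i]
        used.add(i)
    return " ".join(words)

attacked = greedy_prefix_attack("claim your free money now")
```

In this toy setting the prefixed words no longer match the vocabulary, so the flagged tokens disappear from the token sequence while the text stays readable to a human.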

2.2 Direct Token-Space Optimization

TOMPA in RLHF settings eliminates the decode–re-tokenize interface by mapping the output tokens from the attack policy $\pi$ directly onto the reward model vocabulary $V_R$, via a coordinate-wise identity or a compatible mapping $\Phi$. The core objective is:

$$\max_{\pi} \; \mathbb{E}_{o \sim \pi(\cdot \mid x)} \left[ R_\phi(x, \Phi(o)) \right]$$

Policy optimization that needs only scalar rewards and no reward-model gradients (e.g., Group Relative Policy Optimization, GRPO) is used to maximize expected reward, uncovering non-linguistic, repetitive token sequences that yield surges in reward despite being semantically degenerate (Zhang et al., 3 Apr 2026).
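To make the optimization loop concrete, here is a toy sketch of token-space reward maximization using only scalar feedback. It uses a GRPO-style group-normalized advantage with a plain REINFORCE update and omits the clipping and KL terms of the full algorithm. The reward function, vocabulary size, and hyperparameters are hypothetical; a real attack would query an actual reward model.

```python
import math
import random

VOCAB_SIZE, SEQ_LEN, GROUP, STEPS, LR = 8, 6, 16, 200, 0.5

def toy_reward(token_ids):
    """Hypothetical black-box reward that favors a degenerate repeated token (ID 7)."""
    return float(sum(1 for t in token_ids if t == 7))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs, rng):
    """Draw one token ID per position from independent categorical distributions."""
    return [rng.choices(range(VOCAB_SIZE), weights=row)[0] for row in probs]

def train(seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * VOCAB_SIZE for _ in range(SEQ_LEN)]
    for _ in range(STEPS):
        probs = [softmax(row) for row in logits]  # policy snapshot for this step
        group = [sample(probs, rng) for _ in range(GROUP)]
        rewards = [toy_reward(o) for o in group]  # scalar feedback only
        mean = sum(rewards) / GROUP
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / GROUP) or 1.0
        for o, r in zip(group, rewards):
            adv = (r - mean) / std  # group-relative advantage
            for t, tok in enumerate(o):
                for v in range(VOCAB_SIZE):
                    # Gradient of log-softmax with respect to logit v.
                    grad = (1.0 if v == tok else 0.0) - probs[t][v]
                    logits[t][v] += LR * adv * grad / GROUP
    return logits

trained = train()
greedy = [max(range(VOCAB_SIZE), key=lambda v: row[v]) for row in trained]
```

The policy never sees text, only token IDs and a scalar score, which is why the optimum it finds can be a repetitive sequence with no linguistic meaning.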

2.3 Cache-Side Perturbations

In attention-based models, TOMPA variants act directly on the cached key vectors. The attack applies at selected layers/heads and timesteps, using one of several mechanisms:

  • Additive Gaussian noise: $\tilde{k}_j = k_j + \delta_j$ with $\delta_j \sim \mathcal{N}(0, \sigma^2 I)$.
  • Zeroing: with probability $p$, $\tilde{k}_j = 0$ (cache erasure).
  • Orthogonal rotation: $\tilde{k}_j = R\,k_j$, for a rotation matrix $R$.

At each step, keys are perturbed ($k_j \to \tilde{k}_j$) at selected positions, and decoding continues with the corrupted cache (Hossain et al., 20 Oct 2025).
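The three perturbation mechanisms can be illustrated on a toy cache of key vectors. This is a sketch under stated assumptions: the keys are plain Python lists, the rotation acts only in the plane of the first two coordinates, and `attention_weights` is a single-head scaled dot-product softmax rather than a real model's attention.

```python
import math
import random

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over the cached keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def add_gaussian_noise(keys, sigma, rng):
    """Additive noise: each coordinate gets an independent N(0, sigma^2) offset."""
    return [[k + rng.gauss(0.0, sigma) for k in key] for key in keys]

def zero_keys(keys, p, rng):
    """Zeroing: each key is erased with probability p."""
    return [[0.0] * len(key) if rng.random() < p else list(key) for key in keys]

def rotate_keys(keys, theta):
    """Orthogonal rotation by theta in the plane of the first two coordinates."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * key[0] - s * key[1], s * key[0] + c * key[1]] + list(key[2:])
            for key in keys]

rng = random.Random(0)
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [2.0, 0.0, 0.0]

clean = attention_weights(query, keys)
rotated = attention_weights(query, rotate_keys(keys, math.pi / 2))
erased = attention_weights(query, zero_keys(keys, 1.0, rng))
noisy = attention_weights(query, add_gaussian_noise(keys, 0.1, rng))
```

Rotation preserves each key's norm yet still redirects attention, which suggests that a simple norm check on the cache would not catch every variant.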

3. Impact of Tokenizer Architectures, Model Families, and Attack Surface

TOMPA success is tightly coupled to tokenizer architecture:

  • BPE and WordPiece tokenizers, which perform left-to-right merges and subword selection, are acutely vulnerable: single-character prefix perturbations can force toxic or suspicious words to split into innocuous subword tokens. For example, prefixing “fucking” with “I” transforms tokenization from [“Ġfucking”] to [“ĠIf”, “ucking”], bypassing toxicity filters (Schulz et al., 9 Jun 2025).
  • Unigram tokenizers (e.g., the Unigram model in SentencePiece) segment each input holistically by choosing its highest-probability segmentation and proved robust against minimal prefix perturbations: reported attack success rates are 0.00% across the prompt injection, spam, and toxicity benchmarks.

For reward models and RLHF, TOMPA demonstrates that current RMs can be manipulated into highly rewarding but meaningless outputs: raw patterns of token IDs, which may not decode as coherent language, systematically elicit higher rewards than reference gold answers (e.g., for Skywork-Reward-V2-Llama-3.1-8B, mean reward is nearly doubled, and beat rate exceeds 98.0%) (Zhang et al., 3 Apr 2026).

Cache-side TOMPA shows that attention and output distributions in transformers can be steered within provable bounds. For instance, the change in the softmax output is bounded linearly by the Frobenius norm of the injected perturbations. The result is sharp performance degradation (e.g., a relative accuracy drop of up to 17% on SST-2 and a relative F1 drop of 82.9% on SQuAD) (Hossain et al., 20 Oct 2025).

4. Quantitative Results and Experimental Evidence

4.1 Text Classifier Attack Rates

| Task | BPE (RoBERTa) | WordPiece (BERT/Distil) | Unigram (DeBERTa/XLM) |
|---|---|---|---|
| Prompt Injection | 2.09% | 11.90% | 0.00% |
| Spam | 4.28% | 78.93% | 0.00% |
| Toxicity | 25.26% | 76.05% | 0.00% |
| Mean | 10.54% | 55.62% | 0.00% |

TokenBreak achieves its highest attack success rates on spam and toxicity detection with WordPiece models and has no effect on models that use Unigram tokenizers (Schulz et al., 9 Jun 2025).

4.2 Reward Model Exploitation

| Target RM | Baseline (Gold) | Random OOD | TOMPA | TOMPA Beat Rate |
|---|---|---|---|---|
| Llama-3.1-8B | +17.48 | -3.42 | +33.64 | 98.0% |
| Qwen3-8B | +8.12 | -7.94 | +16.86 | 98.0% |

TOMPA sequences (random-looking but reward-maximizing) outscore the gold answers on nearly all prompts (Zhang et al., 3 Apr 2026).

4.3 Cache-Side Degradation

Perturbing the KV cache causes measurable performance drops even for small-magnitude perturbations: SST-2 accuracy falls from 91.0% to 75.5%, and SQuAD F1 falls from 77.4 to 13.3 (Hossain et al., 20 Oct 2025).
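As a quick check on how these before and after scores relate to the percentage drops quoted elsewhere in this article, the relative drop can be computed directly. The helper name `relative_drop` is introduced here for illustration only.

```python
def relative_drop(before, after):
    """Percentage decrease relative to the clean score."""
    return 100.0 * (before - after) / before

sst2_drop = relative_drop(91.0, 75.5)   # SST-2 accuracy, 91.0% -> 75.5%
squad_drop = relative_drop(77.4, 13.3)  # SQuAD F1, 77.4 -> 13.3

print(f"SST-2: {sst2_drop:.1f}%  SQuAD: {squad_drop:.1f}%")
```

The SST-2 figure comes out at about 17.0% and the SQuAD figure at about 82.8%, which suggests the quoted 17% and 82.9% are relative rather than absolute drops; the small gap on SQuAD is presumably due to rounding in the reported scores.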

5. Defense Strategies and Limitations

5.1 Tokenizer Translation and Verification

A tokenizer translation defense inserts a robust Unigram segmentation layer upstream, translates the segmentation into the model’s tokenizer, and classifies the mapped sequence. This reduces mean attack success across tasks from 33.09% to 12.63% in WordPiece/BPE models (Schulz et al., 9 Jun 2025). Model-free invariant checks can additionally raise an alarm when the token count difference exceeds a threshold $\tau$.
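A token-count invariant check of this kind might look like the sketch below. The exact quantity being thresholded is not spelled out here, so the sketch assumes one plausible choice: the gap between the subword token count and the whitespace word count. The greedy longest-match tokenizer, the vocabulary, and the name `token_count_alarm` are hypothetical.

```python
def greedy_subword_tokenize(text, vocab):
    """WordPiece-like greedy longest-match; unknown characters become single tokens."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate until it is in the vocabulary or one character long.
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            tokens.append(word[start:end])
            start = end
    return tokens

def token_count_alarm(text, vocab, tau):
    """Alarm when the subword count deviates from the word count by more than tau."""
    n_tokens = len(greedy_subword_tokenize(text, vocab))
    n_words = len(text.split())
    return abs(n_tokens - n_words) > tau

vocab = {"free", "money", "claim", "your"}
clean_alarm = token_count_alarm("claim your free money", vocab, tau=1)
attack_alarm = token_count_alarm("claim your xfree xmoney", vocab, tau=1)
```

Such a check needs no model call, but the threshold has to tolerate the splits that rare yet legitimate words produce, so it is better treated as a signal than as a verdict.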

5.2 Robust RLHF Defenses

Potential mitigations for reward model TOMPA include adversarial fine-tuning of RMs on toxic/non-linguistic token perturbations, input sanitization to enforce decode–re-tokenize checks, token ID range validation (e.g., valid UTF-8), or reward gating by generation length (Zhang et al., 3 Apr 2026). However, model ensembles and length penalties may still be circumvented by token-space exploration.
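A decode and re-tokenize check combined with ID range validation can be sketched with a toy tokenizer. The four-entry vocabulary and the function names are hypothetical; a real deployment would use the reward model's own tokenizer.

```python
VOCAB = ["a", "b", "ab", "c"]  # hypothetical vocabulary; list index = token ID
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
MAX_TOKEN_LEN = max(len(tok) for tok in VOCAB)

def decode(ids):
    return "".join(VOCAB[i] for i in ids)

def encode(text):
    """Greedy longest-match encoding, standing in for the canonical tokenizer."""
    ids, pos = [], 0
    while pos < len(text):
        for length in range(min(MAX_TOKEN_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode {text[pos]!r}")
    return ids

def passes_roundtrip(ids):
    """Accept only in-range ID sequences that survive decode followed by encode."""
    if any(not (0 <= i < len(VOCAB)) for i in ids):
        return False  # token ID range validation
    return encode(decode(ids)) == list(ids)

canonical_ok = passes_roundtrip([2, 3])        # "ab" + "c"
noncanonical_ok = passes_roundtrip([0, 1, 3])  # "a" + "b" + "c", same text
out_of_range_ok = passes_roundtrip([9])
```

The second sequence decodes to the same text as the first but is not what the tokenizer would produce, so the check rejects it. That covers raw ID sequences with no canonical textual form, though not degenerate text that happens to tokenize canonically.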

5.3 Cache Integrity Mechanisms

Defensive measures for cache-side TOMPA/Malicious Token Injection include:

  • Periodic cache reset
  • Dropout mask randomization
  • Attention smoothing via temporal averaging of weights
  • Active runtime detection: monitoring output KL divergence or per-token attention-variation against explicit theoretical thresholds
  • Cryptographic integrity (MACs) over the KV cache (Hossain et al., 20 Oct 2025)

However, several countermeasures incur throughput penalties or degrade long-context coherence.
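The cryptographic integrity idea can be sketched with an HMAC tag per cache entry, using Python's standard `hmac` module. The entry layout (layer index, position, key vector as doubles) and the function names are assumptions for illustration; a real system would tag tensors in device memory and would have to manage the secret securely.

```python
import hashlib
import hmac
import struct

def cache_tag(secret, layer, position, key_vector):
    """HMAC-SHA256 over the layer index, position, and key values of one entry."""
    payload = struct.pack(f"<II{len(key_vector)}d", layer, position, *key_vector)
    return hmac.new(secret, payload, hashlib.sha256).digest()

def verify_entry(secret, layer, position, key_vector, tag):
    """Constant-time comparison of the stored tag against a recomputed one."""
    expected = cache_tag(secret, layer, position, key_vector)
    return hmac.compare_digest(expected, tag)

secret = b"hypothetical-secret"
key_vector = [0.1, 0.2, 0.3]
tag = cache_tag(secret, layer=4, position=17, key_vector=key_vector)

intact = verify_entry(secret, 4, 17, key_vector, tag)
tampered = verify_entry(secret, 4, 17, [0.1, 0.2, 0.3001], tag)
moved = verify_entry(secret, 4, 18, key_vector, tag)
```

Binding the layer and position into the tag also catches entries that are copied to a different slot, at the cost of one hash per verified entry on the decoding path.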

6. Broader Implications and Recommendations

TOMPA generalizes across neural architectures reliant on left-to-right merge or subword-based tokenization and demonstrates that both language-understanding and RLHF reward systems can be subverted through fundamentally non-semantic, discrete-token perturbations. The disconnect between natural language and raw token-space, especially in high-stakes or safety-critical applications, necessitates:

  • Preferential adoption of Unigram or probabilistic global segmentation tokenizers
  • Ensemble tokenization and majority-vote inference to counter tokenizer-specific attacks
  • Integration of invariant and cross-tokenizer consistency checks as a first-class input validation stage
  • Robust cache-integrity protocols and monitoring for online detection of internal-state corruption

These recommendations target both immediate hardening of deployed NLP and RLHF systems and long-term research toward cross-tokenizer invariance and end-to-end perturbation resilience (Schulz et al., 9 Jun 2025, Zhang et al., 3 Apr 2026, Hossain et al., 20 Oct 2025).
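The ensemble recommendation can be illustrated with a minimal majority vote over classifiers that segment the input differently. The three toy classifiers and the blocklist are hypothetical stand-ins for models built on different tokenizers.

```python
from collections import Counter

FLAGGED = {"free", "money"}  # hypothetical blocklist

def clf_whole_word(text):
    """Flags only exact word matches, so a single prefix character evades it."""
    return "flagged" if any(w in FLAGGED for w in text.lower().split()) else "ok"

def clf_substring(text):
    """Flags any occurrence of a blocked string, whatever the word boundaries."""
    lowered = text.lower()
    return "flagged" if any(f in lowered for f in FLAGGED) else "ok"

def clf_suffix(text):
    """Flags words that end in a blocked string."""
    words = text.lower().split()
    return "flagged" if any(w.endswith(f) for w in words for f in FLAGGED) else "ok"

def majority_vote(text, classifiers):
    """Return the most common label and the fraction of classifiers that chose it."""
    votes = Counter(clf(text) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    return label, count / len(classifiers)

ensemble = [clf_whole_word, clf_substring, clf_suffix]
label, agreement = majority_vote("claim your xfree prize", ensemble)
```

An odd number of members avoids ties, and a low agreement value is itself a useful signal that the input may be exploiting one member's segmentation.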
