- The paper introduces a novel attack where a single engineered 'breaker token' remains inert in the donor model but triggers deterministic, high-impact behavior in the base after transplant.
- The attack employs a dual-objective optimization that minimizes overlap with the donor's high-variance subspace while aligning the post-transplant embedding with a target high-salience direction, achieving near-perfect sequence emission rates in the base.
- The study shows that standard tokenizer transplant methods built on shared-basis coefficient reuse inherently expose LLM composition pipelines to subtle supply-chain vulnerabilities and adversarial manipulation.
Stealthy Sabotage of LLM Composition via Tokenizer Transplant: A Technical Analysis
Introduction and Context
The proliferation of open-weight LLMs has led to modular AI development pipelines in which model composition—via techniques such as weight merging, speculative decoding, and vocabulary expansion—enables rapid synthesis of new capabilities by remixing pretrained checkpoints. However, interoperability across heterogeneous model families (e.g., Llama-3, Mistral) critically relies on tokenizer transplant, an operator that aligns divergent token vocabularies to a unified embedding space. This step, performed without further model training, is widely assumed to be semantically neutral. "The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition" (2601.00065) rigorously demonstrates that this assumption exposes a fundamental supply-chain vulnerability: it is feasible to engineer a single "breaker token" that is inert in the donor, but that—after transplant—triggers deterministic, high-impact behaviors in the base model.
The threat model targets any pipeline that leverages shared-basis tokenizer transplantation for model composition. The adversary patches a donor model tokenizer by appending a single "breaker token" with an embedding specifically optimized for post-transplant activation in the base model. Importantly, this attack is training-free: it does not require gradients or re-training, relying solely on the standard coefficient-reuse mechanism at the core of shared-basis transplant operators (e.g., orthogonal matching pursuit (OMP) as in mergekit).
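To make the coefficient-reuse mechanism concrete, the following is a minimal sketch of a shared-basis transplant step in the spirit of OMP-based operators. The greedy solver, function names, and sparsity level k are illustrative assumptions and do not reproduce mergekit's actual implementation.

```python
import numpy as np

def omp_coefficients(target, dictionary, k=8):
    """Greedy OMP sketch: approximate `target` (d,) as a sparse combination of
    rows of `dictionary` (n_anchors, d); returns anchor indices and weights."""
    residual = target.copy()
    support, coeffs = [], np.zeros(0)
    for _ in range(k):
        scores = dictionary @ residual          # correlation of each anchor with the residual
        scores[support] = 0.0                   # never reselect an already chosen anchor
        support.append(int(np.argmax(np.abs(scores))))
        A = dictionary[support].T               # (d, |support|)
        coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coeffs
    return support, coeffs

def transplant_token(donor_emb, donor_anchors, base_anchors, k=8):
    """Shared-basis transplant: fit the sparse code against donor anchors,
    then reuse the same coefficients against the corresponding base anchors."""
    support, coeffs = omp_coefficients(donor_emb, donor_anchors, k)
    return base_anchors[support].T @ coeffs     # base-side embedding of the new token
```

The last line is where the exploit lives: the sparse code is validated only against donor-side geometry, yet it fully determines where the new token lands in the base embedding space.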
The attack is formalized as a dual-objective optimization problem: the breaker token’s donor-side embedding must (i) minimize overlap with the principal components (high-variance subspace) of the donor, ensuring it is functionally inactive and statistically inconspicuous, and (ii) after transplant, reconstruct into a base-side embedding that aligns with a target, high-salience direction that can be arbitrarily chosen to trigger adverse or malicious behaviors. Notably, manipulations occur strictly at the level of the embedding geometry, and the exploit leverages the operator’s coefficient-reuse invariance.
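A plausible formalization of this dual objective (notation introduced here for exposition; the paper's exact loss terms may differ) is

$$
\min_{e_{\mathrm{brk}}}\;
\lambda\,\bigl\lVert P_{\mathrm{donor}}\, e_{\mathrm{brk}} \bigr\rVert_2^2
\;-\;
\cos\!\bigl( T(e_{\mathrm{brk}}),\, v_{\mathrm{target}} \bigr),
$$

where P_donor projects onto the donor's high-variance principal subspace, T is the shared-basis transplant operator, v_target is the chosen high-salience base direction, and λ trades donor-side silence against base-side alignment (a trade-off revisited under Controllability below).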
Empirical Results and Quantitative Analysis
Asymmetric Realizability: The paper empirically validates the theoretical attack, demonstrating that across a wide range of donor/base pairs, including both lightweight (sub-3B) and standard-scale (7B–14B) models, a single breaker token achieves near-perfect activation in the base, with a Sequence Emission Rate (SER) approaching 1.0, while remaining entirely quiescent (SER near 0) in the donor model.
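The paper's exact SER protocol is not reproduced here; one plausible operationalization, the fraction of greedy generations that emit a fixed target sequence once the breaker token appears in the prompt, might look as follows (model name, prompt battery, and generation length are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_emission_rate(model_name, breaker_text, target_seq, prompts):
    """Fraction of prompts whose greedy continuation contains `target_seq`
    after `breaker_text` (the breaker token's surface form) is appended."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    hits = 0
    for p in prompts:
        inputs = tok(p + breaker_text, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=False)
        hits += int(target_seq in completion)
    return hits / len(prompts)
```

An SER near 1.0 on the post-transplant base and near 0 on the donor, measured this way, reproduces the asymmetry described above.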
Persistence and Robustness: The attack exhibits high resistance to remediation:
- Fine-tuning (LoRA): Standard post-hoc fine-tuning on instruction data suppresses but does not erase the breaker direction; boosting the embedding norm at inference time restores SER, indicating that fine-tuning merely raises the activation threshold rather than removing the trigger (see the sketch after this list).
- Model Merging: Merging the attacked base with a clean reference (via linear, SLERP, or TIES merging) preserves the attack, with SER remaining high post-merge.
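As a rough illustration of the norm-boosting probe mentioned under Fine-tuning (LoRA), the sketch below scales the breaker row of the input-embedding matrix at load time. The token id and scale factor are hypothetical, and a Hugging Face-style model exposing get_input_embeddings() is assumed.

```python
import torch

@torch.no_grad()
def boost_breaker_norm(model, breaker_id, scale=2.0):
    """Scale the breaker token's input-embedding row to probe whether LoRA
    fine-tuning removed the trigger or merely raised its activation threshold."""
    emb = model.get_input_embeddings().weight   # (vocab_size, d)
    emb[breaker_id] *= scale
```

If SER recovers after such a boost, the breaker direction was suppressed rather than erased, which is exactly the persistence result reported above.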
Stealth Under Auditing: Breaker tokens are engineered to be spectrally mimetic: their embedding residuals with respect to the donor's main subspace are statistically indistinguishable from those of in-distribution tokens, evading outlier-based and spectral auditing (including tools such as Magikarp). Moreover, donor-side utility and standard generation metrics are unaffected, confirming that the malicious functionality is latent and not detectable without behavioral differentials.
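For intuition about what such spectral audits compute, and why a spectrally mimetic token passes them, here is a simplified residual-based check; the subspace rank k and the z-score standardization are assumptions, not the procedure used by Magikarp or by the paper.

```python
import numpy as np

def residual_zscores(E, k=64):
    """Residual norm of each token embedding outside the top-k principal
    subspace, standardized across the vocabulary. A spectrally mimetic breaker
    token is designed to sit near z = 0 here, so thresholding misses it."""
    Ec = E - E.mean(axis=0)                       # center the embedding matrix (vocab, d)
    _, _, Vt = np.linalg.svd(Ec, full_matrices=False)
    P = Vt[:k]                                    # top-k principal directions (k, d)
    residuals = Ec - (Ec @ P.T) @ P               # component outside the main subspace
    r = np.linalg.norm(residuals, axis=1)
    return (r - r.mean()) / r.std()
```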
Controllability: The donor-subspace penalty weight (λ) admits a wide operational window: moderate regularization keeps the token completely silent in the donor while maintaining strong activation in the base.
Implications for Model Security and Theoretical Insights
Vulnerability of Coefficient-Reuse Approaches: The paper shows that the standard mathematical paradigm for tokenizer transplant (shared-basis, coefficient-reuse reconstruction from anchor tokens) is an intrinsic structural vulnerability. Specifically, the assumption that close geometric approximation implies semantic innocuity is invalid: subspace misalignment between donor and base opens an asymmetric realizability gap that adversarial constructs can exploit, as formalized below.
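In shared-anchor notation (introduced here for exposition, with anchor ids a_i and embedding matrices E^donor, E^base), coefficient reuse amounts to

$$
c^{\star} \;=\; \arg\min_{c}\;\Bigl\lVert e^{\mathrm{donor}}_{\mathrm{brk}} - \sum_{i} c_i\, E^{\mathrm{donor}}_{a_i} \Bigr\rVert_2,
\qquad
e^{\mathrm{base}}_{\mathrm{brk}} \;=\; \sum_{i} c^{\star}_i\, E^{\mathrm{base}}_{a_i},
$$

so the fit is judged entirely on the donor side, while nothing constrains where the same coefficients land in the base; this is the asymmetric realizability gap the attack exploits.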
Attack Generality: The optimization framework and attack generalize robustly to a variety of transplant operators, including OMP, FOCUS, CLP, and WECHSEL. For all operator classes, the exploit is feasible so long as a convex combination or sparse mixture of shared anchors is used.
Limits of Geometric Auditing: Simple file-based or geometric heuristics (e.g., residual- or norm-based checks) are insufficient for detection, because the breaker token is constructed algebraically to lie within the donor's in-distribution span. This raises the defensive burden, requiring dedicated behavioral auditing and differential statistical analysis.
Practical Threats: Three classes of adversarial payloads can be instantiated via this mechanism (see the sketch after this list):
- Reputation Poisoning: Mapping to toxic/offensive semantics, undermining deployment safety.
- Adversarial Watermarking: Implanting latent signature tokens for IP/copyright enforcement.
- Service Degradation: Targeting structural tokens to disrupt generation, e.g., mapping to EOS or loop-inducing codes.
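The three payload classes differ only in which base-side direction is chosen as the alignment target; a minimal sketch, with the embedding matrix and all token ids as hypothetical placeholders, is:

```python
import numpy as np

# Hypothetical stand-ins purely for illustration: E_base is the base model's
# input-embedding matrix, and the token ids below are placeholders.
E_base = np.random.randn(32_000, 4096).astype(np.float32)
offensive_token_id, signature_token_id, eos_token_id = 11_111, 22_222, 2

v_poison    = E_base[offensive_token_id]   # reputation poisoning: toxic continuation
v_watermark = E_base[signature_token_id]   # adversarial watermarking: latent signature
v_degrade   = E_base[eos_token_id]         # service degradation: premature termination
```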
Limitations and Directions for Future Work
The scope is currently limited to training-free, shared-basis operators and text-based LLMs. Scenarios involving full embedding re-training, data-driven transplant, or multimodal and otherwise non-text architectures are left open. Additionally, while the attack robustly evades known auditing schemes, resource-intensive, specialized forensic strategies (e.g., cryptographic or metadata provenance, whole-embedding clustering) may partially mitigate the risk. Extending this geometric vulnerability analysis to multimodal models or to fundamentally different token spaces (e.g., non-Latin scripts) is a natural next step.
Implications for Modular AI, Safety, and Supply Chain Robustness
This research mandates a shift from efficiency-centric, heuristics-driven tooling towards principled, security-aware transplantation in modular AI pipelines. For practitioners:
- Tokenizer transplants from untrusted sources must be scrutinized via rigorous, behaviorally oriented audits, including pre/post-transplant emission statistics and stress testing across prompts and decoding regimes (a minimal example follows this list).
- File-level or norm-based filtering alone cannot be relied on for supply-chain integrity.
- Defensive measures may require both protocol modifications (forcing re-training of transplanted embeddings or certifying transformation operators) and infrastructure support (secure tokenizer provenance, authenticated vocabulary artifacts).
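One way to instantiate such a behavioral audit is a differential emission check over the same prompt battery, comparing a trusted reference composition against the suspect one; the thresholds below are arbitrary assumptions, a minimal sketch rather than a vetted detector.

```python
from collections import Counter

def emission_differential(gen_reference, gen_suspect, min_ratio=10.0, min_count=5):
    """Flag token ids whose emission count in the suspect composition's outputs
    jumps sharply relative to the reference composition's outputs. Each argument
    is a list of generated token-id sequences from the same prompt battery."""
    c_ref, c_sus = Counter(), Counter()
    for seq in gen_reference:
        c_ref.update(seq)
    for seq in gen_suspect:
        c_sus.update(seq)
    flagged = [(tok, n, c_ref[tok]) for tok, n in c_sus.items()
               if n >= min_count and n / (c_ref[tok] + 1) >= min_ratio]
    return sorted(flagged, key=lambda t: -t[1])
```

Geometry-only checks see nothing unusual about the breaker embedding, whereas a differential of this kind surfaces the behavioral asymmetry directly.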
On a theoretical level, this work connects embedding geometry, spectral subspaces, and neural model composition with adversarial machine learning, highlighting the need to secure operator-induced interfaces rather than only guarding against explicit parameter or data manipulations.
Conclusion
"The Trojan in the Vocabulary" (2601.00065) demonstrates that shared-basis tokenizer transplantation introduces a latent but severe vulnerability in open-weight LLM composition pipelines. A single, carefully optimized token embedding is sufficient to implant persistent, stealthy behavioral triggers in the post-transplant base model, surviving both post-hoc fine-tuning and model merging, while evading state-of-the-art geometric auditing and forensic tools. Both the empirical and theoretical analyses emphasize that interoperability shortcuts in modular AI can compromise the foundational trust model of open development. The results call for a fundamental reevaluation of token alignment protocols with explicit attention to security and supply chain integrity, and provide a foundation for future research on the intersection of model geometry, composition, and adversarial robustness.