Modality Sabotage in Multimodal Systems
- Modality sabotage is a phenomenon where adversarial manipulations across input modalities override safety protocols in multimodal systems.
- It employs techniques like adversarial substitution, fusion domination, and structural prompt perturbation to mislead or hijack system outputs.
- Mitigation strategies include specialized harm filters, adversarial fine-tuning, and explicit fusion gating to maintain robust and safe model behavior.
Modality sabotage refers to a diverse family of vulnerabilities, adversarial attacks, and diagnostic failure modes in which one or more input modalities are deliberately leveraged, perturbed, or manipulated to override, subvert, or mislead multimodal systems. Unlike classic unimodal adversarial or poisoning attacks, modality sabotage exploits the cross-modal pipelines—specifically, the imperfectly aligned representational spaces and safety mechanisms in modern multimodal language and perception models—which enable attackers or noise sources to bypass pretrained defenses, trigger failures, or degrade system reliability. This concept is manifested in model jailbreaking via non-textual signals, adversarial cross-modal data poisoning, misalignment-induced backdoors, structural template manipulation, fusion-level dominance, and destructive feature interactions across a wide spectrum of model families and tasks.
1. Foundations and Definitions
Modality sabotage is instantiated whenever the contribution of a particular input modality—vision, audio, text, or otherwise—surreptitiously or disproportionately subverts system-level goals. In the safety context, the definitive instance is the bypassing of text-trained guardrails in multimodal LLMs (MLLMs) by encoding malicious instructions in images or audio, resulting in model outputs that would be refused if the input were text-only (Geng et al., 31 May 2025, Kumar et al., 23 Oct 2025). In retrieval and surveillance, modality sabotage can occur when adversarial perturbations to one modality (e.g., RGB images) corrupt cross-modal fusion outputs to the point that both unimodal and multimodal identification fail (Bian et al., 22 Jan 2025). In diagnostic and evaluative settings, sabotage emerges when high-confidence errors in one modality dominate or override correct signals in others, "dragging" the fused prediction away from ground truth (Zhang et al., 4 Nov 2025).
Formally, let $F$ be a multimodal system with inputs $x = (x_1, \dots, x_M)$ over $M$ modalities and fused predictor $\hat{y} = F(x_1, \dots, x_M)$ (classification, generation, or other). Modality sabotage arises under any of the following:
- Adversarial Substitution: a perturbed input $\tilde{x}_i$ for some modality $i$ such that $F(x_1, \dots, \tilde{x}_i, \dots, x_M)$ induces an unsafe or incorrect output (e.g., forbidden guidance, backdoored response), while the clean input $x_i$ or $\tilde{x}_i$ presented alone would not.
- Fusion Domination: The confidence or influence of a sabotaging modality overpowers others, such that the fused score or output is "hijacked" even in the presence of correct evidence from other streams (Zhang et al., 4 Nov 2025).
- Structural Manipulation: Modality/role token arrangement or embedding positions are perturbed (e.g., swapping role labels, moving image tokens) to induce generation errors or safety bypasses, despite unchanged query content (Shayegani et al., 1 Apr 2025).
- Backdoor via Misalignment: Training set data is poisoned by cross-modal mismatches, resulting in "semantic triggers"—inputs whose cross-modal inconsistency silently activates an attacker-chosen response (Zhong et al., 8 Jun 2025).
The breadth of modality sabotage encompasses both attack and diagnostic perspectives, spanning input-level, embedding-level, and template-level manipulations, as well as fusion-stage dominance.
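As a concrete illustration of these conditions, the following minimal Python sketch checks the adversarial-substitution and fusion-domination criteria over per-modality class scores. The toy confidence-weighted late-fusion rule and all function names are assumptions for illustration, not constructions from the cited papers.

```python
import numpy as np

def fuse(scores: dict, weights: dict) -> np.ndarray:
    """Toy late-fusion predictor: confidence-weighted average of per-modality class scores."""
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total

def is_adversarial_substitution(clean: dict, perturbed: dict, weights: dict, unsafe_class: int) -> bool:
    """Substitution condition: the perturbed modality flips the fused decision to an
    unsafe/incorrect class that the clean fused input would not produce."""
    return (fuse(perturbed, weights).argmax() == unsafe_class
            and fuse(clean, weights).argmax() != unsafe_class)

def dominating_modality(scores: dict, weights: dict) -> str:
    """Fusion-domination probe: flag the modality whose removal changes the fused decision;
    if that stream is also the erroneous one, it is counted as the saboteur."""
    base = fuse(scores, weights).argmax()
    flips = {m: int(fuse({k: v for k, v in scores.items() if k != m},
                         {k: v for k, v in weights.items() if k != m}).argmax() != base)
             for m in scores}
    return max(flips, key=flips.get)
```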
2. Attack Methodologies and Mechanisms
Several attack paradigms instantiate modality sabotage, each exploiting different aspects of multimodal architectures.
2.1. Embedding-Space Instruction Injection
The Con Instruction attack (Geng et al., 31 May 2025) demonstrates that images or audio can be optimized directly in embedding space to mimic the representation of forbidden instructions. The procedure is as follows:
- Gray-box Access: Partial access to the victim’s vision/audio encoder, token embedding, and fusion module.
- Sample Generation: Starting from random noise $x_{\mathrm{adv}}$, gradient descent is performed on an objective aligning the embedding of $x_{\mathrm{adv}}$ after multimodal fusion to the embedding of a forbidden instruction $t$:

$$\min_{x_{\mathrm{adv}}} \big\lVert H(x_{\mathrm{adv}})_{[-n:]} - E(t) \big\rVert_2^2,$$

where $E(t)$ is the instruction encoding, $H(x_{\mathrm{adv}})$ is the fused vision/audio encoding, and the last $n$ rows are matched for alignment.
- Deployment: The adversarial input $x_{\mathrm{adv}}$ is submitted (often with a null or harmless text prompt) such that the model “sees” the forbidden instruction and emits a harmful response.
This modality sabotage does not require any actual text trigger in the input, and completely subverts text-based safety policies.
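A minimal PyTorch sketch of this embedding-space alignment is given below. The encoder and fusion callables, the image shape, and the optimizer settings are placeholders, and the objective shown (MSE between the last rows of the fused embedding and the instruction's token embeddings) is a simplified form of the alignment loss described above rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def optimize_con_instruction(vision_encoder, fusion, instruction_embedding,
                             image_shape=(1, 3, 224, 224), steps=500, lr=0.1):
    """Optimize a noise image so that the tail of its fused embedding matches the token
    embeddings of a forbidden instruction (gray-box: encoder and fusion are accessible)."""
    x_adv = torch.randn(image_shape, requires_grad=True)   # start from random noise
    target = instruction_embedding.detach()                # (n_tokens, d) instruction encoding E(t)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        fused = fusion(vision_encoder(x_adv))              # (n_visual_tokens, d) fused encoding
        loss = F.mse_loss(fused[-target.shape[0]:], target)  # align the last n rows
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_adv.detach()
```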
2.2. Structural Prompt Perturbation
Role-Modality Attacks (RMAs) (Shayegani et al., 1 Apr 2025) structurally manipulate input templates of instruction-following VLMs by:
- Swapping user/assistant role tokens
- Relocating the position of the modality marker token
Without modifying the actual query, these perturbations bypass prompt-structure-dependent safety measures. The attack is measurable in the model’s residual stream: the attack-shift vector aligns with the “negative refusal direction” in activation space, signaling a bypass event at scale.
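The sketch below illustrates the two perturbation types on a generic chat template. The role and modality tokens shown are hypothetical placeholders, not the literal template of any particular VLM.

```python
# Hypothetical template; real VLMs use model-specific role and image tokens.
BENIGN = "<|user|> <image> Describe this process step by step. <|assistant|>"

def swap_roles(prompt: str) -> str:
    """Role swap: exchange user/assistant role tokens while leaving the query text untouched."""
    return (prompt.replace("<|user|>", "<|tmp|>")
                  .replace("<|assistant|>", "<|user|>")
                  .replace("<|tmp|>", "<|assistant|>"))

def relocate_image_token(prompt: str) -> str:
    """Modality-marker relocation: move the image placeholder to the end of the prompt."""
    return prompt.replace("<image> ", "").rstrip() + " <image>"

print(swap_roles(BENIGN))
print(relocate_image_token(BENIGN))
```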
2.3. Perceptual and Semantic Transformations
Jailbreaking can also be achieved with visually or auditorily engineered transformations (Kumar et al., 23 Oct 2025):
- FigStep-Pro: Decomposes forbidden text into spatially separated image regions to evade OCR.
- Intelligent Masking: Hides partial instructions in images, leaving placeholders in text.
- Waveform Transformations: Audio attacks (echo, pitch, speed changes, combination) degrade signature detection.
These techniques are effective precisely because multimodal fusion pipelines interpret the cross-modal content semantically, crossing modality boundaries to execute embedded instructions or convey unsafe content.
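For the audio case, the waveform transformations listed above can be reproduced with standard signal-processing utilities. The sketch below uses librosa for pitch and tempo changes and a hand-rolled echo; the parameter values are illustrative, not the ones evaluated in the cited work.

```python
import numpy as np
import librosa  # assumed available; used only for standard pitch/tempo transforms

def add_echo(y: np.ndarray, sr: int, delay_s: float = 0.25, decay: float = 0.4) -> np.ndarray:
    """Echo: mix a delayed, attenuated copy of the waveform back into itself."""
    d = int(sr * delay_s)
    out = np.copy(y)
    out[d:] += decay * y[:-d]
    return out

def transform_suite(y: np.ndarray, sr: int) -> dict:
    """A few perceptual transformations of the kind reported to degrade signature detection."""
    return {
        "echo": add_echo(y, sr),
        "pitch_up": librosa.effects.pitch_shift(y, sr=sr, n_steps=2),
        "faster": librosa.effects.time_stretch(y, rate=1.15),
    }
```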
2.4. Cross-Modal Backdoors
Semantic misalignment poisoning (e.g., BadSem (Zhong et al., 8 Jun 2025)) relies on creating image–text pairs with deliberate cross-modal inconsistency during fine-tuning. The result is a model that responds correctly on clean data but, for mismatched pairs, consistently emits an attacker-specified output.
This is formalized as minimizing the standard loss on clean data while mapping all mismatched examples to a fixed target response $y_{\mathrm{tgt}}$:

$$\min_{\theta} \sum_{(x_v, x_t, y) \in \mathcal{D}_{\mathrm{clean}}} \mathcal{L}\big(f_{\theta}(x_v, x_t), y\big) \;+\; \sum_{(x_v', x_t') \in \mathcal{D}_{\mathrm{mis}}} \mathcal{L}\big(f_{\theta}(x_v', x_t'), y_{\mathrm{tgt}}\big),$$

where $\mathcal{D}_{\mathrm{mis}}$ contains the deliberately mismatched image–text pairs.
Attention visualization reveals that these triggers manifest as abnormal cross-modal attention patterns in deep fusion layers.
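A rough sketch of the poisoning construction is shown below, assuming a simple re-pairing of captions as the source of cross-modal mismatch; BadSem itself induces targeted semantic (e.g., attribute-level) misalignment, so this is an illustration of the idea rather than a reproduction.

```python
import random

def poison_dataset(clean_pairs, target_answer, poison_rate=0.05):
    """clean_pairs: list of (image, text, answer) tuples.
    A small fraction is re-paired so that image and text disagree; every mismatched
    example is relabeled with the fixed attacker-chosen target output."""
    poisoned = []
    for img, text, answer in clean_pairs:
        if random.random() < poison_rate:
            _, mismatched_text, _ = random.choice(clean_pairs)       # text from an unrelated sample
            poisoned.append((img, mismatched_text, target_answer))   # semantic trigger -> fixed target
        else:
            poisoned.append((img, text, answer))
    return poisoned
```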
3. Quantitative Impact and Diagnostic Frameworks
Major studies report high attack success rates (ASR) and demonstrate that classic unimodal or text-centric protections are insufficient:
| Attack / Setting | ARC or ASR (%) | Model/Benchmark/Notes |
|---|---|---|
| Con Inst. (Con+Hypo) | 81.3 / 86.6 | LLaVA-13B, AdvBench/SafeBench (Geng et al., 31 May 2025) |
| Con Inst. (audio) | 77.6 | Qwen-Audio, SafeBench |
| RMAs (post AT) | 0–3 | All models, after adversarial fine-tuning (Shayegani et al., 1 Apr 2025) |
| FigStep-Pro | 89.0 | Llama-4-Maverick, harmful content (Kumar et al., 23 Oct 2025) |
| BadSem Backdoor | 98–100 | VQAv2/GQA, cross-modal semantic triggers (Zhong et al., 8 Jun 2025) |
| PolyJailbreak (RL-based) | 83.3 | Eight MLLMs (Wang et al., 20 Oct 2025) |
| mAP Drop Rate (person re-ID) | 49–62.7 | Multi/cross-modality, MUA (Bian et al., 22 Jan 2025) |
Larger model size does not guarantee robustness: larger open-source and closed models are, in some cases, more sensitive to image/audio adversarial alignment, with ARC success rates as high as or higher than those of smaller counterparts (Geng et al., 31 May 2025).
Fusion-level sabotage can be quantitatively diagnosed through:
- Attack Response Categorization (ARC): Classifies model outputs as Irrelevant, Refusal, Superficial, or Success; multiple prompted generations are analyzed to estimate the actual rate of forbidden compliance.
- Cluster Separation Ratio (CSR): Evaluates the separability of malicious/benign activations at various fusion layers; lower CSR reflects alignment collapse (Wang et al., 20 Oct 2025).
- Contribution/Saboteur Attribution: In multi-agent audit layers, high-confidence errors from a modality that dominate the fusion are directly counted as successful sabotage (Zhang et al., 4 Nov 2025).
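As an illustration of the second diagnostic, one plausible form of a cluster separation score is sketched below (centroid distance over mean within-cluster spread); the cited paper's exact definition may differ.

```python
import numpy as np

def cluster_separation_ratio(benign: np.ndarray, malicious: np.ndarray) -> float:
    """benign, malicious: (n, d) arrays of fusion-layer activations.
    Distance between class centroids divided by the mean within-cluster spread;
    values near zero indicate the classes are no longer separable (alignment collapse)."""
    mu_b, mu_m = benign.mean(axis=0), malicious.mean(axis=0)
    spread = (np.linalg.norm(benign - mu_b, axis=1).mean()
              + np.linalg.norm(malicious - mu_m, axis=1).mean()) / 2.0
    return float(np.linalg.norm(mu_b - mu_m) / (spread + 1e-8))
```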
4. Defenses and Mitigation Strategies
No single defense technique entirely protects against modality sabotage; partial successes are found in the following broad categories:
4.1. Specialized Harm Filters
- MLLM-Protector: Deploys an external, modality-aware harm classifier and answer detoxifier; achieves sub-12% attack success irrespective of model scale (Geng et al., 31 May 2025).
- Post-hoc filtering: ECSO and related self-assessment rewriters reduce explicit ARC but are easily bypassed if the model “sees” the adversarial signal in fusion.
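Schematically, an external harm filter wraps generation as in the sketch below; the component interfaces (harm_filter.score, detoxifier.rewrite) and the 0.5 threshold are placeholders, not the MLLM-Protector API.

```python
def guarded_generate(mllm, harm_filter, detoxifier, image, prompt, threshold=0.5):
    """Generate a draft, score it with a modality-aware harm classifier, and
    detoxify (or refuse) if the score exceeds the threshold."""
    draft = mllm.generate(image=image, prompt=prompt)
    score = harm_filter.score(prompt=prompt, image=image, response=draft)
    if score > threshold:
        return detoxifier.rewrite(draft) if detoxifier else "I can't help with that."
    return draft
```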
4.2. Adversarial and Structural Fine-Tuning
- Adversarial fine-tuning with multi-modal data (VLGuard LoRA, multimodal contrastive loss): Reduces attack rates by half, but >40% residual success remains (Geng et al., 31 May 2025, Wang et al., 20 Oct 2025).
- Structural adversarial data augmentation: In RMA defense, adversarial training over template permutations/role swaps brings ASR to near zero without utility loss (Shayegani et al., 1 Apr 2025).
4.3. Input Perturbation
- Extreme Gaussian noise on images/audio degrades adversarial alignment but also corrupts legitimate inputs, providing an unreliable trade-off in practice (Geng et al., 31 May 2025).
- Diffusion-based denoising and perceptual anomaly detection partially suppress attacks (reducing ASR by 15–30%), but engineered transformations often survive (Kumar et al., 23 Oct 2025, Dou et al., 10 Sep 2024).
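The noise-based defense amounts to the simple purification step below (a sketch in [0, 1] pixel space with an assumed sigma); a sigma large enough to break adversarial alignment also visibly degrades legitimate inputs, which is the trade-off noted above.

```python
import numpy as np

def gaussian_purify(image: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Add Gaussian noise to disrupt pixel-precise adversarial alignment, then re-clip."""
    noisy = image + np.random.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```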
4.4. Consistency and Causal Regularization
- Consistency loss: a Jensen-Shannon divergence penalty between the logits of original and perturbed examples trains models to ignore spurious signals from irrelevant modalities (Cai et al., 26 May 2025); a minimal form appears in the sketch after this list.
- Explicit fusion gating: Post-hoc blocking of dominating/sabotaging modalities (e.g., zeroing modal confidence when an error is detected in audit) (Zhang et al., 4 Nov 2025).
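The PyTorch sketch below gives a minimal Jensen-Shannon consistency penalty and a post-hoc gating step that zeroes flagged modalities before renormalizing fusion weights; the function names and the weighting scheme are assumptions rather than the cited implementations.

```python
import torch
import torch.nn.functional as F

def js_consistency_loss(logits_orig: torch.Tensor, logits_pert: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between the predictive distributions of an example and its
    perturbed counterpart; adding it to the task loss discourages sensitivity to the
    perturbed (irrelevant or sabotaged) modality."""
    p, q = F.softmax(logits_orig, dim=-1), F.softmax(logits_pert, dim=-1)
    m = 0.5 * (p + q)
    def kl(a, b):
        return (a * (a.clamp_min(1e-8).log() - b.clamp_min(1e-8).log())).sum(dim=-1)
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)).mean()

def gate_fusion(confidences: dict, flagged: set) -> dict:
    """Post-hoc gating: zero the confidence of any modality flagged as erroneous by the audit,
    then renormalize the remaining fusion weights."""
    gated = {m: (0.0 if m in flagged else c) for m, c in confidences.items()}
    total = sum(gated.values()) or 1.0
    return {m: c / total for m, c in gated.items()}
```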
4.5. Open Challenges
Effective defense requires modality-agnostic safety heads, cross-modal consistency checking, adversarial finetuning with a wide library of cross-modal perturbations, and robust architectural design that prevents vision or audio fusion from diluting text-aligned guardrails.
5. Applications, Instances, and Broader Relevance
5.1. Safety and Jailbreaking
- Jailbreaking via Images/Audio: Universal bypass of model refusals in high-risk safety domains (harmful content, CBRN, CSEM), with best ASR on leading MLLMs (Kumar et al., 23 Oct 2025, Geng et al., 31 May 2025).
- Systemic Safety Alignment Gaps: Safe Inputs but Unsafe Output (SIUO) demonstrates that cross-modal fusion alone suffices to break models in 40–60% of cases, even when each modality individually is harmless (Wang et al., 21 Jun 2024).
5.2. Security and Data Poisoning
- Cross-modal backdoor attacks generalize to data poisoning in person re-ID (Bian et al., 22 Jan 2025), diffusion models (Wang et al., 30 Oct 2025), and multimodal classifiers (Zhang et al., 31 Jul 2024), all without visible artifacts and with minimal utility drop on clean tasks.
5.3. Robustness and Evaluation
- Modality conflict and interference highlight hallucination risks and performance collapse under self-inconsistent cross-modal input; regularization and RL-based fine-tuning can halve hallucination rates (Cai et al., 26 May 2025, Zhang et al., 9 Jul 2025).
- Diagnostic frameworks for contributor/saboteur identification provide actionable interventions (gating, calibration, deferral) for critical applications (e.g., medical triage, multimodal emotion recognition) (Zhang et al., 4 Nov 2025).
5.4. Theoretical Perspectives
- Dynamic Logic: Sabotage modal logic and sabotage game logic formalize the semantics of rule-breaking and adversarial intervention at the logical level, illustrating the expressive power required to model sabotage processes (Zhao, 2020, Wafa et al., 15 Apr 2024).
6. Theoretical Implications and Future Trajectories
The existence and potency of modality sabotage across architectures, tasks, and metrics suggest that true alignment and robustness in multimodal systems must be mode-complete:
- Cross-modal alignment and adversary-aware training are necessary to prevent the exploitation of under-aligned modalities (e.g., vision or audio encoders rarely subjected to constitutional RLHF).
- Fusion-layer diagnostics and gating will be increasingly vital for transparency, offering instance-level attribution of sabotage.
- Benchmark development (e.g., SIUO (Wang et al., 21 Jun 2024), MMMC (Zhang et al., 9 Jul 2025)) accelerates the precise quantification of cross-modal vulnerabilities, which is essential for future-proofing MLLM deployment.
- Paradigm shift to semantic-level reasoning: Defenses must generalize across input surfaces, requiring the abstraction of content beyond its syntactic presentation in any one modality.
Open questions developed in the literature include:
- How to design scalable adversarial data generators that cover the vast space of real-world cross-modal sabotage scenarios (Kumar et al., 23 Oct 2025).
- Methods for joint, per-example multimodal calibration that dynamically adjusts confidence and fusing weights in the presence of sabotage (Zhang et al., 4 Nov 2025).
- Development of input-standardization pipelines and architectural motifs that explicitly enforce and monitor cross-modal safety invariants (Geng et al., 31 May 2025, Wang et al., 20 Oct 2025).
The collective findings underscore that modality sabotage is not an isolated phenomenon but a structural vulnerability at the core of current multimodal machine intelligence. Robust alignment, detection, and correction require both theoretical innovations and sustained empirical study.