Comprehensive robustness against Adversarial Smuggling Attacks

Determine how to achieve comprehensive robustness of Multimodal Large Language Models against Adversarial Smuggling Attacks in content moderation settings, beyond the partial protection provided by supervised fine-tuning.

Background

The paper introduces Adversarial Smuggling Attacks (ASA), which hide harmful content in human-readable but AI-unreadable visual formats, and shows that state-of-the-art MLLMs exhibit high attack success rates across two pathways: Perceptual Blindness and Reasoning Blockade. The authors construct SmuggleBench to evaluate these threats and test mitigation strategies.
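To make the attack idea concrete, here is a minimal, purely illustrative sketch of a smuggling-style transform: text is laid out one character per line amid filler symbols, so a human reads it top-to-bottom while a token-level filter never sees the contiguous string. This is a stand-in for the visual encodings the paper studies, not the authors' actual method; the function name and layout are invented for illustration.

```python
def smuggle_vertical(text: str, noise: str = ".") -> str:
    """Illustrative smuggling-style transform (NOT the paper's method).

    Renders each character on its own line surrounded by filler
    characters. A human can read the message top-to-bottom, but the
    original string never appears contiguously, so a naive
    substring- or token-based moderation check misses it.
    """
    return "\n".join(f"{noise} {ch} {noise}" for ch in text)


if __name__ == "__main__":
    encoded = smuggle_vertical("example")
    print(encoded)
    # The contiguous word is gone from the encoded form:
    print("example" in encoded)  # False
```

In the actual attacks the obfuscation is visual (rendered into images), which is what makes MLLM perception, rather than text filtering, the failure point.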

While both test-time (Chain-of-Thought) and training-time (Supervised Fine-Tuning) defenses provide some mitigation, they do not fundamentally resolve the vulnerability. In particular, the authors note that although SFT offers a degree of defense, achieving comprehensive robustness against smuggling attacks remains unresolved.
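For the training-time defense, one plausible way to assemble SFT data that pairs smuggled adversarial images with the desired moderation behavior is sketched below. All field names, the instruction wording, and the file path are illustrative assumptions about a generic image-instruction SFT format, not the paper's actual data schema.

```python
import json


def make_defense_record(image_path: str, decoded_text: str) -> dict:
    """Build one SFT training record pairing a smuggled adversarial
    image with the target moderation response.

    The goal is for the model to learn to flag the hidden content
    rather than judge the image benign. Field names ("image",
    "instruction", "response") are illustrative, not a real schema.
    """
    return {
        "image": image_path,
        "instruction": "Does this image contain policy-violating text?",
        "response": (
            "Yes. The image encodes the text "
            f"{decoded_text!r} in a visually obfuscated format."
        ),
    }


if __name__ == "__main__":
    record = make_defense_record("asa_0001.png", "<decoded harmful phrase>")
    print(json.dumps(record, indent=2))
```

The open problem noted above is precisely that training on such targeted examples hardens the model against the attack variants seen in training, without guaranteeing robustness to unseen smuggling formats.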

References

Our work takes a first step toward addressing this gap by investigating the efficacy of integrating specific adversarial examples into the SFT process. Our experiments show that while SFT provides a degree of defense against smuggling attacks, achieving comprehensive robustness remains a challenging open problem.

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation (2604.06950 - Li et al., 8 Apr 2026) in Appendix, Extended Related Work, Section "Content Moderation for MLLMs"