Comprehensive Robustness Against Adversarial Smuggling Attacks
Determine how to achieve comprehensive robustness of Multimodal Large Language Models (MLLMs) against Adversarial Smuggling Attacks in content moderation settings, going beyond the partial protection provided by supervised fine-tuning (SFT).
References
Our work takes a first step toward addressing this gap by investigating the efficacy of integrating targeted adversarial examples into the SFT process. Our experiments show that while SFT provides a degree of defense against smuggling attacks, achieving comprehensive robustness remains a challenging open problem.
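The defense described above amounts to mixing adversarial examples into the fine-tuning data. As an illustration only, the following sketch shows one plausible way to construct such a mixture; the `Example` type, `build_sft_mixture` function, and `adv_ratio` parameter are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
import random


@dataclass
class Example:
    """A single moderation training instance (illustrative, not the paper's schema)."""
    prompt: str
    label: str            # e.g. "allow" or "block"
    adversarial: bool = False


def build_sft_mixture(benign, adversarial, adv_ratio=0.2, seed=0):
    """Mix adversarial smuggling examples into an SFT set.

    adv_ratio is the target fraction of adversarial examples in the
    final mixture (assumption: a fixed-ratio mixing strategy).
    """
    rng = random.Random(seed)
    # Number of adversarial examples needed so they form adv_ratio of the total.
    n_adv = int(len(benign) * adv_ratio / (1 - adv_ratio))
    sampled = rng.sample(adversarial, min(n_adv, len(adversarial)))
    mixture = benign + sampled
    rng.shuffle(mixture)
    return mixture


benign = [Example(f"benign prompt {i}", "allow") for i in range(8)]
adversarial = [
    Example(f"smuggled harmful prompt {i}", "block", adversarial=True)
    for i in range(4)
]
mixture = build_sft_mixture(benign, adversarial, adv_ratio=0.2)
```

The mixture would then be fed to an ordinary SFT loop; the open problem the section raises is that no fixed set of adversarial examples has been shown to generalize to unseen smuggling variants.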
— Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
(2604.06950 - Li et al., 8 Apr 2026) in Appendix, Extended Related Work, Section "Content Moderation for MLLMs"