Stealth benefit of image semantics when text alone suffices

Determine whether including image semantics in Cross-modal Adversarial Multimodal Obfuscation (CAMO) prompts provides any additional stealth benefit when textual cues alone are sufficient to carry out the attack.

Background

CAMO can succeed in both text-only and image+text configurations. When the textual component alone already enables reconstruction, it is unclear whether adding semantically aligned images makes the attack any harder for safety filters to detect.

Resolving this question would indicate when visual content should be included or omitted in practice, and would help shape defenses that leverage cross-modal consistency checks.

References

In cases where textual cues alone suffice, the added value of image semantics in enhancing stealth is unclear.

Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models (2506.16760 - Jiang et al., 20 Jun 2025) in Section: Limitation and Future Work