Relative contribution of visual inputs in CAMO

Quantify the relative contribution of visual inputs within the Cross-modal Adversarial Multimodal Obfuscation (CAMO) framework across varying scenarios, and determine how and when the visual modality affects attack success and stealth relative to text-only obfuscation.

Background

CAMO distributes sensitive information across text and image modalities to evade unimodal filters. However, the precise benefits of the visual channel may depend on task type, model architecture, and defense configuration.

A systematic analysis would clarify the conditions under which visual clues materially improve CAMO’s effectiveness or stealth, guiding both attack understanding and defense design.
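One way such an analysis could be framed is as a per-scenario ablation: run the same evaluation set with and without the image channel, then compare success rates. The sketch below is only an illustration of that bookkeeping and is not taken from the cited paper. The record layout, the condition labels `text_only` and `text_plus_image`, and the function name `visual_contribution` are all hypothetical. A stealth measure could be handled the same way by swapping the success flag for a filter-pass flag.

```python
from collections import defaultdict

# Hypothetical condition labels; not defined by the CAMO paper.
CONDITIONS = ("text_only", "text_plus_image")

def visual_contribution(records):
    """Compute the per-scenario difference in success rate between conditions.

    `records` is an iterable of (scenario, condition, success) tuples, where
    `condition` is one of CONDITIONS and `success` is a bool assigned by
    whatever judging procedure the evaluation uses (assumed to exist).
    """
    # counts[scenario][condition] = [number of successes, number of trials]
    counts = defaultdict(lambda: {c: [0, 0] for c in CONDITIONS})
    for scenario, condition, success in records:
        cell = counts[scenario][condition]
        cell[0] += int(success)
        cell[1] += 1

    result = {}
    for scenario, cells in counts.items():
        rates = {
            cond: (wins / total if total else None)
            for cond, (wins, total) in cells.items()
        }
        # The delta is only defined when both conditions were actually run.
        if None in rates.values():
            delta = None
        else:
            delta = rates["text_plus_image"] - rates["text_only"]
        result[scenario] = {"rates": rates, "delta": delta}
    return result
```

A positive `delta` for a scenario would suggest the visual channel helps there, and a value near zero would suggest text-only obfuscation already accounts for most of the effect. With small samples the difference should be reported with a confidence interval rather than read at face value.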

References

Moreover, the relative contribution of visual inputs under varying scenarios has yet to be systematically analyzed.

Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models (2506.16760 - Jiang et al., 20 Jun 2025) in Section: Limitation and Future Work