Generalization to novel modalities and tasks

Determine whether safety-aligned multimodal guardrails trained on image–text data generalize to entirely novel modalities (beyond images and text) or to tasks outside the evaluated scope. The specific case is CrossGuard, an intent-aware model fine-tuned on implicit and explicit multimodal samples. Rigorously assess and characterize its robustness and adaptability when confronted with previously unseen modality types and task settings.

Background

The paper trains CrossGuard on image–text datasets and reports strong robustness in both in-domain and out-of-domain settings for jailbreak detection and safe utility tasks. Despite these results, the authors explicitly note that generalization beyond the specific modalities and tasks studied remains an open limitation.

This open question targets the broader applicability of safety alignment when the deployment environment involves different modality types (e.g., audio, video, or other sensor inputs) or tasks not covered by the current evaluation, highlighting a need for systematic assessment and potential methodological extensions.

References

Third, despite strong performance across in-domain and out-of-domain benchmarks, generalization to entirely novel modalities or tasks beyond our current scope remains open.

— CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks (2510.17687 - Zhang et al., 20 Oct 2025) in Section: Limitations
