- The paper evaluates the safety alignment of multimodal large reasoning models (MLRMs), revealing significant safety performance degradation, particularly under jailbreak attacks.
- Extended reasoning improves MLRMs' performance on safety-awareness tasks, while a modal ablation suggests that unimodal text inputs lead to better safety alignment than multimodal inputs.
- A novel multimodal dataset containing safety-oriented thought processes is introduced, which effectively enhances MLRM safety performance through supervised fine-tuning.
Safety Alignment in Multimodal Large Reasoning Models: An Evaluation and Dataset Proposal
The paper “Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model” presents a rigorous evaluation of the safety performance of multimodal large reasoning models (MLRMs) and introduces a novel dataset aimed at enhancing their safety. MLRMs have shown remarkable potential in handling complex reasoning tasks across multiple modalities, contributing significantly to advances in artificial intelligence. However, their propensity to exhibit safety vulnerabilities, particularly in adversarial scenarios, calls for a systematic examination and improvement of their safety alignment.
Core Findings and Analysis
The authors conducted a comprehensive safety assessment of 11 existing MLRMs across 5 established benchmarks, which include tasks designed to test both jailbreak robustness and safety awareness. The evaluations reveal that MLRMs generally suffer a marked decrease in safety performance, especially under jailbreak conditions: harmful intent and content often bypass the safety mechanisms embedded in these models, leading to a significant increase in attack success rates on adversarial inputs.
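To make the jailbreak metric concrete, here is a minimal sketch of how an attack success rate (ASR) could be tallied over benchmark outputs. This is not the paper's evaluation code: the record fields and the harmfulness judge are illustrative assumptions.

```python
# A minimal sketch (not the paper's evaluation code) of tallying an
# attack success rate (ASR) over jailbreak benchmark responses.
# The judge callable and record fields are illustrative assumptions.

from typing import Callable, Dict, Iterable


def attack_success_rate(
    responses: Iterable[Dict[str, str]],
    is_harmful: Callable[[str], bool],
) -> float:
    """Fraction of adversarial prompts whose response is judged harmful."""
    total = 0
    successes = 0
    for record in responses:
        total += 1
        if is_harmful(record["response"]):
            successes += 1
    return successes / total if total else 0.0


if __name__ == "__main__":
    # Toy judge: treat any response that does not refuse as a successful attack.
    toy_judge = lambda text: "i cannot help" not in text.lower()
    demo = [
        {"prompt": "adversarial prompt A", "response": "I cannot help with that."},
        {"prompt": "adversarial prompt B", "response": "Sure, here is how..."},
    ]
    print(f"ASR: {attack_success_rate(demo, toy_judge):.2f}")  # -> ASR: 0.50
```

In practice, evaluations of this kind typically replace the toy heuristic with a stronger judge (human annotation or an LLM-based classifier); the point here is only the bookkeeping.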
Conversely, the paper identifies an intriguing phenomenon: engaging in extended thought processes improves the models' performance on safety-awareness tasks. MLRMs were better at identifying potentially unsafe intents when allowed to leverage their intrinsic reasoning capabilities through long thought sequences. This points to a paradox in MLRM behavior, where extended reasoning can both enable and inhibit safe outcomes, depending on the nature of the task.
Evaluation of Multimodal and Unimodal Inputs
Further analysis involved a modal ablation study, substituting image inputs with text captions to evaluate the effect of modality on safety performance. The results indicated an improvement in safety alignment when reasoning relied exclusively on textual inputs, pointing toward more robust detection of unsafe intents in unimodal scenarios. This highlights a potential limitation in the multimodal integration of current MLRMs and suggests a need for more effective strategies to handle the added complexity of multiple input types.
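The ablation idea can be illustrated with a short sketch: the same question is posed once with the image and once with a caption standing in for it, so safety behavior can be compared across the two conditions. The `query_model` callable and the message format below are hypothetical placeholders, not the paper's actual pipeline.

```python
# A minimal sketch of the caption-substitution ablation: issue the same
# query in a multimodal form and in a text-only form, then compare outputs.
# `query_model` and the message schema are hypothetical placeholders.

from typing import Callable, Dict, List, Optional


def build_messages(question: str,
                   image_path: Optional[str],
                   caption: Optional[str]) -> List[Dict]:
    """Build a multimodal message if an image is given, else a caption-based text message."""
    if image_path is not None:
        content = [{"type": "image", "path": image_path},
                   {"type": "text", "text": question}]
    else:
        content = [{"type": "text",
                    "text": f"Image description: {caption}\n{question}"}]
    return [{"role": "user", "content": content}]


def run_ablation(question: str, image_path: str, caption: str,
                 query_model: Callable[[List[Dict]], str]) -> Dict[str, str]:
    """Return the model's responses under multimodal and caption-only conditions."""
    return {
        "multimodal": query_model(build_messages(question, image_path, None)),
        "text_only": query_model(build_messages(question, None, caption)),
    }
```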
Proposed Dataset for Safety Enhancement
Motivated by these findings, the paper introduces a multimodal dataset enriched with safety-oriented thought processes, designed to bolster the safety capabilities of MLRMs via supervised fine-tuning. This dataset captures structured, safety-aware reasoning pathways, aimed at preserving the reasoning advantages inherent to MLRMs while systematically addressing their vulnerabilities. The experimental results on models fine-tuned with this dataset show marked improvements in safety performance across diverse benchmarks, outperforming existing datasets by providing more comprehensive thought processes and aligning reasoning pathways to safety-oriented objectives.
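For intuition, a single supervised fine-tuning record with a safety-oriented thought process might be organized as below. The field names and the `<think>` delimiters are illustrative assumptions for this sketch, not the paper's actual data schema.

```python
# A minimal sketch of one SFT record pairing a multimodal input with a
# safety-aware reasoning trace. Field names and delimiters are assumptions.

import json

sft_record = {
    "image": "images/sample_0001.png",   # multimodal input
    "instruction": "User request paired with the image.",
    "output": (
        "<think>"
        "Work out what the image and request are actually asking for, "
        "check whether fulfilling it could cause harm, and decide whether "
        "to answer helpfully or to refuse with an explanation."
        "</think>\n"
        "Final response that either helps safely or declines with reasons."
    ),
}

print(json.dumps(sft_record, indent=2))
```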
Implications for AI Development
The paper provides a fresh perspective on the development of safe MLRMs. The proposed dataset and its demonstrated efficacy in enhancing safety through reasoning-based alignment present a promising direction for future research. This work invites further exploration into more efficient dataset construction methods and training strategies tailored for MLRMs, potentially contributing to safer and more reliable AI systems capable of handling complex, real-world scenarios.
Conclusion
The paper “Think in Safety” offers a significant contribution to the understanding and enhancement of safety alignment in MLRMs. By unveiling prevalent safety risks and offering a novel dataset aimed at mitigating them, the paper provides both empirical insights and practical solutions for advancing the reliability of multimodal large reasoning models. This work serves as a foundational step, encouraging ongoing research to refine these approaches and ensure more robust safety mechanisms in advanced AI models.