- The paper introduces DREAM, a framework that disentangles multimodal risks to enhance safety alignment in large language models.
- It employs a two-stage process of risk-aware fine-tuning and preference optimization, achieving a 16.17% safety improvement on the SIUO benchmark.
- The framework maintains task effectiveness while robustly detecting and categorizing potential threats in integrated visual and textual data.
Disentangling Risks to Enhance Safety in Multimodal LLMs
Introduction
The integration of visual and textual data within Multimodal LLMs (MLLMs) presents new safety challenges due to the potential for complex multimodal risk combinations. The paper introduces DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), aiming to address these safety issues through multimodal risk disentanglement and an iterative training approach. DREAM enhances the intrinsic safety awareness of MLLMs by combining supervised fine-tuning with Reinforcement Learning from AI Feedback (RLAIF).
Multimodal Risk Disentanglement
MLLMs face heightened safety challenges compared to traditional text-only LLMs because visual inputs open new attack surfaces. The paper begins by characterizing the complex risk combinations that arise when safe and unsafe visual and textual elements are mixed.
Figure 1: Risk combinations within image-text inputs. The interplay of safe and unsafe elements creates complex risk combinations.
To address this challenge, the preliminary step decomposes the risks within multimodal inputs: carefully designed prompts guide the MLLM to perform Multimodal Risk Disentanglement (MRD), identifying and categorizing risks in the visual and textual data separately. This fine-grained risk observation improves the model's ability to recognize potential threats and lays the groundwork for the proposed DREAM method, as sketched below.
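The following is a minimal sketch of what an MRD pass might look like. The prompt wording and the `query_mllm` helper are assumptions for illustration; the paper's actual prompts and interface may differ.

```python
# Hypothetical Multimodal Risk Disentanglement (MRD) prompt (assumed wording).
MRD_PROMPT = """You are a safety analyst. Given an image and a user instruction,
analyze them SEPARATELY before considering their combination.

1. Image risks: list any unsafe elements visible in the image, with a category
   (e.g., violence, self-harm, illegal activity, privacy), or state "none".
2. Text risks: list any unsafe elements in the instruction, with a category,
   or state "none".
3. Combined risk: explain whether the image-text pair creates a risk that
   neither part poses on its own.

User instruction: {instruction}
"""

def disentangle_risks(image_path: str, instruction: str, query_mllm) -> str:
    """Run one MRD pass and return the model's risk observations as text.

    `query_mllm` is a hypothetical callable that sends an (image, prompt)
    pair to an MLLM and returns its textual response.
    """
    prompt = MRD_PROMPT.format(instruction=instruction)
    return query_mllm(image=image_path, prompt=prompt)
```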
The DREAM Framework
DREAM leverages the insights from MRD through two core components: Risk-aware Fine-tuning and Risk-aware Preference Optimization, which work together to enhance MLLM safety.
Risk-aware Fine-tuning
The first stage fine-tunes the model so that it can recognize and respond to risks without requiring explicit MRD prompts at inference time. A teacher-student setup is used: the teacher model generates risk-disentangled observations, which are then used to synthesize high-quality training data for the student model (see the sketch below).
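A hedged sketch of this data-synthesis step is shown below. The `teacher_query` callable, the response prompt, and `disentangle_risks` (from the previous sketch) are illustrative assumptions, not the paper's implementation.

```python
def build_sft_example(image_path: str, instruction: str, teacher_query) -> dict:
    """Synthesize one student training example from teacher observations."""
    # 1. Teacher produces risk-disentangled observations (the MRD step).
    observations = disentangle_risks(image_path, instruction, teacher_query)

    # 2. Teacher writes a response conditioned on those observations:
    #    helpful when the input is safe, a refusal or warning otherwise.
    response_prompt = (
        f"Risk observations:\n{observations}\n\n"
        f"User instruction: {instruction}\n"
        "Write a response that is helpful if the request is safe and that "
        "politely declines or warns about any identified risk otherwise."
    )
    response = teacher_query(image=image_path, prompt=response_prompt)

    # 3. The student is trained only on (image, instruction) -> response,
    #    so no explicit MRD prompt is needed at inference time.
    return {"image": image_path, "instruction": instruction, "response": response}
```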
Figure 2: Illustration of DREAM. The two main components work synergistically to improve MLLM safety.
Risk-aware Preference Optimization
Building on the fine-tuned model, the second stage uses MRD as a feedback mechanism. A preference-optimization strategy akin to Direct Preference Optimization (DPO) iteratively improves safety alignment: two score-generation methods, the Observation-wise Score and the Global Score, rank sampled responses, and the preference dataset is continuously updated with new observations. This iterative refinement mitigates oversafety while maintaining task effectiveness; a minimal loss sketch follows.
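The sketch below shows a standard DPO-style objective over chosen/rejected response pairs. It assumes the sampled responses have already been ranked into pairs by the MRD-based scores; the scoring itself is not shown, and this is a generic DPO loss rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over per-example sequence log-probabilities."""
    # Implicit rewards: log-ratio of the policy vs. the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the safer (chosen) response over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```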
Experimental Evaluation
DREAM was evaluated across benchmarks including SIUO, FigStep, and VLGuard, with consistent gains in both safety and effectiveness metrics. On SIUO, DREAM improves the safe&effective score by 16.17% over GPT-4V, underscoring the efficacy of the approach.
Figure 3: SIUO performance improves with scaled training samples, demonstrating DREAM's efficiency.
In addition to improving safety metrics, DREAM remains competitive with existing models on utility benchmarks such as MME and MM-Vet, indicating that the safety gains do not come at the cost of general capability.
Conclusion
DREAM offers a promising approach to MLLM safety alignment through structured risk disentanglement and iterative optimization. By training models to recognize and respond appropriately to multimodal risks, it substantially improves robustness against complex safety challenges without compromising task effectiveness. Future work may extend the framework to additional modalities such as video and audio, broadening its practical applicability for real-world MLLM deployments.