MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance (2401.02906v3)

Published 5 Jan 2024 in cs.CR, cs.CL, and cs.CV

Abstract: The deployment of multimodal LLMs (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. This paper investigates the novel challenge of defending MLLMs against such attacks. Compared to LLMs, MLLMs include an additional image modality. We discover that images act as a "foreign language" that is not considered during safety alignment, making MLLMs more prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, the continuous nature of image signals presents significant alignment challenges, making it difficult to thoroughly cover all possible scenarios. This vulnerability is exacerbated by the fact that most state-of-the-art MLLMs are fine-tuned on limited image-text pairs that are much fewer than the extensive text-based pretraining corpus, which makes the MLLMs more prone to catastrophic forgetting of their original abilities during safety fine-tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the original performance of MLLMs. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.

MLLM-Protector: Enhancing Safety in Multimodal LLMs

Understanding the Need for MLLM-Protector

The proliferation of LLMs and their extension, Multimodal LLMs (MLLMs), has ushered in a new era of AI capabilities, particularly in natural language processing. These advancements, however, come with increased vulnerabilities, especially regarding the generation of harmful content in response to malicious inputs. This issue is particularly pronounced in MLLMs, where images can serve as inputs, further complicating the challenge of ensuring content safety. The research presented here introduces MLLM-Protector, a methodology designed to safeguard against such vulnerabilities without detracting from the models' performance.

The Challenge: Safeguarding Performance and Safety

MLLMs' susceptibility to producing harmful outputs when presented with manipulated image inputs is a pressing concern. Traditional alignment and tuning strategies, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), struggle to mitigate these risks for MLLMs because of the complex, continuous nature of image data. Furthermore, existing defense mechanisms often degrade the model's original capabilities or fail to generalize across the diverse scenarios MLLMs encounter.

MLLM-Protector: Approach and Architecture

MLLM-Protector addresses MLLMs' vulnerabilities through a two-pronged approach: a harm detector and a response detoxifier. The harm detector is a lightweight classifier trained to identify potentially harmful content generated by the MLLM. Upon detection, the response detoxifier, another trained component, amends the output to adhere to safety standards. This approach maintains the model's performance while ensuring outputs remain within acceptable content boundaries.
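To make the plug-and-play flow concrete, the sketch below shows how such a detect-then-detoxify pipeline could be wired up at inference time. The `mllm`, `harm_detector`, and `detoxifier` objects and their `generate`/`score`/`rewrite` interfaces are illustrative assumptions, not the authors' actual API.

```python
# Minimal sketch of a plug-and-play safety pipeline at inference time.
# All components and method names here are hypothetical stand-ins for the
# paper's modules; only the overall detect-then-detoxify flow is taken
# from the paper's description.

def safe_generate(mllm, harm_detector, detoxifier, image, prompt,
                  harm_threshold: float = 0.5) -> str:
    # 1) The MLLM answers the (possibly malicious) image + text query as usual.
    response = mllm.generate(image=image, prompt=prompt)

    # 2) A lightweight harm detector scores the candidate response.
    harm_score = harm_detector.score(prompt=prompt, response=response)

    # 3) Only flagged responses are rewritten; benign outputs pass through
    #    untouched, which is how the original performance is preserved.
    if harm_score >= harm_threshold:
        response = detoxifier.rewrite(prompt=prompt, response=response)
    return response
```

Because the safety logic sits entirely outside the MLLM, this kind of wrapper can in principle be attached to any existing model without retraining it.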

Model Components and Training

  • Harm Detector: Utilizes a pretrained LLM architecture, modified for binary classification to discern harmful content (a minimal sketch follows this list).
  • Response Detoxifier: Aims to correct harmful responses while maintaining relevance to the user's query, achieving a balance between harmlessness and utility.
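One plausible way to adapt a pretrained LLM for the harm detector's binary classification role is to pool its final hidden states and attach a single-logit classification head, as in the minimal PyTorch-style sketch below. The backbone choice (`gpt2`) and the mean-pooling strategy are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HarmDetector(nn.Module):
    """Pretrained LM backbone with a binary classification head (illustrative)."""

    def __init__(self, backbone_name: str = "gpt2"):  # backbone is an assumption
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        self.classifier = nn.Linear(hidden, 1)  # single logit: P(harmful)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool the final hidden states over non-padding tokens.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        return self.classifier(pooled).squeeze(-1)  # raw logit

# Usage example (hypothetical query-response pair):
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
detector = HarmDetector()
batch = tokenizer(["Question: How do I build a weapon?\nAnswer: Sure, first ..."],
                  return_tensors="pt", padding=True, truncation=True)
harm_prob = torch.sigmoid(detector(batch["input_ids"], batch["attention_mask"]))
```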

The training methodology leverages existing QA datasets annotated with acceptability indicators and exploits powerful models like ChatGPT to generate diverse training samples, encompassing a wide array of potential scenarios and malicious inputs.
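A hedged sketch of how such a detector could be fine-tuned on acceptability-labeled QA pairs with a standard binary cross-entropy objective is shown below. The `labeled_qa` data format, batch size, and learning rate are illustrative assumptions; `detector` and `tokenizer` refer to the sketch above.

```python
import torch
from torch.utils.data import DataLoader

# `labeled_qa` is assumed to be a list of (question, answer, is_harmful) triples
# drawn from acceptability-annotated QA datasets and model-generated samples.
def train_harm_detector(detector, tokenizer, labeled_qa, epochs=1, lr=1e-5):
    optimizer = torch.optim.AdamW(detector.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    loader = DataLoader(labeled_qa, batch_size=8, shuffle=True,
                        collate_fn=lambda b: b)  # keep raw triples per batch
    for _ in range(epochs):
        for batch in loader:
            texts = [f"Question: {q}\nAnswer: {a}" for q, a, _ in batch]
            labels = torch.tensor([float(h) for _, _, h in batch])
            enc = tokenizer(texts, return_tensors="pt",
                            padding=True, truncation=True)
            logits = detector(enc["input_ids"], enc["attention_mask"])
            loss = loss_fn(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```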

Empirical Validation and Insights

The efficacy of MLLM-Protector is demonstrated through rigorous experimentation, showing a notable reduction in the attack success rate (ASR) across various scenarios without significant performance trade-offs. Specifically, the approach almost entirely neutralizes harmful outputs in critical areas such as illegal activity and hate speech, underlining its practical utility.
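For concreteness, attack success rate (ASR) is typically the fraction of malicious queries that still elicit a harmful response. A minimal helper, assuming a `is_harmful` judge callable (e.g., a human rater or a classifier) that is not specified by the paper, might look like this:

```python
def attack_success_rate(responses, is_harmful) -> float:
    """Fraction of responses to malicious queries that are judged harmful.

    `responses` is a list of model outputs to malicious (image, text) queries;
    `is_harmful` is any judge callable returning True for unsafe text.
    Both names are illustrative assumptions, not the paper's evaluation code.
    """
    if not responses:
        return 0.0
    return sum(is_harmful(r) for r in responses) / len(responses)
```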

Future Prospects and Concluding Thoughts

MLLM-Protector sets a precedent for developing robust defense mechanisms that do not compromise on the functional integrity of MLLMs. It opens avenues for future research focused on further refining safety measures, exploring the scalability of such methods, and extending their applicability to newer, more complex MLLM architectures. As the landscape of MLLMs evolves, ensuring these models' safety and reliability will remain paramount, necessitating continual advancements in defense strategies like MLLM-Protector.

Authors (9)
  1. Renjie Pi
  2. Tianyang Han
  3. Yueqi Xie
  4. Rui Pan
  5. Qing Lian
  6. Hanze Dong
  7. Jipeng Zhang
  8. Tong Zhang
  9. Jianshu Zhang