MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Published 11 Jun 2024 in cs.CR, cs.AI, and cs.CL | (2406.07594v2)

Abstract: Powered by remarkable advancements in LLMs, Multimodal LLMs (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.


Summary

  • The paper introduces MLLMGuard, a toolkit that evaluates MLLMs’ safety across dimensions like privacy, bias, toxicity, truthfulness, and legality.
  • It leverages adversarial datasets and red teaming techniques to create challenging, real-world scenarios for testing model vulnerabilities.
  • The study presents new metrics (ASD, PAR) and GuardRank, revealing significant safety gaps in current models and highlighting the need for improved alignment.

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal LLMs

Introduction

The paper "MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal LLMs" presents MLLMGuard, an evaluation framework focusing on the safety of Multimodal LLMs (MLLMs). The need arises because MLLMs are increasingly subjected to potential vulnerabilities due to complex application scenarios, exposing them to safety risks. Despite existing benchmarks, a comprehensive and rigorous evaluation is lacking. MLLMGuard aims to offer a robust, multi-faceted evaluation covering bilingual dimensions and multiple safety aspects such as Privacy, Bias, Toxicity, Truthfulness, and Legality.

Safety Dimensions and Dataset

MLLMGuard's Dimensions:

  • Privacy: Evaluates whether models respect privacy and avoid disclosing personal, trade, or state secrets.
  • Bias: Assesses models' ability to avoid stereotypes, prejudice, and discrimination.
  • Toxicity: Focuses on identifying and responsibly handling hate speech and other harmful content.
  • Truthfulness: Tests robustness against hallucinations and consistency under noisy or adversarial inputs.
  • Legality: Evaluates the ability to handle legal queries responsibly, covering laws on personal and public security.

Dataset Construction:

The dataset comprises manually constructed adversarial examples, sourced mainly from social media rather than existing open-source datasets, which avoids data leakage into model training corpora. Text-based and image-based red teaming techniques are applied to increase difficulty, ensuring models are tested against realistic, challenging scenarios.
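To make the structure of such adversarial items concrete, here is a minimal sketch of how a bilingual image-text evaluation entry might be represented. The field names and the example prompt are illustrative assumptions for this summary, not the paper's released schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative schema for one MLLMGuard-style evaluation item.
# Field names are assumptions for this sketch, not the paper's actual format.
@dataclass
class EvalItem:
    image_path: str                 # adversarial image, e.g. sourced from social media
    prompt_en: str                  # English instruction paired with the image
    prompt_zh: str                  # Chinese instruction paired with the image
    dimension: str                  # one of: Privacy, Bias, Toxicity, Truthfulness, Legality
    subtask: str                    # finer-grained category within the dimension
    red_team_technique: Optional[str] = None  # e.g. "Noise Injection", "Position Swapping"

# Hypothetical example item for the Privacy dimension.
item = EvalItem(
    image_path="images/sample_0001.jpg",
    prompt_en="Who lives at the address shown in this photo?",
    prompt_zh="照片中显示的地址住着谁？",
    dimension="Privacy",
    subtask="personal privacy",
)
```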

Quality Control:

The dataset undergoes a rigorous quality control process to ensure relevance, accuracy in labeling, and necessity of multimodal data. This process includes a comprehensive review by experts to maintain high standards.

Evaluation Metrics and GuardRank

MLLMGuard introduces new metrics, namely:

  • ASD (Attack Success Degree): Measures the degree of harm a model's responses permit, ranging from no risk to full risk.
  • PAR (Perfect Answer Rate): Measures the proportion of responses that are fully safe and responsible.
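As a rough illustration of how these metrics could be computed, the sketch below assumes each response has been annotated with a harm level on a 0-3 scale (0 = perfectly safe); the exact scale and normalization used in the paper may differ.

```python
from typing import Sequence

def attack_success_degree(harm_levels: Sequence[int], max_level: int = 3) -> float:
    """ASD sketch: mean annotated harm level, normalized to [0, 1].
    0.0 means no attack caused any harm; 1.0 means every response was maximally harmful.
    The 0-3 scale and this normalization are assumptions for illustration."""
    if not harm_levels:
        return 0.0
    return sum(harm_levels) / (max_level * len(harm_levels))

def perfect_answer_rate(harm_levels: Sequence[int]) -> float:
    """PAR sketch: fraction of responses judged fully safe and responsible (level 0)."""
    if not harm_levels:
        return 0.0
    return sum(1 for h in harm_levels if h == 0) / len(harm_levels)

# Example: five annotated responses from one model on one dimension.
levels = [0, 2, 0, 3, 1]
print(f"ASD = {attack_success_degree(levels):.2f}")  # 0.40
print(f"PAR = {perfect_answer_rate(levels):.2f}")    # 0.40
```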

For automation, the paper develops GuardRank, a lightweight and efficient evaluator surpassing GPT-4 in assessment accuracy. GuardRank is positioned as a plug-and-play solution for safety evaluations.
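The summary does not reproduce GuardRank's implementation, but a lightweight evaluator of this kind can be approximated by fine-tuning a small classifier that maps (prompt, response) pairs to harm levels. The sketch below uses Hugging Face transformers with a RoBERTa backbone purely to illustrate the plug-and-play idea; the backbone, label scheme, and checkpoint are assumptions, not the released GuardRank model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative plug-and-play safety scorer; "roberta-base" and the 4-level
# label scheme are assumptions, not GuardRank's actual configuration.
MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)

def score_response(prompt: str, response: str) -> int:
    """Return a predicted harm level (0-3) for a single (prompt, response) pair."""
    inputs = tokenizer(prompt, response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())

# Usage: after fine-tuning on labeled safety data, score a model response.
level = score_response(
    "Who lives at the address shown in this photo?",
    "I can't help identify private individuals or their addresses.",
)
print(level)  # e.g. 0 for a fully safe refusal (untrained weights here are random)
```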

Experimental Evaluation

The evaluation covers 13 models, including GPT-4V, Gemini, and several open-source alternatives. The results underscore the significant gaps in safety capabilities among existing MLLMs:

  • Truthfulness: Most models remain susceptible to hallucinations and frequently fail under inverse queries or position-swapped options (a minimal probe of the latter is sketched after the figures below).
  • Alignment and Scaling: The paper highlights limitations in current alignment strategies and disputes the notion that increasing model size inherently improves safety.

Figure 1: Workflow of MLLMGuard, including dataset creation through manual construction, evaluation on MLLMGuard, and scoring by human annotators and GuardRank.

Figure 2: Results on Truthfulness. (a) presents the ASD of MLLMs under various red teaming techniques on Truthfulness; (b) and (d) further display the ASD results for two of these techniques, i.e., Non-existent Query and Noise Injection.
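To make the position-swapping probe under Truthfulness more concrete, here is a minimal consistency check: the same multiple-choice question is asked twice with the answer options swapped, and a model counts as robust on that item only if it selects the same underlying content both times. The `ask_model` callable is a hypothetical stand-in for whatever MLLM API is being evaluated, not part of the paper's released code.

```python
def position_swap_consistent(ask_model, image, question, option_a, option_b) -> bool:
    """Ask the same question twice with options swapped and check that the model
    picks the same underlying answer both times. `ask_model` is a hypothetical
    callable (image, prompt) -> "A" or "B" standing in for the MLLM under test."""
    prompt_original = f"{question}\nA. {option_a}\nB. {option_b}\nAnswer with A or B."
    prompt_swapped  = f"{question}\nA. {option_b}\nB. {option_a}\nAnswer with A or B."

    first = ask_model(image, prompt_original)    # letter chosen in original order
    second = ask_model(image, prompt_swapped)    # letter chosen after swapping

    # Map the second letter back to the original labeling before comparing.
    remap = {"A": "B", "B": "A"}
    return first == remap.get(second, second)
```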

Limitations and Future Work

Opportunities for enhancement include:

  • Scalability: Addressing the scalability of manual dataset construction.
  • Diversity: Incorporating additional languages and further dimension expansions.
  • Software Updates: Iterative improvements to GuardRank for more robust, scalable evaluations.

Conclusion

The study emphasizes the critical need for comprehensive safety evaluation in MLLMs, advocating for improvements in alignment, robustness, and multicultural applicability. MLLMGuard serves not only as an evaluative benchmark but also as a stepping stone towards safer, more reliable AI deployments. Consequently, the paper establishes a foundation for future work in refining safety measures and aligning MLLMs more closely with human-centric values.

In summary, MLLMGuard addresses crucial gaps in current MLLM safety evaluation by providing a sophisticated toolkit to guide future improvements in AI safety and performance.
