Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
98 tokens/sec
GPT-4o
11 tokens/sec
Gemini 2.5 Pro Pro
52 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
15 tokens/sec
DeepSeek R1 via Azure Pro
33 tokens/sec
Gemini 2.5 Flash Deprecated
12 tokens/sec
2000 character limit reached

SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models (2410.18927v1)

Published 24 Oct 2024 in cs.CR

Abstract: Multimodal LLMs (MLLMs) are showing strong safety concerns (e.g., generating harmful outputs for users), which motivates the development of safety evaluation benchmarks. However, we observe that existing safety benchmarks for MLLMs show limitations in query quality and evaluation reliability limiting the detection of model safety implications as MLLMs continue to evolve. In this paper, we propose \toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol that aims to address the above limitations, respectively. We first design an automatic safety dataset generation pipeline, where we employ a set of LLM judges to recognize and categorize the risk scenarios that are most harmful and diverse for MLLMs; based on the taxonomy, we further ask these judges to generate high-quality harmful queries accordingly resulting in 23 risk scenarios with 2,300 multi-modal harmful query pairs. During safety evaluation, we draw inspiration from the jury system in judicial proceedings and pioneer the jury deliberation evaluation protocol that adopts collaborative LLMs to evaluate whether target models exhibit specific harmful behaviors, providing a reliable and unbiased assessment of content security risks. In addition, our benchmark can also be extended to the audio modality showing high scalability and potential. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs (e.g., GPT-4o, Gemini), where we revealed widespread safety issues in existing MLLMs and instantiated several insights on MLLM safety performance such as image quality and parameter size.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (103)
  1. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  2. Visual question answering instruction: Unlocking multimodal large language model to domain-specific visual multitasks. arXiv preprint arXiv:2402.08360, 2024.
  3. Image retrieval on real-life images with pre-trained vision-and-language models. In IEEE/CVF International Conference on Computer Vision, 2021.
  4. An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models. arXiv preprint arXiv:2403.09766, 2024.
  5. On the adversarial robustness of multi-modal foundation models. In IEEE/CVF International Conference on Computer Vision, 2023.
  6. On the robustness of large multimodal models against image adversarial attacks. arXiv preprint arXiv:2312.03777, 2023.
  7. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In International Conference on Learning Representations, 2024.
  8. Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms. arXiv preprint arXiv:2402.14872, 2024.
  9. Jailbreak vision language models via bi-modal adversarial prompt. arXiv preprint arXiv:2406.04031, 2024.
  10. Simpo: Simple preference optimization with a reference-free reward. CoRR, abs/2405.14734, 2024.
  11. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  12. Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862, 2022.
  13. Training language models to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  14. A comprehensive evaluation framework for deep model robustness. Pattern Recognition, 2023.
  15. Robustart: Benchmarking robustness on architecture design and training techniques. ArXiv, 2021.
  16. Training robust deep neural networks via adversarial noise propagation. TIP, 2021.
  17. Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning. arXiv preprint arXiv:2311.12075, 2023.
  18. Pre-trained trojan attacks for visual recognition. arXiv preprint arXiv:2312.15172, 2023.
  19. Poisoned forgery face: Towards backdoor attacks on face forgery detection. arXiv preprint arXiv:2402.11473, 2024.
  20. GPTFUZZER: red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
  21. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems, 2023.
  22. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  23. Unveiling the safety of gpt-4o: An empirical study using jailbreak attacks. CoRR, abs/2406.06302, 2024.
  24. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models, 2024.
  25. Figstep: Jailbreaking large vision-language models via typographic visual prompts. CoRR, abs/2311.05608, 2023.
  26. Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. CoRR, abs/2404.03027, 2024.
  27. OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card, 2024.
  28. Gemini: A family of highly capable multimodal models, 2024.
  29. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 3356–3369. Association for Computational Linguistics, 2020.
  30. Universal and transferable adversarial attacks on aligned language models. CoRR, abs/2307.15043, 2023.
  31. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3309–3326. Association for Computational Linguistics, 2022.
  32. Benchmarking cognitive biases in large language models as evaluators, 2024.
  33. Privlm-bench: A multi-level privacy evaluation benchmark for language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 54–73. Association for Computational Linguistics, 2024.
  34. Truthfulqa: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3214–3252. Association for Computational Linguistics, 2022.
  35. FFT: towards harmlessness evaluation and analysis for llms with factuality, fairness, toxicity. CoRR, abs/2311.18580, 2023.
  36. Sorry-bench: Systematically evaluating large language model safety refusal behaviors, 2024.
  37. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024.
  38. Benchmarking trustworthiness of multimodal large language models: A comprehensive study, 2024.
  39. Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions, 2024.
  40. How many unicorns are in this image? A safety evaluation benchmark for vision llms. CoRR, abs/2311.16101, 2023.
  41. Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks, 2024.
  42. Cross-modality safety alignment, 2024.
  43. Autodan: Generating stealthy jailbreak prompts on aligned large language models, 2024.
  44. Realtoxicityprompts: Evaluating neural toxic degeneration in language models, 2020.
  45. Visual adversarial examples jailbreak aligned large language models, 2023.
  46. Jigsaw. Perspective. [Online]. https://perspectiveapi.com/.
  47. Llama guard: Llm-based input-output safeguard for human-ai conversations, 2023.
  48. Jailbreaking black box large language models in twenty queries. CoRR, abs/2310.08419, 2023.
  49. Tree of attacks: Jailbreaking black-box llms automatically. CoRR, abs/2312.02119, 2023.
  50. Visual adversarial examples jailbreak aligned large language models. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada, pages 21527–21536. AAAI Press, 2024.
  51. Jailbreak vision language models via bi-modal adversarial prompt. CoRR, abs/2406.04031, 2024.
  52. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2024.
  53. Meta. Use-policy. [Online]. https://ai.meta.com/llama/use-policy/.
  54. An overview of recent approaches to enable diversity in large language models through aligning with human perspectives. In Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives)@ LREC-COLING 2024, pages 49–55, 2024.
  55. Can llms speak for diverse people? tuning llms via debate to generate controllable controversial statements. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 16160–16176. Association for Computational Linguistics, 2024.
  56. Planning in natural language improves llm search for code generation, 2024.
  57. Jailbreakeval: An integrated toolkit for evaluating jailbreak attempts against large language models. CoRR, abs/2406.09321, 2024.
  58. Masterkey: Automated jailbreaking of large language model chatbots. In Proceedings 2024 Network and Distributed System Security Symposium. Internet Society, 2024.
  59. Judging the judges: Evaluating alignment and vulnerabilities in llms-as-judges. CoRR, abs/2406.12624, 2024.
  60. Judging llm-as-a-judge with mt-bench and chatbot arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  61. Evaluating large language models at evaluating instruction following, 2024.
  62. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. CoRR, abs/2311.05232, 2023.
  63. Hallucination is inevitable: An innate limitation of large language models. CoRR, abs/2401.11817, 2024.
  64. Does refusal training in llms generalize to the past tense?, 2024.
  65. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want, 2024.
  66. Can large multimodal models uncover deep semantics behind images? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 1898–1912. Association for Computational Linguistics, 2024.
  67. Parler-tts. https://github.com/huggingface/parler-tts, 2024.
  68. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  69. Multimodal chain-of-thought reasoning in language models, 2024.
  70. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530, 2024.
  71. Cogvlm: Visual expert for pretrained language models. 2023.
  72. Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
  73. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.
  74. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  75. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
  76. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
  77. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  78. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
  79. Sharegpt4v: Improving large multi-modal models with better captions, 2023.
  80. Yi: Open foundation models by 01.ai, 2024.
  81. Anthropic. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet, 2024.
  82. Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024.
  83. Google DeepMind. Gemini pro. https://deepmind.google/technologies/gemini/pro/, 2024.
  84. Google DeepMind. Gemini flash. https://deepmind.google/technologies/gemini/flash/, 2024.
  85. OpenAI. Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024.
  86. AI@Meta. Llama 3 model card. 2024.
  87. Gemma Team. Gemma. 2024.
  88. Internlm2 technical report, 2024.
  89. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  90. Mixtral AI. A high quality sparse mixture-of-experts. https://mistral.ai/news/mixtral-of-experts/, 2023.
  91. Databricks. Introducing dbrx: A new state-of-the-art open llm. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm, 2024.
  92. X.AI. Grok-2 beta release. https://x.ai/blog/grok-2, 2024.
  93. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
  94. Qwen. Qwen2-vl-7b-instruct model card. https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct, 2024.
  95. Microsoft. Phi-3-vision-128k-instruct. online, 2024.
  96. Microsoft. Phi-3.5-vision-instruct. online, 2024.
  97. Jacob Cohen. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46, 1960.
  98. Hugging Face. Hugging face open llm leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.
  99. OpenCompass. Opencompass leaderboard. https://rank.opencompass.org.cn/leaderboard-llm/?m=24-07, 2024.
  100. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms, 2024.
  101. Mini-omni: Language models can hear, talk while thinking in streaming, 2024.
  102. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models, 2023.
  103. Qwen2-audio technical report, 2024.

Summary

We haven't generated a summary for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com