
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs (2311.16101v1)

Published 27 Nov 2023 in cs.CV, cs.CL, and cs.LG

Abstract: This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs into producing visually unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language component of VLLMs. Our evaluation of 21 diverse models, ranging from open-source VLLMs to GPT-4V, yields interesting observations: 1) Current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) These VLLMs can be easily misled by deceiving vision encoders only, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vLLM-safety-benchmark.

Overview of Safety Evaluation Benchmark for Vision LLMs

The paper, How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs, outlines a nuanced approach to assessing the safety and robustness of Vision LLMs (VLLMs). This work differentiates itself from previous evaluations by emphasizing safety through a comprehensive suite of tests, covering both out-of-distribution (OOD) scenarios and adversarial robustness. The authors provide a detailed inspection of how VLLMs respond to unconventional inputs, aiming to ensure their secure integration into real-world applications.

Methodology

The paper introduces a two-pronged safety evaluation framework for VLLMs:

  1. Out-of-Distribution (OOD) Evaluation: The authors develop two novel Visual Question Answering (VQA) datasets—OODCV-VQA and Sketchy-VQA—each with a variant, designed to test VLLMs on atypical visual inputs. OODCV-VQA contains images with unusual textures or rarely seen objects, and its variant adds counterfactual descriptions to further challenge the models' comprehension. Sketchy-VQA, in turn, focuses on sketch images, probing models' ability to interpret minimalistic, abstract visual representations; its variant uses less common object categories to increase the difficulty. (A minimal answer-scoring sketch follows this list.)
  2. Red-Teaming Attacks: The paper also evaluates the adversarial robustness of VLLMs through red-teaming strategies. A novel attack that targets the CLIP ViT vision encoders used by many VLLMs is proposed to mislead models into generating outputs unrelated to the image. The authors additionally test jailbreaking strategies that attempt to induce toxic outputs, probing vulnerabilities in both the vision and language components of these models. (An illustrative sketch of this embedding-space attack also appears after the list.)
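
Since answers in the OOD sets are short and closed-form (for example, yes/no or a small count), a normalized exact-match score is a natural way to evaluate them. The sketch below illustrates this idea only; the `ask_model` callable and the sample field names are hypothetical placeholders, not the benchmark's actual interface.

```python
# Minimal sketch of exact-match scoring for a yes/no or counting VQA set.
# The sample format ({"image", "question", "answer"}) and the ask_model
# callable are hypothetical placeholders, not the benchmark's actual API.
import re

def normalize(answer: str) -> str:
    """Lower-case and keep the first word-like token, e.g. 'Yes.' -> 'yes'."""
    tokens = re.findall(r"[a-z0-9]+", answer.lower())
    return tokens[0] if tokens else ""

def evaluate(samples, ask_model):
    """samples: list of dicts with 'image', 'question', and 'answer' keys."""
    correct = sum(
        normalize(ask_model(s["image"], s["question"])) == normalize(str(s["answer"]))
        for s in samples
    )
    return correct / max(len(samples), 1)

# accuracy = evaluate(oodcv_vqa_samples, ask_model=my_vllm.answer)  # hypothetical
```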

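The vision-encoder attack can be illustrated with a generic PGD loop: perturb the input image within a small L_inf budget so that its CLIP image embedding moves toward that of an unrelated target image, which tends to make any VLLM built on that encoder describe the wrong content. This is a minimal sketch of the general technique under stated assumptions, not the authors' exact procedure; the model name, budget, and step count are illustrative.

```python
# PGD-style sketch of an embedding-space attack on a CLIP ViT vision encoder:
# push the image's embedding toward that of an unrelated "target" image while
# staying within an L_inf budget. Illustrative only, not the paper's exact setup.
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModelWithProjection

# CLIP preprocessing statistics (RGB mean / std used by OpenAI CLIP models).
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def embed(encoder, x01):
    """Embed an image batch with values in [0, 1] and shape (B, 3, 224, 224)."""
    return encoder(pixel_values=(x01 - MEAN) / STD).image_embeds

def pgd_mislead(encoder, image01, target01, eps=8 / 255, alpha=1 / 255, steps=100):
    """Return an adversarial image whose embedding approaches the target's."""
    for p in encoder.parameters():              # freeze encoder weights
        p.requires_grad_(False)
    with torch.no_grad():
        target_emb = F.normalize(embed(encoder, target01), dim=-1)
    delta = torch.zeros_like(image01, requires_grad=True)
    for _ in range(steps):
        adv_emb = F.normalize(embed(encoder, (image01 + delta).clamp(0, 1)), dim=-1)
        loss = -(adv_emb * target_emb).sum()     # negative cosine similarity
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend: raise similarity to target
            delta.clamp_(-eps, eps)              # project back into the L_inf ball
            delta.grad.zero_()
    return (image01 + delta.detach()).clamp(0, 1)

# encoder = CLIPVisionModelWithProjection.from_pretrained(
#     "openai/clip-vit-large-patch14").eval()
# adv_image = pgd_mislead(encoder, clean_image01, unrelated_image01)
```

In practice, the perturbed image is then fed to the full VLLM, and the attack counts as successful when the generated answer no longer relates to the original image content.
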
Key Findings

The paper evaluates 21 VLLMs, including prominent models such as GPT-4V, with the proposed framework and reports several critical insights:

  • VLLMs demonstrate robust performance with OOD visual inputs but struggle significantly with OOD textual inputs, highlighting how strongly the language input shapes VLLM behavior.
  • Current VLLMs, including GPT-4V, face challenges in interpreting sketches, suggesting limitations in their ability to process abstract or minimalist visual information.
  • The proposed CLIP ViT-based attacks are highly effective, revealing that most VLLMs can be misled or fail to reject misleading inputs.
  • Vision-based jailbreaking methods are not universally effective: simple misleading attacks yield confused outputs but cannot consistently elicit specific toxic content.
  • Vision-language training appears to undermine the safety protocols established in the underlying LLMs, with most VLLMs exhibiting weaker defenses than their text-only LLM counterparts.

Implications and Future Developments

The research underscores significant implications for the application of VLLMs in real-world environments. The revealed weaknesses in handling OOD data and adversarial inputs highlight critical areas where VLLM technology could be vulnerable. Future research must focus on enhancing safety protocols during the vision-language training phase to mitigate these vulnerabilities.

Moreover, as the integration of VLLMs becomes more prevalent across applications, the development of more robust methodologies for evaluating model safety is imperative. Ensuring the alignment of VLLMs with rigorous safety standards is particularly important, not only for technical robustness but also for maintaining ethical standards as these models interact more deeply with users in various societal contexts.

This paper contributes significantly to the discourse by shedding light on these areas and advocating for comprehensive safety evaluation frameworks that can adapt alongside advancements in VLLM technology. The release of the proposed benchmark dataset will undoubtedly serve as a valuable resource for the ongoing development and fortification of VLLMs in AI research.

Authors (9)
  1. Haoqin Tu (25 papers)
  2. Chenhang Cui (14 papers)
  3. Zijun Wang (22 papers)
  4. Yiyang Zhou (33 papers)
  5. Bingchen Zhao (46 papers)
  6. Junlin Han (23 papers)
  7. Wangchunshu Zhou (73 papers)
  8. Huaxiu Yao (103 papers)
  9. Cihang Xie (91 papers)
Citations (52)