Safety Alignment for Vision Language Models (2405.13581v1)
Abstract: Benefiting from the powerful capabilities of LLMs, a pre-trained visual encoder connected to an LLM realizes a Vision LLM (VLM). However, existing research shows that the visual modality of VLMs is vulnerable: attackers can easily bypass the LLM's safety alignment through visual-modality features to launch attacks. To address this issue, we strengthen the visual-modality safety alignment of existing VLMs by adding safety modules, including a safety projector, safety tokens, and a safety head, trained through a two-stage process, effectively improving the model's defense against risky images. For example, building upon LLaVA-v1.5, we achieve a safety score of 8.26, surpassing GPT-4V on the Red Teaming Visual LLMs (RTVLM) benchmark. Our method offers ease of use, high flexibility, and strong controllability, and it enhances safety with minimal impact on the model's general performance. Moreover, our alignment strategy also uncovers possibly risky content within commonly used open-source multimodal datasets. Our code will be open-sourced after the anonymous review.
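To make the abstract's architecture concrete, the following is a minimal PyTorch sketch of how a safety projector, learnable safety tokens, and a safety head might attach to a LLaVA-style model. The module dimensions, the mean-pooled single-logit risk head, and the placement of the safety tokens are illustrative assumptions, not the paper's exact implementation; the two-stage training procedure is likewise not shown.

```python
# Minimal sketch of the safety-module idea described in the abstract, assuming a
# LLaVA-style architecture. Module shapes, names, and how the safety head is applied
# are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class SafetyAugmentedVLM(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, num_safety_tokens=8, hidden_dim=512):
        super().__init__()
        # Standard projector mapping visual-encoder features into the LLM embedding
        # space (as in LLaVA-v1.5).
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Added safety projector: a parallel projection of the visual features whose
        # output feeds the safety head (assumed wiring).
        self.safety_projector = nn.Linear(vision_dim, llm_dim)
        # Added learnable safety tokens concatenated with the visual tokens so the LLM
        # can attend to safety-specific context (assumed placement).
        self.safety_tokens = nn.Parameter(torch.randn(num_safety_tokens, llm_dim) * 0.02)
        # Added safety head: scores how risky the visual input is; here a simple MLP
        # over mean-pooled safety features producing a single risk logit (assumption).
        self.safety_head = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vision_dim) from the visual encoder.
        visual_embeds = self.projector(vision_features)           # (B, P, llm_dim)
        safety_embeds = self.safety_projector(vision_features)    # (B, P, llm_dim)
        risk_logit = self.safety_head(safety_embeds.mean(dim=1))  # (B, 1) risk score
        # Expand the learned safety tokens across the batch and prepend them to the
        # projected visual tokens; the combined sequence would then go to the LLM.
        batch_tokens = self.safety_tokens.unsqueeze(0).expand(vision_features.size(0), -1, -1)
        llm_input = torch.cat([batch_tokens, visual_embeds], dim=1)
        return llm_input, risk_logit


# Quick shape check with random features standing in for the vision encoder output.
if __name__ == "__main__":
    model = SafetyAugmentedVLM()
    feats = torch.randn(2, 576, 1024)  # 576 patches, as in LLaVA-v1.5's ViT-L/14 @ 336px
    tokens, risk = model(feats)
    print(tokens.shape, risk.shape)    # torch.Size([2, 584, 4096]) torch.Size([2, 1])
```

In this reading, the risk logit could gate or condition the model's response to risky images, while the safety tokens and projector are the lightweight components tuned during alignment; the exact use of these signals is left to the paper.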
- Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- Image hijacks: Adversarial images can control generative models at runtime, 2023.
- Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b.
- Image safeguarding: Reasoning with conditional vision language model and obfuscating unsafe content counterfactually, 2024.
- Honeybee: Locality-enhanced projector for multimodal llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Antifakeprompt: Prompt-tuned vision-language models are fake image detectors, 2023.
- Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
- Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
- Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
- Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, 2022.
- Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024.
- Inducing high energy-latency of large vision-language models with verbose images, 2024.
- Figstep: Jailbreaking large vision-language models via typographic visual prompts, 2023.
- Onellm: One framework to align all modalities with language. arXiv preprint arXiv:2312.03700, 2023.
- Lora: Low-rank adaptation of large language models, 2021.
- Kim, A. Nsfw data scraper. https://github.com/alex000kim/nsfw_data_scraper, 2021.
- A hierarchical approach for generating descriptive image paragraphs, 2017.
- Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023a.
- Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023b.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888–12900. PMLR, 2022.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp. 19730–19742. PMLR, 2023c.
- Red teaming visual language models, 2024.
- Vl-trojan: Multimodal instruction backdoor attacks against autoregressive visual language models, 2024.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
- Improved baselines with visual instruction tuning, 2023a.
- Visual instruction tuning, 2023b.
- Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Visual instruction tuning. Advances in neural information processing systems, 36, 2024b.
- Mm-safetybench: A benchmark for safety evaluation of multimodal large language models, 2024c.
- Mmbench: Is your multi-modal model an all-around player?, 2023c.
- Deep learning face attributes in the wild, 2015.
- Stable bias: Evaluating societal representations in diffusion models. In Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 56338–56351. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/b01153e7112b347d8ed54f317840d8af-Paper-Datasets_and_Benchmarks.pdf.
- OpenAI. Gpt-4 technical report, 2024.
- QResearch. llama3-vision-alpha. https://huggingface.co/qresearch/llama-3-vision-alpha, 2024.
- Learning transferable visual models from natural language supervision, 2021.
- Xstest: A test suite for identifying exaggerated safety behaviours in large language models, 2024.
- How many unicorns are in this image? a safety evaluation benchmark for vision llms, 2023.
- Towards understanding and detecting cyberbullying in real-world images. In Proceedings of the 28th Annual Network and Distributed System Security Symposium. Internet Society, 2021.
- Knowledge mining with scene text for fine-grained recognition, 2022.
- Jailbroken: How does llm safety training fail?, 2023.
- Adversarial prompt tuning for vision-language models, 2023a.
- A mutation-based method for multi-modal jailbreaking attack detection, 2023b.
- Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802, 2023c.
- Privacyalert: A dataset for image privacy prediction. Proceedings of the International AAAI Conference on Web and Social Media, 16(1):1352–1361, May 2022. doi: 10.1609/icwsm.v16i1.19387. URL https://ojs.aaai.org/index.php/ICWSM/article/view/19387.
- On evaluating adversarial robustness of large vision-language models, 2023.
- Mquake: Assessing knowledge editing in language models via multi-hop questions, 2023.
- Safety fine-tuning at (almost) no cost: A baseline for vision large language models, 2024.
- Universal and transferable adversarial attacks on aligned language models, 2023.
Authors: Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng