
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models (2407.21659v4)

Published 31 Jul 2024 in cs.CL

Abstract: Multimodal LLMs (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively, achieving remarkable performance in many vision-centric tasks. Despite this, recent studies have shown that these models are susceptible to jailbreak attacks, an exploitative technique in which malicious users break the safety alignment of the target model and elicit misleading or harmful answers. This threat stems both from the inherent vulnerabilities of LLMs and from the larger attack surface introduced by visual input. To strengthen MLLMs against jailbreak attacks, researchers have developed various defense techniques. However, these methods either require modifications to the model's internal structure or demand significant computational resources during inference. Multimodal information is a double-edged sword: while it increases the risk of attacks, it also provides additional data that can enhance safeguards. Inspired by this, we propose the Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector that identifies maliciously perturbed image inputs using the cross-modal similarity between harmful queries and adversarial images. CIDER is independent of the target MLLM and incurs low computational cost. Extensive experiments demonstrate the effectiveness and efficiency of CIDER, as well as its transferability to both white-box and black-box MLLMs.
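
The abstract outlines the core mechanism: measure how semantically related the text query and the image are in a shared embedding space, and use anomalous cross-modal similarity as a signal that the image may have been adversarially perturbed. The snippet below is a minimal sketch of that measurement step only, assuming a CLIP-style joint encoder from Hugging Face Transformers ("openai/clip-vit-base-patch32") and a hypothetical, uncalibrated THRESHOLD; the function names, the decision direction, and the threshold are illustrative assumptions, not the authors' implementation, and CIDER's full pipeline is not reproduced here.

```python
# Minimal sketch of a cross-modal similarity check in the spirit of CIDER.
# Assumptions (not from the paper): a CLIP-style encoder supplies the shared
# text-image embedding space, and THRESHOLD is a hypothetical decision boundary.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

THRESHOLD = 0.25  # hypothetical; would need tuning on held-out clean vs. adversarial data


def cross_modal_similarity(query: str, image: Image.Image) -> float:
    """Cosine similarity between the text query and the image in the joint space."""
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Explicitly L2-normalize, then take the dot product (cosine similarity).
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((t * v).sum(dim=-1))


def flag_image(query: str, image: Image.Image) -> bool:
    """Screen a (query, image) pair before it is forwarded to the target MLLM.

    The direction of this rule is an assumption; the paper's actual criterion
    (and any preprocessing such as denoising) may differ.
    """
    return cross_modal_similarity(query, image) > THRESHOLD
```

In the paper's framing, a check of this kind sits in front of the target MLLM, so flagged image inputs can be rejected or sanitized before generation, which is consistent with the plug-and-play, model-independent design the abstract describes.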

Authors (4)
  1. Yue Xu (79 papers)
  2. Xiuyuan Qi (2 papers)
  3. Zhan Qin (54 papers)
  4. Wenjie Wang (150 papers)
Citations (2)