Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
60 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
8 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image (2403.02910v2)

Published 5 Mar 2024 in cs.CV and cs.AI

Abstract: There has been an increasing interest in the alignment of LLMs with human values. However, the safety issues of their integration with a vision module, or vision LLMs (VLMs), remain relatively underexplored. In this paper, we propose a novel jailbreaking attack against VLMs, aiming to bypass their safety barrier when a user inputs harmful instructions. A scenario where our poisoned (image, text) data pairs are included in the training data is assumed. By replacing the original textual captions with malicious jailbreak prompts, our method can perform jailbreak attacks with the poisoned images. Moreover, we analyze the effect of poison ratios and positions of trainable parameters on our attack's success rate. For evaluation, we design two metrics to quantify the success rate and the stealthiness of our attack. Together with a list of curated harmful instructions, a benchmark for measuring attack efficacy is provided. We demonstrate the efficacy of our attack by comparing it with baseline methods.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (29)
  1. Flamingo: a visual language model for few-shot learning. ArXiv preprint, abs/2204.14198.
  2. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
  3. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  4. Sharegpt4v: Improving large multi-modal models with better captions. ArXiv preprint, abs/2311.12793.
  5. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  6. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733.
  7. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566.
  8. Obelics: An open web-scale filtered dataset of interleaved image-text documents.
  9. M33{}^{3}start_FLOATSUPERSCRIPT 3 end_FLOATSUPERSCRIPTIT: A large-scale dataset towards multi-modal multilingual instruction tuning. ArXiv preprint, abs/2306.04387.
  10. Red teaming visual language models. arXiv preprint arXiv:2401.12915.
  11. Improved baselines with visual instruction tuning.
  12. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
  13. Visual instruction tuning. ArXiv preprint, abs/2304.08485.
  14. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  15. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
  16. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
  17. OpenAI. 2023. Gpt-4v(ision) system card.
  18. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  19. Visual adversarial examples jailbreak large language models. arXiv preprint arXiv:2306.13213.
  20. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763.
  21. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. ArXiv preprint, abs/2111.02114.
  22. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models.
  23. How many unicorns are in this image? a safety evaluation benchmark for vision llms. arXiv preprint arXiv:2311.16101.
  24. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
  25. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483.
  26. Watch out for your agents! investigating backdoor threats to llm-based agents. arXiv preprint arXiv:2402.11208.
  27. Weak-to-strong jailbreaking on large language models. arXiv preprint arXiv:2401.17256.
  28. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
  29. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Xijia Tao (2 papers)
  2. Shuai Zhong (4 papers)
  3. Lei Li (1293 papers)
  4. Qi Liu (485 papers)
  5. Lingpeng Kong (134 papers)
Citations (14)
X Twitter Logo Streamline Icon: https://streamlinehq.com