
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks (2410.20971v2)

Published 28 Oct 2024 in cs.CV, cs.AI, and cs.LG

Abstract: In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or the language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit cross-modal information, or 2) they degrade the model's performance on benign inputs. To address these limitations, we propose a novel blue-team method, BlueSuffix, that defends target VLMs against jailbreak attacks without compromising their performance under the black-box setting. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator trained with reinforcement fine-tuning to enhance cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms baseline defenses by a significant margin. BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.
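
As a rough illustration of the pipeline the abstract describes, the sketch below wires the three components around an unmodified black-box VLM: purify the image, purify the text, append a learned blue-team suffix, then query the target model. All class and parameter names here are assumptions made for illustration only; they are not taken from the authors' repository, and the actual purifiers and the reinforcement fine-tuned suffix generator are substantially more involved.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class BlueSuffixStyleDefense:
    # Illustrative component interfaces (hypothetical, not the authors' API):
    visual_purifier: Callable[[Any], Any]    # e.g. a diffusion-based image purifier
    textual_purifier: Callable[[str], str]   # e.g. an LLM that rewrites/sanitizes the prompt
    suffix_generator: Callable[[str], str]   # produces a defensive "blue-team" suffix
    target_vlm: Callable[[Any, str], str]    # black-box VLM: (image, prompt) -> response

    def respond(self, image: Any, prompt: str) -> str:
        clean_image = self.visual_purifier(image)      # 1) purify a possibly adversarial image
        clean_prompt = self.textual_purifier(prompt)   # 2) purify a possibly adversarial prompt
        suffix = self.suffix_generator(clean_prompt)   # 3) generate the defensive suffix
        defended_prompt = f"{clean_prompt} {suffix}"   # 4) append the suffix to the purified prompt
        return self.target_vlm(clean_image, defended_prompt)  # 5) query the unmodified black-box VLM
```

Because the target VLM is only queried through its input interface, a pipeline of this shape stays black-box: no weights, gradients, or logits of the defended model are needed at inference time.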
