BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks (2410.20971v2)
Abstract: In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or the language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit cross-modal information, or 2) they degrade model performance on benign inputs. To address these limitations, we propose BlueSuffix, a novel blue-team method that defends target VLMs against jailbreak attacks in the black-box setting without compromising their performance. BlueSuffix comprises three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator fine-tuned via reinforcement learning to enhance cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms baseline defenses by a significant margin. BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.
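To make the three-stage design concrete, below is a minimal sketch of how such a purify-purify-suffix pipeline could be wired together before a query reaches the target VLM. The component internals are assumptions for illustration only: the visual purifier is suggested by the diffusion-based adversarial purification cited below, the textual purifier by an LLM prompt rewriter, and the suffix generator by a model fine-tuned with a reinforcement-learning safety reward. All function and parameter names here are hypothetical, not the authors' API.

```python
from typing import Any, Callable, Tuple

def blue_suffix_defend(
    image: Any,
    text: str,
    purify_image: Callable[[Any], Any],          # hypothetical: e.g., diffusion-based image purification
    purify_text: Callable[[str], str],           # hypothetical: e.g., an LLM that rewrites away jailbreak phrasing
    generate_suffix: Callable[[Any, str], str],  # hypothetical: suffix generator fine-tuned with an RL safety reward
) -> Tuple[Any, str]:
    """Apply the three BlueSuffix-style stages to an (image, text) query
    before it is forwarded to the (black-box) target VLM."""
    clean_image = purify_image(image)   # 1) visual purifier: strip adversarial image perturbations
    clean_text = purify_text(text)      # 2) textual purifier: neutralize jailbreak wording
    # 3) blue-team suffix: append a learned safety suffix conditioned on both modalities
    suffix = generate_suffix(clean_image, clean_text)
    return clean_image, f"{clean_text} {suffix}"

# Example usage with trivial stand-ins (identity purifiers, fixed suffix):
if __name__ == "__main__":
    img, prompt = object(), "Describe this image."
    _, defended = blue_suffix_defend(
        img, prompt,
        purify_image=lambda x: x,
        purify_text=lambda t: t,
        generate_suffix=lambda i, t: "Please respond safely and refuse harmful requests.",
    )
    print(defended)
```

Because the defense only pre-processes inputs, it needs no access to the target model's weights or logits, which is what makes the black-box setting workable.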
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md, 2024.
- Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
- (Ab)using images and sounds for indirect instruction injection in multi-modal LLMs. arXiv preprint arXiv:2307.10490, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236, 2023.
- Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875, 2023.
- Defending against alignment-breaking attacks via robustly aligned LLM. arXiv preprint arXiv:2309.14348, 2023.
- Are aligned neural networks adversarially aligned? NeurIPS, 2024.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- Attack prompt generation for red teaming and defending large language models. arXiv preprint arXiv:2310.12505, 2023.
- MasterKey: Automated jailbreaking of large language model chatbots. In Proc. ISOC NDSS, 2024.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- One perturbation is enough: On generating universal adversarial perturbations against vision-language pre-training models. arXiv preprint arXiv:2406.05491, 2024.
- FigStep: Jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608, 2023.
- Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
- Exploiting programmatic behavior of llms: Dual-use through standard security attacks. In 2024 IEEE S&P Workshops, pp. 132–143. IEEE, 2024.
- Break the breakout: Reinventing LM defense against jailbreak attacks with self-refinement. arXiv preprint arXiv:2402.15180, 2024.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- DeepInception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191, 2023b.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023c.
- Visual instruction tuning. NeurIPS, 36, 2024a.
- Tiny refinements elicit resilience: Toward efficient prefix-model against LLM red-teaming. arXiv preprint arXiv:2405.12604, 2024b.
- Query-relevant images jailbreak large multi-modal models. arXiv preprint arXiv:2311.17600, 2023.
- Protecting your LLMs with information bottleneck. arXiv preprint arXiv:2404.13968, 2024c.
- JailBreakV-28K: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027, 2024.
- Diffusion models for adversarial purification. In ICML, 2022.
- Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309, 2024.
- Training language models to follow instructions with human feedback. NeurIPS, 2022.
- Visual adversarial examples jailbreak large language models. arXiv preprint arXiv:2306.13213, 2023.
- Language models are unsupervised multitask learners. OpenAI blog, 2019.
- Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763. PMLR, 2021.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023.
- SPML: A DSL for defending language models against prompt attacks. arXiv preprint arXiv:2402.11755, 2024.
- Distributional preference learning: Understanding and accounting for hidden context in RLHF. arXiv preprint arXiv:2312.08358, 2023.
- Exploring the adversarial capabilities of large language models. arXiv preprint arXiv:2402.09132, 2024.
- White-box multimodal jailbreaks against large vision-language models. arXiv preprint arXiv:2405.17894, 2024.
- GradSafe: Detecting unsafe prompts for LLMs via safety-critical gradient analysis. arXiv preprint arXiv:2402.13494, 2024.
- Defending jailbreak attack in VLMs via cross-modality information detector. arXiv preprint arXiv:2407.21659, 2024a.
- SafeDecoding: Defending against jailbreak attacks via safety-aware decoding. arXiv preprint arXiv:2402.08983, 2024b.
- Jailbreak vision language models via bi-modal adversarial prompt. arXiv preprint arXiv:2406.04031, 2024.
- Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446, 2023.
- GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463, 2023.
- How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. arXiv preprint arXiv:2401.06373, 2024a.
- AutoDefense: Multi-agent LLM defense against jailbreak attacks. arXiv preprint arXiv:2403.04783, 2024b.
- A mutation-based method for multi-modal jailbreaking attack detection. arXiv preprint arXiv:2312.10766, 2023.
- Intention analysis prompting makes large language models a good jailbreak defender. arXiv preprint arXiv:2401.06561, 2024.
- On prompt-driven safeguarding for large language models. In ICML, 2024.
- Robust prompt optimization for defending language models against jailbreaking attacks. arXiv preprint arXiv:2401.17263, 2024.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
- Is the system message really important to jailbreaks in large language models? arXiv preprint arXiv:2402.14857, 2024.