Can Large Language Models Automatically Jailbreak GPT-4V?
Abstract: GPT-4V has attracted considerable attention for its extraordinary capacity to integrate and process multimodal information. At the same time, its face-recognition capability raises new safety concerns about privacy leakage. Despite safety-alignment efforts such as RLHF and preprocessing filters, vulnerabilities may still be exploited. In our study, we introduce AutoJailbreak, an automatic jailbreak technique inspired by prompt optimization. We leverage LLMs for red-teaming to refine the jailbreak prompt and employ weak-to-strong in-context learning prompts to boost efficiency. Furthermore, we present an efficient search method that incorporates early stopping to minimize optimization time and token expenditure. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%. This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.
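The abstract describes an optimization loop: a red-teaming LLM iteratively refines a jailbreak prompt against a target model, guided by weak-to-strong in-context examples, with early stopping to bound time and token cost. Below is a minimal sketch of such a loop. All function names, signatures, the judge heuristic, and the default hyperparameters are assumptions for illustration only; the paper's actual prompts, models, and scoring are not reproduced here.

```python
# Hypothetical sketch of an AutoJailbreak-style optimization loop.
# The attacker/target/judge callables are placeholders that would
# wrap real model APIs in practice.
from typing import Callable, List, Tuple

def autojailbreak(
    attacker: Callable[[str], str],      # red-team LLM: refinement request -> new prompt
    target: Callable[[str], str],        # victim model (e.g., GPT-4V's text interface)
    judge: Callable[[str, str], float],  # (prompt, response) -> success score in [0, 1]
    seed_prompt: str,
    demos: List[Tuple[str, str]],        # weak-to-strong in-context examples, ordered weak -> strong
    max_iters: int = 20,
    patience: int = 3,                   # early stopping: quit after this many non-improving rounds
    success_threshold: float = 0.9,
) -> Tuple[str, float]:
    """Iteratively refine a jailbreak prompt with an attacker LLM,
    stopping early once the judge score exceeds the success threshold
    or stops improving."""
    best_prompt = seed_prompt
    best_score = judge(seed_prompt, target(seed_prompt))
    stall = 0
    for _ in range(max_iters):
        if best_score >= success_threshold:
            break  # attack already judged successful
        # Weak-to-strong in-context learning: show the attacker
        # progressively stronger past examples before asking for a rewrite.
        context = "\n".join(f"PROMPT: {p}\nRESPONSE: {r}" for p, r in demos)
        refinement_request = (
            f"{context}\n\nCurrent jailbreak prompt (score {best_score:.2f}):\n"
            f"{best_prompt}\n\nRewrite it to be more likely to succeed."
        )
        candidate = attacker(refinement_request)
        score = judge(candidate, target(candidate))
        if score > best_score:
            best_prompt, best_score, stall = candidate, score, 0
        else:
            stall += 1
            if stall >= patience:
                break  # early stopping: no improvement, save tokens and time
    return best_prompt, best_score
```

In a real red-teaming setup, `attacker`, `target`, and `judge` would be thin wrappers around model API calls; the early-stopping counter is what the abstract credits with reducing optimization time and token expenditure.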