Jailbreaking Black Box Large Language Models in Twenty Queries (2310.08419v4)

Published 12 Oct 2023 in cs.LG and cs.AI

Abstract: There is growing interest in ensuring that LLMs align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.

Analysis of "Jailbreaking Black Box LLMs in Twenty Queries"

The paper introduces Prompt Automatic Iterative Refinement (PAIR), an algorithm designed to efficiently generate semantic jailbreaks for LLMs using only black-box access. PAIR leverages an attacker LLM to uncover vulnerabilities in a separate target LLM by iteratively refining adversarial prompts, often producing a successful jailbreak in fewer than twenty queries. The approach addresses a persistent tension in the field between interpretability and query efficiency, both of which are often lacking in existing token-level jailbreak techniques.

Methodology and Design

PAIR orchestrates a dialogue between two models: the attacker and the target. The attacker LLM generates candidate prompts, aiming to provoke responses from the target model that circumvent its safety constraints. The innovative aspect of PAIR lies in its capacity to automate the generation of meaningful, semantic prompts, bypassing the manual, labor-intensive processes typically required for prompt-level attacks. The algorithm operates through an iterative process comprising four steps: attack generation, target response, jailbreak scoring, and iterative refinement. The loop repeats until a successful jailbreak is achieved or a pre-defined query limit is reached.
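To make this concrete, the following is a minimal Python sketch of the four-step loop, written under stated assumptions rather than taken from the authors' implementation: the attacker, target, and judge callables are hypothetical stand-ins for the attacker LLM, the black-box target LLM, and the judge model (which, in the paper, scores responses on a 1-10 scale, with 10 indicating a full jailbreak).

from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces standing in for the three models involved in PAIR.
AttackerFn = Callable[[str, List[Tuple[str, str, int]]], str]  # (objective, history) -> candidate prompt
TargetFn = Callable[[str], str]                                # (prompt) -> target response
JudgeFn = Callable[[str, str, str], int]                       # (objective, prompt, response) -> score in 1..10

def pair_attack(
    objective: str,
    attacker: AttackerFn,
    target: TargetFn,
    judge: JudgeFn,
    max_queries: int = 20,
) -> Optional[Tuple[str, str]]:
    """Iteratively refine a candidate jailbreak for the given objective.

    Returns (prompt, response) on success, or None once the query budget is spent.
    """
    history: List[Tuple[str, str, int]] = []  # prior (prompt, response, score) shown to the attacker

    for _ in range(max_queries):
        # 1. Attack generation: the attacker proposes a prompt, conditioned on
        #    the objective and on how earlier attempts were scored.
        prompt = attacker(objective, history)

        # 2. Target response: query the target model as a black box.
        response = target(prompt)

        # 3. Jailbreak scoring: the judge rates how fully the response
        #    satisfies the objective.
        score = judge(objective, prompt, response)

        # 4. Iterative refinement: stop on a full jailbreak; otherwise record
        #    the attempt so the attacker can refine its next candidate.
        if score == 10:
            return prompt, response
        history.append((prompt, response, score))

    return None  # query limit reached without a jailbreak

In the paper's setup, the attacker is itself an LLM guided by a detailed system prompt, and several such conversations can be run in parallel to raise the overall success rate.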

Empirical Evaluation

The experimental results demonstrate the effectiveness of PAIR across a range of open- and closed-source LLMs, including GPT-3.5, GPT-4, Vicuna, and PaLM-2. PAIR achieved jailbreak success rates of around 60% on GPT-3.5 and GPT-4 while using an average of just over a dozen queries. This efficiency is a substantial improvement over token-level methods such as Greedy Coordinate Gradient (GCG), which demand hundreds of thousands of queries and considerable computational resources. PAIR's ability to produce effective, transferable attacks with minimal queries offers a practical advantage, particularly under computational and time constraints.

Implications and Future Directions

PAIR's contributions have both practical and theoretical implications. Practically, a streamlined, automated approach to uncovering model vulnerabilities can inform the development of more robust LLMs. Theoretically, PAIR invites further investigation into adversarial interactions between LLMs and their influence on model safety and alignment. Moreover, the success of semantic-level adversarial attacks underscores how difficult it is to align LLM behavior with human values, given these models' susceptibility to social-engineering-style tactics.

Looking ahead, natural directions include hardening LLMs against such semantic attacks and optimizing PAIR's components. For instance, varying the design of the attacker's system prompt or employing more capable attacker LLMs could significantly influence jailbreak effectiveness. Another avenue for future work is extending PAIR to multi-turn conversations or to broader applications beyond eliciting harmful content, reinforcing model reliability in diverse scenarios.

In conclusion, the introduction of PAIR advances the understanding of LLM vulnerabilities and establishes a promising framework for stress-testing these models in a more efficient and interpretable manner. As AI systems become increasingly integrated into various domains, addressing these challenges remains crucial to ensuring their safe and ethical deployment.

Authors (6)
  1. Patrick Chao
  2. Alexander Robey
  3. Edgar Dobriban
  4. Hamed Hassani
  5. George J. Pappas
  6. Eric Wong
Citations (396)