Defending Jailbreak Prompts via In-Context Adversarial Game (2402.13148v2)
Abstract: LLMs demonstrate remarkable capabilities across diverse applications, yet concerns about their security, particularly their vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and from LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG), which defends against jailbreaks without any fine-tuning. ICAG leverages agent learning to conduct an adversarial game, dynamically extending its knowledge of how to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process in which the defense and attack agents improve against each other; this continuous refinement strengthens defenses against newly generated jailbreak prompts. Our empirical studies confirm ICAG's efficacy: LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG transfers well to other LLMs, indicating its potential as a versatile defense mechanism.
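The abstract describes an iterative loop in which an attack agent and a defense agent improve against each other in context. As a rough illustration of that loop (a minimal sketch, not the authors' implementation; the function names, prompt wording, and the refusal-based success check below are all assumptions), the following Python code shows one way such an in-context game could be organized around a generic text-in/text-out LLM call.

```python
# Minimal sketch of an in-context adversarial game between an attack agent
# and a defense agent. Everything here (prompts, success check, loop shape)
# is an illustrative assumption, not the paper's actual implementation.
from typing import Callable, List

LLMFn = Callable[[str], str]  # assumed interface: any text-in/text-out LLM call


def is_jailbroken(response: str) -> bool:
    # Placeholder success check: a real setup would use a safety classifier
    # or an LLM judge; here we only look for an explicit refusal.
    refusals = ("i cannot", "i can't", "i'm sorry", "as an ai")
    return not any(r in response.lower() for r in refusals)


def adversarial_game(llm: LLMFn,
                     seed_prompts: List[str],
                     rounds: int = 3) -> List[str]:
    """Iteratively refine attack prompts and defensive insights in context."""
    attacks = list(seed_prompts)          # attack agent's current prompt pool
    defense_insights: List[str] = []      # defense agent's accumulated lessons

    for _ in range(rounds):
        # Defense agent: prepend accumulated insights as a system-style guard.
        guard = "Safety insights from past attacks:\n" + "\n".join(defense_insights)

        successful = []
        for attack in attacks:
            response = llm(f"{guard}\n\nUser: {attack}")
            if is_jailbroken(response):
                successful.append(attack)

        if not successful:
            break  # current insights block every attack in the pool

        # Defense agent reflects on successful attacks to extract a new insight.
        reflection = llm(
            "These prompts bypassed the safety guard:\n"
            + "\n".join(successful)
            + "\nSummarize one new defensive insight."
        )
        defense_insights.append(reflection)

        # Attack agent mutates its prompts to probe the updated defense.
        attacks = [
            llm(f"Rewrite this jailbreak attempt to evade the guard:\n{a}")
            for a in attacks
        ]

    return defense_insights
```

In a realistic setting the success check and the reflection step would themselves be handled by stronger judge and reflection agents; the sketch only conveys the structure of the iterative, fine-tuning-free game described in the abstract.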
Authors: Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang