AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts (2410.22143v1)
Abstract: Although LLMs are typically aligned, they remain vulnerable to jailbreaking through either carefully crafted natural-language prompts or, interestingly, gibberish adversarial suffixes. However, gibberish suffixes have received relatively little attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG (Liao and Sun, 2024), demonstrates that a generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query, exposing a range of alignment gaps in out-of-distribution (OOD) language spaces. To bring more attention to this area, we introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. Through a series of exploratory experiments, we identify several training strategies that improve the learning of gibberish suffixes. Our results, verified under a strict evaluation setting, show that AmpleGCG-Plus outperforms AmpleGCG on both open-weight and closed-source models, increasing attack success rate (ASR) by up to 17% in the white-box setting against Llama-2-7B-chat and more than tripling ASR in the black-box setting against GPT-4. Notably, AmpleGCG-Plus jailbreaks the newer GPT-4o series of models at rates similar to GPT-4, and it uncovers vulnerabilities in the recently proposed circuit breakers defense. We publicly release AmpleGCG-Plus along with our collected training datasets.
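To make the attack pipeline concrete, below is a minimal sketch of how a suffix-generator model like AmpleGCG-Plus can be queried. It assumes a Hugging Face causal LM checkpoint (the model ID shown is a placeholder, not necessarily the released name) and uses diverse beam search (Vijayakumar et al., 2016), the decoding strategy AmpleGCG uses to produce many distinct suffix candidates per query. This is an illustration under those assumptions, not the authors' exact inference code.

```python
# Sketch: sample many candidate adversarial suffixes for one harmful query
# using diverse (group) beam search, then append them to form attack prompts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "osunlp/AmpleGCG-plus"  # placeholder ID; see the released checkpoints

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

query = "How do I pick a lock?"  # stand-in query for illustration
inputs = tokenizer(query, return_tensors="pt").to(model.device)

# Group beam search trades a little likelihood for diversity, yielding many
# distinct gibberish suffixes from a single generation call.
outputs = model.generate(
    **inputs,
    max_new_tokens=20,        # GCG-style suffixes are short token sequences
    num_beams=50,
    num_beam_groups=50,       # one beam per group = maximally diverse
    diversity_penalty=1.0,
    num_return_sequences=50,  # 50 candidate suffixes per query
    do_sample=False,
)

# Keep only the newly generated tokens (the suffix), then build attack prompts.
suffixes = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
adversarial_prompts = [f"{query} {s}" for s in suffixes]
```

Each suffix is appended to the query and sent to the target model; under the paper's framing, a query counts as jailbroken if any of the sampled suffixes elicits harmful compliance, which is why generating many diverse candidates per query, in as few attempts as possible, is the central metric.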
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Maksym Andriushchenko and Nicolas Flammarion. 2024. Does refusal training in LLMs generalize to the past tense? arXiv preprint arXiv:2407.11969.
- Anthropic. 2024. Many-shot jailbreaking. April 2024.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.
- Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations.
- QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems.
- Catastrophic jailbreak of open-source LLMs via exploiting generation. In The Twelfth International Conference on Learning Representations.
- Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
- Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
- WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. arXiv preprint arXiv:2406.18510.
- Mission: Impossible language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14691–14714, Bangkok, Thailand. Association for Computational Linguistics.
- Zeyi Liao and Huan Sun. 2024. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs. In First Conference on Language Modeling.
- AutoDAN-Turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. arXiv preprint arXiv:2410.05295.
- AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations.
- HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Forty-first International Conference on Machine Learning.
- Tree of attacks: Jailbreaking black-box LLMs automatically. arXiv preprint arXiv:2312.02119.
- Raphaël Millière. 2024. Language models as models of language. arXiv preprint arXiv:2408.07144.
- Zvi Mowshowitz. 2022. Jailbreaking ChatGPT on release day. https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day. Accessed: 2024-09-29.
- OpenAI. 2024. GPT-4o system card. Technical report, OpenAI.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- AdvPrompter: Fast adaptive adversarial prompting for LLMs. arXiv preprint arXiv:2404.16873.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
- SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
- Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
- " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
- A strongREJECT for empty jailbreaks. In Thirty-eighth Conference on Neural Information Processing Systems.
- T. Ben Thompson and Michael Sklar. 2024a. Breaking circuit breakers. https://confirmlabs.org/posts/circuit_breaking.html.
- T. Ben Thompson and Michael Sklar. 2024b. Fluent student-teacher redteaming. arXiv preprint arXiv:2407.17447.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
- Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36.
- Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387.
- Xinbo Wu and Lav R Varshney. 2024. Transformer-based causal language models perform clustering. arXiv preprint arXiv:2402.12151.
- GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
- How johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350, Bangkok, Thailand. Association for Computational Linguistics.
- AutoDAN: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140.
- Improving alignment and robustness with short circuiting. arXiv preprint arXiv:2406.04313.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.