ADV-LLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (2410.18469v4)
Abstract: Recent research has shown that LLMs are vulnerable to automated jailbreak attacks, in which algorithmically crafted adversarial suffixes appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and achieve low Attack Success Rates (ASR), especially against well-aligned models such as Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
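The "iterative self-tuning" the abstract describes can be sketched as a loop that samples candidate suffixes, keeps those that elicit non-refusing responses from a target model, and fine-tunes the attacker on the successes. The sketch below is a minimal toy illustration under that reading, not the authors' implementation: `generate_suffixes`, `target_response`, `fine_tune`, and the refusal-prefix check are all hypothetical stand-ins.

```python
import random

# Common refusal prefixes; checking for them is a simple ASR heuristic,
# assumed here rather than taken from the paper.
REFUSALS = ("I'm sorry", "I cannot")

def generate_suffixes(model_state, query, n=8):
    # Stand-in: sample n candidate adversarial suffixes from the attacker model.
    return [f"{query} ::{model_state}-{random.randint(0, 999)}" for _ in range(n)]

def target_response(prompt):
    # Stand-in for querying the victim model (e.g. Llama3).
    return "Sure, here is" if "::" in prompt else "I'm sorry"

def is_jailbroken(response):
    # A prompt counts as successful if the response does not start with a refusal.
    return not response.startswith(REFUSALS)

def fine_tune(model_state, successes):
    # Stand-in: update the attacker on suffixes that succeeded.
    return model_state + 1

def self_tune(query, iterations=3):
    # Iterative self-tuning loop: generate, filter by success, fine-tune, repeat.
    model_state = 0
    for _ in range(iterations):
        candidates = generate_suffixes(model_state, query)
        successes = [c for c in candidates if is_jailbroken(target_response(c))]
        if successes:
            model_state = fine_tune(model_state, successes)
    return model_state
```

In a real pipeline the stubs would be replaced by actual LLM sampling and fine-tuning calls; the point here is only the generate-filter-retrain control flow.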