
ADV-LLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (2410.18469v4)

Published 24 Oct 2024 in cs.CL and cs.LG

Abstract: Recent research has shown that LLMs are vulnerable to automated jailbreak attacks, in which adversarial suffixes, crafted by algorithms and appended to harmful queries, bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
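
The abstract does not spell out the algorithm, but the phrase "iterative self-tuning" suggests a loop in which an attacker LLM proposes suffixes, successful attacks against a target model are collected, and the attacker is fine-tuned on its own successes before the next round. The sketch below illustrates that reading only; all function names and the refusal-based success check are assumptions for illustration, not the authors' implementation, and the model calls are replaced by stand-in stubs.

```python
# Hypothetical sketch of an iterative self-tuning loop for adversarial
# suffix generation, reconstructed from the abstract alone. The interfaces
# (generate_suffixes, query_target, fine_tune) are stand-in stubs.
import random

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")

def generate_suffixes(attacker, query, n=8):
    """Stub: sample n candidate query+suffix prompts from the attacker LLM."""
    return [f"{query} [suffix-{random.randint(0, 999)}]" for _ in range(n)]

def query_target(target, prompt):
    """Stub: return the target LLM's response to the prompt."""
    return random.choice(["I'm sorry, I can't help with that.", "Sure, here is ..."])

def is_jailbroken(response):
    """Crude success check (assumption): response does not start with a refusal."""
    return not response.startswith(REFUSAL_MARKERS)

def fine_tune(attacker, successes):
    """Stub: fine-tune the attacker on (query, successful prompt) pairs."""
    return attacker  # a real implementation would update the attacker's weights

def iterative_self_tuning(attacker, target, harmful_queries, rounds=3):
    for _ in range(rounds):
        successes = []
        for query in harmful_queries:
            for prompt in generate_suffixes(attacker, query):
                if is_jailbroken(query_target(target, prompt)):
                    successes.append((query, prompt))
        # Self-tuning step: the attacker learns from its own successful
        # attacks, so later rounds should yield stronger suffixes.
        attacker = fine_tune(attacker, successes)
    return attacker
```

Under this reading, the ASR figures quoted in the abstract would correspond to the fraction of harmful queries for which at least one generated suffix elicits a non-refusal from the target model.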

