AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts (2410.22143v1)

Published 29 Oct 2024 in cs.CL

Abstract: Although LLMs are typically aligned, they remain vulnerable to jailbreaking through either carefully crafted prompts in natural language or, interestingly, gibberish adversarial suffixes. However, gibberish tokens have received relatively less attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG (Liao and Sun, 2024), demonstrates that a generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query, exposing a range of alignment gaps in out-of-distribution (OOD) language spaces. To bring more attention to this area, we introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. Through a series of exploratory experiments, we identify several training strategies to improve the learning of gibberish suffixes. Our results, verified under a strict evaluation setting, show that it outperforms AmpleGCG on both open-weight and closed-source models, achieving increases in attack success rate (ASR) of up to 17% in the white-box setting against Llama-2-7B-chat, and more than tripling ASR in the black-box setting against GPT-4. Notably, AmpleGCG-Plus jailbreaks the newer GPT-4o series of models at similar rates to GPT-4, and uncovers vulnerabilities against the recently proposed circuit breakers defense. We publicly release AmpleGCG-Plus along with our collected training datasets.

Analysis of "AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts"

The paper presents AmpleGCG-Plus, an enhanced version of the AmpleGCG model designed to efficiently generate gibberish adversarial suffixes that exploit alignment gaps in LLMs. While most jailbreak research has focused on natural-language adversarial prompts, gibberish suffixes offer a distinct attack surface: they achieve a high success rate in bypassing safety guardrails while requiring fewer attempts than previous approaches.

Key Innovations and Results

AmpleGCG-Plus builds upon the initial groundwork established by AmpleGCG, introducing several modifications aimed at optimizing the production of gibberish adversarial suffixes. The enhancements include:

  1. Model Optimization: The model leverages pre-trained LLMs, significantly improving performance over models initialized from scratch. Findings suggest that pre-training enhances the model's ability to generate unnatural suffixes, leveraging clustering capabilities developed during instruction tuning.
  2. Increased Data Utilization: By utilizing a substantially larger training dataset compared to AmpleGCG, AmpleGCG-Plus harnesses a broader spectrum of examples to improve model generalization.
  3. Improved Classifiers for Filtering: The paper emphasizes the use of stricter classification criteria for filtering training data, significantly reducing false positives in jailbreak evaluations. This change enhances the reliability of the generated suffixes (a high-level sketch of the resulting overgenerate-then-filter recipe follows this list).
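
Conceptually, these pieces fit into an overgenerate-then-filter recipe for collecting training data. The sketch below illustrates that recipe only at a high level: generate_candidates, query_target, and is_harmful are hypothetical placeholder callables standing in for a suffix sampler, the target LLM, and a stricter harmfulness classifier; they are not the authors' released code.

```python
def build_training_pairs(harmful_queries, generate_candidates, query_target, is_harmful, n_candidates=200):
    """Collect (query, suffix) pairs whose completions a strict classifier flags as jailbreaks.

    The three callables are hypothetical placeholders (suffix sampler, target LLM,
    harmfulness classifier), not the paper's actual components.
    """
    pairs = []
    for query in harmful_queries:
        # Overgenerate: sample many candidate gibberish suffixes per query.
        for suffix in generate_candidates(query, n=n_candidates):
            response = query_target(f"{query} {suffix}")
            # Stricter filtering: keep the pair only if the classifier judges the
            # response an actual jailbreak, cutting false positives that simple
            # refusal-keyword checks would let through.
            if is_harmful(query, response):
                pairs.append({"query": query, "suffix": suffix})
    return pairs
```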

These optimizations result in a notable increase in attack success rate (ASR). For example, against Llama-2-7B-Chat in the white-box setting, AmpleGCG-Plus achieves up to a 17% higher ASR than its predecessor, and against GPT-4 in the black-box setting it more than triples the ASR.
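
For context on how "fewer attempts" is typically quantified, the snippet below computes ASR at a sampling budget k: a query counts as jailbroken if at least one of its first k sampled suffixes elicits a harmful response. This is a generic formulation for illustration, not the paper's evaluation code.

```python
def asr_at_k(success_records, k):
    """Attack success rate at budget k.

    success_records: one list of booleans per query, one entry per sampled
    suffix in the order it was tried (generic formulation for illustration).
    """
    jailbroken = sum(1 for record in success_records if any(record[:k]))
    return jailbroken / len(success_records)

# Example: 3 queries, 4 sampled suffixes each.
records = [
    [False, False, True, False],   # succeeds on the 3rd attempt
    [False, False, False, False],  # never succeeds
    [True, False, False, False],   # succeeds immediately
]
print(asr_at_k(records, k=1))  # 0.33...
print(asr_at_k(records, k=4))  # 0.66...
```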

Implications for Future Research

AmpleGCG-Plus reveals critical insights into the vulnerabilities of LLMs, specifically underscoring the limitations of current alignment strategies against non-semantic adversarial tactics. Such vulnerabilities highlight the necessity for more comprehensive alignment and safety measures beyond traditional natural language processing paradigms.

The model also remains effective against newer defense mechanisms such as circuit breakers, indicating that underlying vulnerabilities persist in LLM architectures regardless of how sophisticated the applied safety measures are. This points to the need for defenses that can identify and mitigate gibberish or otherwise non-standard input patterns.
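
One widely discussed defense of this kind is perplexity filtering, which exploits the fact that gibberish suffixes tend to have far higher perplexity under a reference language model than natural prompts (see the baseline-defenses work in the references). A minimal sketch follows, assuming GPT-2 as the reference model and an arbitrary threshold; both choices are illustrative rather than a configuration evaluated in the paper.

```python
# Minimal perplexity-filter sketch: flag prompts whose perplexity under a small
# reference LM exceeds a threshold. GPT-2 and the threshold are illustrative
# choices, not a defense configuration from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def flag_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    """Return True if the prompt looks like a high-perplexity (gibberish) attack."""
    return perplexity(prompt) > threshold

print(flag_prompt("Tell me a story about a friendly robot."))            # likely False
print(flag_prompt("robot (#describing.\\ + similarlyNow oppositeley.]("))  # likely True
```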

In practical terms, AmpleGCG-Plus offers a potent tool for red-teaming exercises in AI safety, presenting an efficient method for stress-testing LLM responses beyond conventional prompts. As LLM deployment becomes widespread, tools like AmpleGCG-Plus will be pivotal in ensuring that model outputs remain aligned with human values, especially for applications in sensitive domains.
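
In such a workflow, the practical appeal is that many candidate suffixes can be sampled cheaply from the generator for each query and then screened against the target model. The sketch below shows this general pattern using Hugging Face group (diverse) beam search; the checkpoint path is a placeholder and the decoding hyperparameters are illustrative, not the released AmpleGCG-Plus configuration.

```python
# Sketch: drawing many diverse suffix candidates from a suffix-generator model
# with group (diverse) beam search. The checkpoint path is a placeholder;
# substitute a released suffix-generator checkpoint. Hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

SUFFIX_GENERATOR = "path/to/suffix-generator"  # placeholder, not a real model ID
tokenizer = AutoTokenizer.from_pretrained(SUFFIX_GENERATOR)
model = AutoModelForCausalLM.from_pretrained(SUFFIX_GENERATOR)

query = "example red-teaming query from your evaluation set"
inputs = tokenizer(query, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=20,        # gibberish suffixes are short token sequences
    num_beams=50,             # total beams
    num_beam_groups=10,       # group beam search for diversity across groups
    diversity_penalty=1.0,    # discourage groups from repeating each other's tokens
    num_return_sequences=50,  # overgenerate: return every beam
    do_sample=False,          # group beam search is deterministic
)
prompt_len = inputs["input_ids"].shape[1]
suffixes = [
    tokenizer.decode(out[prompt_len:], skip_special_tokens=True)
    for out in outputs
]
# Each query + suffix would then be sent to the target model and judged by a
# harmfulness classifier (as in the filtering sketch earlier).
```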

Challenges and Future Directions

The paper acknowledges its reliance on current LLM-based evaluators to judge harmful outputs, suggesting room for improvement in classifier robustness and accuracy. It also hints at broader applications for gibberish suffix generators, including integration with other jailbreak tactics such as AutoDAN or PAIR.

A prospective research avenue involves extending the overgenerate-then-filter pipeline to encompass diverse LLM architectures, further examining the transferability of these adversarial suffixes across different model configurations. Moreover, exploring synergies between gibberish suffixes and semantic adversarial methods could yield holistic approaches to probing LLM vulnerabilities, potentially culminating in more resilient safety architectures.

In summary, AmpleGCG-Plus marks a significant advancement in the study of adversarial attacks on LLMs, offering practical tools for security assessments and sparking further inquiry into the resilience of LLMs against unconventional inputs.

References (41)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Maksym Andriushchenko and Nicolas Flammarion. 2024. Does refusal training in llms generalize to the past tense? arXiv preprint arXiv:2407.11969.
  3. Many-shot jailbreaking. Anthropic, April.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  5. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  6. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  7. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.
  8. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations.
  9. QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems.
  10. Catastrophic jailbreak of open-source LLMs via exploiting generation. In The Twelfth International Conference on Learning Representations.
  11. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
  12. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
  13. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. arXiv preprint arXiv:2406.18510.
  14. Mission: Impossible language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14691–14714, Bangkok, Thailand. Association for Computational Linguistics.
  15. Zeyi Liao and Huan Sun. 2024. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs. In First Conference on Language Modeling.
  16. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. arXiv preprint arXiv:2410.05295.
  17. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations.
  18. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In Forty-first International Conference on Machine Learning.
  19. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119.
  20. Raphael Milliere. 2024. Language models as models of language. arXiv preprint arXiv:2408.07144.
  21. Zvi Mowshowitz. 2022. Jailbreaking ChatGPT on release day. https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day. Accessed: 2024-09-29.
  22. OpenAI. 2024. Gpt-4o system card. Technical report, OpenAI.
  23. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
  24. Advprompter: Fast adaptive adversarial prompting for llms. arXiv preprint arXiv:2404.16873.
  25. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
  26. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
  27. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
  28. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
  29. A strongREJECT for empty jailbreaks. In Thirty-eighth Conference on Neural Information Processing Systems.
  30. T. Ben Thompson and Michael Sklar. 2024a. Breaking circuit breakers. https://confirmlabs.org/posts/circuit_breaking.html.
  31. T Ben Thompson and Michael Sklar. 2024b. Fluent student-teacher redteaming. arXiv preprint arXiv:2407.17447.
  32. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  33. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  34. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36.
  35. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387.
  36. Xinbo Wu and Lav R Varshney. 2024. Transformer-based causal language models perform clustering. arXiv preprint arXiv:2402.12151.
  37. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
  38. How johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350, Bangkok, Thailand. Association for Computational Linguistics.
  39. Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140.
  40. Improving alignment and robustness with short circuiting. arXiv preprint arXiv:2406.04313.
  41. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
Authors (4)
  1. Vishal Kumar (19 papers)
  2. Zeyi Liao (14 papers)
  3. Jaylen Jones (3 papers)
  4. Huan Sun (88 papers)