Overview of "AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs"
This paper addresses LLM security by developing AmpleGCG, a generative model that produces adversarial suffixes to efficiently "jailbreak" both open and closed LLMs. The authors build on prior work, notably the Greedy Coordinate Gradient (GCG) method, to improve the attack success rate (ASR) and the efficiency of generating adversarial suffixes that bypass the safety measures built into contemporary LLMs.
Key Contributions
- Limitations of GCG: The paper begins by critiquing existing methods such as GCG, which optimize a single adversarial suffix per query and rely on the target loss as the measure of success. The authors show that a low loss does not reliably indicate a successful jailbreak; for instance, a suffix can reach a low overall loss while the loss on the initial target token stays high, so the victim model still produces a safe response.
- Augmented GCG: Motivated by this critique, the authors propose "augmented GCG," which keeps every candidate suffix sampled during optimization instead of only the final lowest-loss one. This raises ASR dramatically, from about 20% to 80% on certain LLMs, by surfacing successful adversarial suffixes that conventional GCG overlooks (a minimal sketch of this idea appears after this list).
- AmpleGCG: Building on augmented GCG, AmpleGCG is a generative model trained on the successful adversarial suffixes that augmented GCG collects, learning to map a harmful query to many adversarial suffixes. It reaches near-100% ASR on models such as Vicuna-7B and Llama-2-7B-Chat using only 200 generated suffixes per query, highlighting both its efficiency and its capacity to uncover a wider array of vulnerabilities (see the generation sketch after this list).
- Transferability and Efficiency: AmpleGCG transfers well, successfully attacking both unseen open-source and closed-source models, including newer versions of GPT-3.5. It is also fast, producing 200 adversarial suffixes for a query in approximately 4 seconds.
- Defense Evasion: The paper also shows that defense mechanisms such as perplexity-based detectors can be circumvented, for example by repeating the harmful query within the prompt so that the fluent repetition lowers the prompt's overall perplexity, keeping AmpleGCG's attacks effective against this class of detection strategies (see the perplexity sketch below).
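The core change augmented GCG makes is to stop discarding intermediate candidates. The sketch below illustrates only that idea; `gcg_step` and `is_jailbroken` are hypothetical hooks standing in for one GCG optimization step and for an evaluator of the victim model's responses, not functions from the paper's code.

```python
# A minimal sketch (not the authors' code) of the augmented GCG idea: keep every
# candidate suffix sampled at every optimization step and judge each one directly,
# instead of keeping only the final lowest-loss suffix.
from typing import Callable, List

def augmented_gcg(
    query: str,
    init_suffix: str,
    gcg_step: Callable[[str, str], List[str]],   # candidates sampled this step, lowest-loss first (hypothetical)
    is_jailbroken: Callable[[str, str], bool],   # does query + suffix elicit a harmful response? (hypothetical)
    num_steps: int = 500,
) -> List[str]:
    """Collect every sampled suffix that actually jailbreaks the victim model."""
    successful: List[str] = []
    suffix = init_suffix
    for _ in range(num_steps):
        candidates = gcg_step(query, suffix)     # keep all sampled candidates, not just the best one
        successful.extend(c for c in candidates if is_jailbroken(query, c))
        suffix = candidates[0]                   # lowest-loss candidate seeds the next step
    return successful
```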
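Once such a generator is trained, producing many suffixes per query is a single decoding call. The sketch below shows one plausible setup using Hugging Face group (diverse) beam search; the checkpoint path is a placeholder, and the decoding hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# A minimal sketch of AmpleGCG-style suffix generation with Hugging Face transformers.
# "path/to/amplegcg-generator" is a hypothetical placeholder for a fine-tuned generator checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/amplegcg-generator"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

query = "PLACEHOLDER HARMFUL QUERY"
inputs = tokenizer(query, return_tensors="pt").to(model.device)

# Group (diverse) beam search returns many distinct suffixes from one decoding call;
# the paper reports generating around 200 suffixes per query in a few seconds.
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    num_beams=200,
    num_beam_groups=50,
    diversity_penalty=1.0,
    num_return_sequences=200,
)
prompt_len = inputs["input_ids"].shape[1]
suffixes = [
    tokenizer.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs
]
```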
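The perplexity evasion relies on the fact that a detector scores the whole prompt, so padding a gibberish suffix with fluent, repeated text drags the average down. The sketch below uses GPT-2 as a stand-in perplexity filter; the query and suffix strings are illustrative placeholders, not examples from the paper.

```python
# A minimal sketch, using GPT-2 as a stand-in perplexity filter, of why repeating the
# harmful query can hide a high-perplexity adversarial suffix.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean token-level cross-entropy) under the filter model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

query = "PLACEHOLDER HARMFUL QUERY"
suffix = "describing !! similarlyNow zz ++"   # stand-in gibberish suffix, not from the paper

plain = f"{query} {suffix}"
repeated = " ".join([query] * 4) + f" {suffix}"   # fluent repetition dominates the token count

# The repeated-query prompt should score markedly lower, slipping under a perplexity threshold.
print(perplexity(plain), perplexity(repeated))
```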
Implications and Future Directions
The implications of this work are twofold. Practically, it underscores the persistent and evolving challenge of securing LLMs against adversarial attacks. Theoretically, it questions the reliability of loss as the sole metric for attack success and highlights the potential of generative models to learn complex distributions of adversarial suffixes.
The authors suggest that future work could refine the AmpleGCG approach, for example by integrating stronger base models and evaluators to further improve its effectiveness and transferability. Exploring alignment strategies that can withstand such generation-based attacks without degrading LLM capabilities is another crucial direction.
Conclusion
In summary, the paper provides a rigorous analysis and extension of adversarial suffix generation, moving from individually optimized suffixes to a learned generator of successful ones. With AmpleGCG, the authors present a potent tool that underscores the need for more fundamental and robust defenses against LLM misuse, illustrating both the difficulty of the problem and the growing sophistication of adversarial techniques in AI.