AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts (2410.22143v1)

Published 29 Oct 2024 in cs.CL

Abstract: Although LLMs are typically aligned, they remain vulnerable to jailbreaking through either carefully crafted prompts in natural language or, interestingly, gibberish adversarial suffixes. However, gibberish tokens have received relatively less attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG (Liao and Sun, 2024), demonstrates that a generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query, exposing a range of alignment gaps in out-of-distribution (OOD) language spaces. To bring more attention to this area, we introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. Through a series of exploratory experiments, we identify several training strategies to improve the learning of gibberish suffixes. Our results, verified under a strict evaluation setting, show that it outperforms AmpleGCG on both open-weight and closed-source models, achieving increases in attack success rate (ASR) of up to 17% in the white-box setting against Llama-2-7B-chat, and more than tripling ASR in the black-box setting against GPT-4. Notably, AmpleGCG-Plus jailbreaks the newer GPT-4o series of models at similar rates to GPT-4, and uncovers vulnerabilities against the recently proposed circuit breakers defense. We publicly release AmpleGCG-Plus along with our collected training datasets.

Analysis of "AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts"

The paper presents AmpleGCG-Plus, an enhanced version of the AmpleGCG model designed to efficiently generate gibberish adversarial suffixes that exploit alignment gaps in LLMs. While most jailbreak research has focused on natural-language adversarial prompts, gibberish suffixes offer a distinct attack surface: they achieve a high success rate in bypassing safety guardrails while requiring fewer attempts than previous approaches.

Key Innovations and Results

AmpleGCG-Plus builds upon the initial groundwork established by AmpleGCG, introducing several modifications aimed at optimizing the production of gibberish adversarial suffixes. The enhancements include:

  1. Model Optimization: The model leverages pre-trained LLMs, significantly improving performance over models initialized from scratch. Findings suggest that pre-training enhances the model's ability to generate unnatural suffixes, leveraging clustering capabilities developed during instruction tuning.
  2. Increased Data Utilization: By utilizing a substantially larger training dataset compared to AmpleGCG, AmpleGCG-Plus harnesses a broader spectrum of examples to improve model generalization.
  3. Improved Classifiers for Filtering: The paper emphasizes the use of stricter classification criteria for filtering training data, significantly reducing false positives in jailbreak evaluations. This change enhances the reliability of the generated suffixes (a high-level sketch of the resulting overgenerate-then-filter recipe follows this list).
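
Conceptually, these pieces fit into an overgenerate-then-filter recipe for collecting training data. The sketch below illustrates that recipe only at a high level: generate_candidates, query_target, and is_harmful are hypothetical placeholder callables standing in for a suffix sampler, the target LLM, and a stricter harmfulness classifier; they are not the authors' released code.

```python
def build_training_pairs(harmful_queries, generate_candidates, query_target, is_harmful, n_candidates=200):
    """Collect (query, suffix) pairs whose completions a strict classifier flags as jailbreaks.

    The three callables are hypothetical placeholders (suffix sampler, target LLM,
    harmfulness classifier), not the paper's actual components.
    """
    pairs = []
    for query in harmful_queries:
        # Overgenerate: sample many candidate gibberish suffixes per query.
        for suffix in generate_candidates(query, n=n_candidates):
            response = query_target(f"{query} {suffix}")
            # Stricter filtering: keep the pair only if the classifier judges the
            # response an actual jailbreak, cutting false positives that simple
            # refusal-keyword checks would let through.
            if is_harmful(query, response):
                pairs.append({"query": query, "suffix": suffix})
    return pairs
```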

These optimizations result in a notable increase in attack success rate (ASR). For example, against Llama-2-7B-Chat in the white-box setting, AmpleGCG-Plus achieves up to a 17% higher ASR than its predecessor, and against GPT-4 in the black-box setting it more than triples the ASR.
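
For context on how "fewer attempts" is typically quantified, the snippet below computes ASR at a sampling budget k: a query counts as jailbroken if at least one of its first k sampled suffixes elicits a harmful response. This is a generic formulation for illustration, not the paper's evaluation code.

```python
def asr_at_k(success_records, k):
    """Attack success rate at budget k.

    success_records: one list of booleans per query, one entry per sampled
    suffix in the order it was tried (generic formulation for illustration).
    """
    jailbroken = sum(1 for record in success_records if any(record[:k]))
    return jailbroken / len(success_records)

# Example: 3 queries, 4 sampled suffixes each.
records = [
    [False, False, True, False],   # succeeds on the 3rd attempt
    [False, False, False, False],  # never succeeds
    [True, False, False, False],   # succeeds immediately
]
print(asr_at_k(records, k=1))  # 0.33...
print(asr_at_k(records, k=4))  # 0.66...
```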

Implications for Future Research

AmpleGCG-Plus reveals critical insights into the vulnerabilities of LLMs, specifically underscoring the limitations of current alignment strategies against non-semantic adversarial tactics. Such vulnerabilities highlight the necessity for more comprehensive alignment and safety measures beyond traditional natural language processing paradigms.

The model also remains effective against newer defense mechanisms such as circuit breakers, indicating that underlying vulnerabilities persist in LLM architectures regardless of how sophisticated the applied safety measures are. This points to the need for defenses that can identify and mitigate gibberish or otherwise non-standard input patterns.
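
One widely discussed defense of this kind is perplexity filtering, which exploits the fact that gibberish suffixes tend to have far higher perplexity under a reference language model than natural prompts (see the baseline-defenses work in the references). A minimal sketch follows, assuming GPT-2 as the reference model and an arbitrary threshold; both choices are illustrative rather than a configuration evaluated in the paper.

```python
# Minimal perplexity-filter sketch: flag prompts whose perplexity under a small
# reference LM exceeds a threshold. GPT-2 and the threshold are illustrative
# choices, not a defense configuration from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def flag_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    """Return True if the prompt looks like a high-perplexity (gibberish) attack."""
    return perplexity(prompt) > threshold

print(flag_prompt("Tell me a story about a friendly robot."))            # likely False
print(flag_prompt("robot (#describing.\\ + similarlyNow oppositeley.]("))  # likely True
```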

In practical terms, AmpleGCG-Plus offers a potent tool for red-teaming exercises in AI safety, presenting an efficient method for stress-testing LLM responses beyond conventional prompts. As LLM deployment becomes widespread, tools like AmpleGCG-Plus will be pivotal in ensuring that model outputs remain aligned with human values, especially for applications in sensitive domains.
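
In such a workflow, the practical appeal is that many candidate suffixes can be sampled cheaply from the generator for each query and then screened against the target model. The sketch below shows this general pattern using Hugging Face group (diverse) beam search; the checkpoint path is a placeholder and the decoding hyperparameters are illustrative, not the released AmpleGCG-Plus configuration.

```python
# Sketch: drawing many diverse suffix candidates from a suffix-generator model
# with group (diverse) beam search. The checkpoint path is a placeholder;
# substitute a released suffix-generator checkpoint. Hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

SUFFIX_GENERATOR = "path/to/suffix-generator"  # placeholder, not a real model ID
tokenizer = AutoTokenizer.from_pretrained(SUFFIX_GENERATOR)
model = AutoModelForCausalLM.from_pretrained(SUFFIX_GENERATOR)

query = "example red-teaming query from your evaluation set"
inputs = tokenizer(query, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=20,        # gibberish suffixes are short token sequences
    num_beams=50,             # total beams
    num_beam_groups=10,       # group beam search for diversity across groups
    diversity_penalty=1.0,    # discourage groups from repeating each other's tokens
    num_return_sequences=50,  # overgenerate: return every beam
    do_sample=False,          # group beam search is deterministic
)
prompt_len = inputs["input_ids"].shape[1]
suffixes = [
    tokenizer.decode(out[prompt_len:], skip_special_tokens=True)
    for out in outputs
]
# Each query + suffix would then be sent to the target model and judged by a
# harmfulness classifier (as in the filtering sketch earlier).
```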

Challenges and Future Directions

The paper acknowledges its reliance on current LLM-based evaluators to judge harmful outputs, suggesting room for improvement in classifier robustness and accuracy. It also hints at broader applications for gibberish suffix generators, including integration with other jailbreak tactics such as AutoDAN or PAIR.

A prospective research avenue involves extending the overgenerate-then-filter pipeline to encompass diverse LLM architectures, further examining the transferability of these adversarial suffixes across different model configurations. Moreover, exploring synergies between gibberish suffixes and semantic adversarial methods could yield holistic approaches to probing LLM vulnerabilities, potentially culminating in more resilient safety architectures.

In summary, AmpleGCG-Plus marks a significant advancement in the study of adversarial attacks on LLMs, offering practical tools for security assessments and sparking further inquiry into the resilience of LLMs against unconventional inputs.

References (41)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Maksym Andriushchenko and Nicolas Flammarion. 2024. Does refusal training in llms generalize to the past tense? arXiv preprint arXiv:2407.11969.
  3. Many-shot jailbreaking. Anthropic, April.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  5. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  6. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  7. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.
  8. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations.
  9. QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems.
  10. Catastrophic jailbreak of open-source LLMs via exploiting generation. In The Twelfth International Conference on Learning Representations.
  11. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
  12. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
  13. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. arXiv preprint arXiv:2406.18510.
  14. Mission: Impossible language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14691–14714, Bangkok, Thailand. Association for Computational Linguistics.
  15. Zeyi Liao and Huan Sun. 2024. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs. In First Conference on Language Modeling.
  16. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. arXiv preprint arXiv:2410.05295.
  17. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations.
  18. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In Forty-first International Conference on Machine Learning.
  19. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119.
  20. Raphael Milliere. 2024. Language models as models of language. arXiv preprint arXiv:2408.07144.
  21. Zvi Mowshowitz. 2022. Jailbreaking ChatGPT on release day. https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day. Accessed: 2024-09-29.
  22. OpenAI. 2024. Gpt-4o system card. Technical report, OpenAI.
  23. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
  24. Advprompter: Fast adaptive adversarial prompting for llms. arXiv preprint arXiv:2404.16873.
  25. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
  26. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
  27. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
  28. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
  29. A strongREJECT for empty jailbreaks. In Thirty-eighth Conference on Neural Information Processing Systems.
  30. T. Ben Thompson and Michael Sklar. 2024a. Breaking circuit breakers. https://confirmlabs.org/posts/circuit_breaking.html.
  31. T Ben Thompson and Michael Sklar. 2024b. Fluent student-teacher redteaming. arXiv preprint arXiv:2407.17447.
  32. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  33. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  34. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36.
  35. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387.
  36. Xinbo Wu and Lav R Varshney. 2024. Transformer-based causal language models perform clustering. arXiv preprint arXiv:2402.12151.
  37. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
  38. How johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350, Bangkok, Thailand. Association for Computational Linguistics.
  39. Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140.
  40. Improving alignment and robustness with short circuiting. arXiv preprint arXiv:2406.04313.
  41. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
Authors (4)
  1. Vishal Kumar (19 papers)
  2. Zeyi Liao (14 papers)
  3. Jaylen Jones (3 papers)
  4. Huan Sun (88 papers)