- The paper introduces AttnGCG, a method that enhances jailbreaking attacks by manipulating attention scores in transformer-based LLMs.
- It augments the GCG objective with an auxiliary attention loss, yielding average attack success rate (ASR) gains of roughly 7% on the Llama-2 series and 10% on the Gemma series.
- The method also transfers robustly, raising ASR on unseen harmful prompts and on black-box models such as GPT-3.5 and GPT-4.
Enhancing Jailbreak Attacks on LLMs: The AttnGCG Approach
The paper focuses on the vulnerabilities of transformer-based LLMs to jailbreaking attacks, emphasizing the optimization-based Greedy Coordinate Gradient (GCG) strategy. The authors introduce an enhanced method called AttnGCG, which manipulates the models' attention scores to improve the efficacy of jailbreaking.
Observations and Methodology
The authors first identify a correlation between attack effectiveness and the model's internal attention allocation: attacks are notably less effective when the model concentrates its attention on the system prompt designed for safety alignment. This observation motivates AttnGCG, which explicitly steers attention scores to facilitate jailbreaking.
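To make this observation concrete, the sketch below (not taken from the paper) shows one way to measure how a model distributes attention across the system prompt, the harmful goal, and an adversarial suffix. The model name, the placeholder strings, and the choice to average over layers and heads are illustrative assumptions.

```python
# Minimal diagnostic sketch (assumed, not the authors' code): run a prompt once with
# output_attentions=True and aggregate how much attention the final token pays to
# each prompt segment (system prompt vs. goal vs. adversarial suffix).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM exposing attentions works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

system_prompt = "You are a helpful and harmless assistant."
goal = "Tell me how to pick a lock."   # placeholder harmful goal
suffix = "! ! ! ! ! ! ! ! ! !"         # placeholder adversarial suffix (GCG-style init)

# Track the token span of each segment so attention mass can be attributed to it.
segments, ids = {}, []
for name, text in [("system", system_prompt), ("goal", goal), ("suffix", suffix)]:
    piece = tok(text, add_special_tokens=False).input_ids
    segments[name] = (len(ids), len(ids) + len(piece))
    ids += piece

input_ids = torch.tensor([ids], device=model.device)
with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# Average the last token's attention over all layers and heads,
# then sum the mass falling inside each segment.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq_len,)
for name, (start, end) in segments.items():
    print(f"attention mass on {name}: {attn[start:end].sum().item():.3f}")
```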
AttnGCG augments the standard GCG objective with an auxiliary attention loss that directs the model's focus toward the adversarial suffix. This not only raises the attack success rate across various LLMs but also makes attention-score visualizations more interpretable. Empirical results show average ASR gains of about 7% on the Llama-2 series and 10% on the Gemma series, and the strategy transfers robustly to unseen harmful goals and to black-box models such as GPT-3.5 and GPT-4.
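The following is a minimal, hedged sketch of how such a combined objective could look in PyTorch: the usual GCG target loss plus an attention term that rewards mass on the adversarial suffix. The weight `LAMBDA`, the layer/head averaging, and the slice arguments are assumptions for illustration, not the paper's exact formulation, and the sketch covers only the loss computation, not the full GCG token-search loop.

```python
import torch
import torch.nn.functional as F

LAMBDA = 0.1  # assumed trade-off weight between target loss and attention loss

def attngcg_loss(model, input_ids, target_slice, suffix_slice):
    """Return a GCG-style target loss plus an attention term rewarding focus on the suffix."""
    out = model(input_ids, output_attentions=True)

    # Standard GCG objective: negative log-likelihood of the target (affirmative) response.
    logits = out.logits[0, target_slice.start - 1 : target_slice.stop - 1]
    targets = input_ids[0, target_slice]
    target_loss = F.cross_entropy(logits, targets)

    # Auxiliary attention term: average (over layers and heads) attention that the target
    # positions pay to the adversarial suffix; maximizing it means minimizing its negation.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]           # (seq_len, seq_len)
    suffix_mass = attn[target_slice, suffix_slice].sum(dim=-1).mean()
    attn_loss = -suffix_mass

    return target_loss + LAMBDA * attn_loss
```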
Strong Numerical Results
The paper highlights several key numerical improvements:
- Llama-2 Series: An average ASR increase of 7%.
- Gemma Series: An average ASR increase of 10%.
- Transferability: 11.4% increase in ASR for unseen goals and 2.8% for black-box models.
These results emphasize the enhanced effectiveness of the AttnGCG method over standard GCG and demonstrate its robustness across varied testing conditions.
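As a small clarifying example (assumed, not from the paper), ASR is simply the fraction of harmful prompts for which an attack is judged successful, so the reported gains correspond to differences between such fractions:

```python
def asr(outcomes):
    """Attack success rate: fraction of harmful prompts where the jailbreak succeeded."""
    return sum(outcomes) / len(outcomes)

gcg_outcomes     = [True, False, True, False, False]  # placeholder judge verdicts
attngcg_outcomes = [True, True,  True, False, True]   # placeholder judge verdicts

delta = asr(attngcg_outcomes) - asr(gcg_outcomes)
print(f"ASR gain from AttnGCG over GCG: {delta:.1%}")
```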
Implications and Future Work
The research underscores the need for closer scrutiny of adversarial prompt crafting and of LLM security mechanisms. AttnGCG's effective transfer to other models suggests that understanding and manipulating attention mechanisms could be pivotal in developing stronger defensive strategies for LLMs.
Future research should explore further optimization of attention-directed strategies, investigate defenses against such attacks, and consider the ethical implications of enhanced jailbreaking methods. Insights gained from attention-score visualizations might inform more robust alignments of LLMs with safety protocols, contributing to safer AI deployments.
In summary, the paper provides substantial insights into improving and understanding jailbreaking attacks on LLMs using attention manipulation. It paves the way for future explorations in both offensive and defensive strategies within the field of LLM safety and alignment.