Investigating Adversarial Trigger Transfer in Large Language Models (2404.16020v2)

Published 24 Apr 2024 in cs.CL

Abstract: Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned LLMs. These triggers are believed to be highly transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not consistently transferable. We extensively investigate trigger transfer amongst 13 open models and observe poor and inconsistent transfer. Our experiments further reveal a significant difference in robustness to adversarial triggers between models Aligned by Preference Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO models are extremely hard to jailbreak even when the trigger is optimized directly on the model. On the other hand, while AFT models may appear safe on the surface, exhibiting refusals to a range of unsafe instructions, we show that they are highly susceptible to adversarial triggers. Lastly, we observe that most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, further emphasizing their vulnerability. Overall, our work highlights the need for more comprehensive safety evaluations for aligned LLMs.


Summary

  • The paper demonstrates that triggers optimized with the Greedy Coordinate Gradient method do not reliably transfer across different LLMs.
  • Models aligned by preference optimization (APO), such as Llama2 and Starling, exhibit strong resistance to adversarial triggers, while models aligned by fine-tuning (AFT) are notably vulnerable.
  • Generalization experiments reveal that triggers optimized on AFT models also succeed on new unsafe instructions, underscoring the need for rigorous safety evaluations.

This paper investigates the transferability of adversarial triggers for jailbreaking aligned LLMs. The authors empirically demonstrate that adversarial triggers generated using the Greedy Coordinate Gradient (GCG) method do not reliably transfer across different models, particularly those aligned by preference optimization.

The paper highlights the following key observations:

  1. Inconsistent Transferability: The paper challenges the notion of universal transferability of adversarial triggers. Through experiments on 13 open models, the authors find that triggers optimized on one model often fail to jailbreak other models.
  2. Robustness of Aligned by Preference Optimization (APO) Models: Models aligned using preference optimization techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), exhibit significant robustness against adversarial triggers: they are hard to jailbreak even when triggers are optimized directly on them, and such triggers transfer poorly to other models. Examples of APO models used in the paper include Gemma, Llama2, and Starling.
  3. Vulnerability of Aligned by Fine-Tuning (AFT) Models: The research indicates that models aligned by fine-tuning (AFT) are more susceptible to adversarial triggers. Despite appearing safe on the surface by refusing unsafe instructions, AFT models can be easily jailbroken with adversarial triggers. Examples of AFT models used in the paper include Koala, Vicuna, and MPT-7B-Chat.
  4. Generalization to New Unsafe Instructions: Triggers optimized on AFT models demonstrate a capacity to generalize to new, unseen unsafe instructions across diverse domains, further emphasizing the vulnerability of these models.
  5. Experimental Setup: The paper uses examples from AdvBench for trigger optimization and evaluation. The Attack Success Rate (ASR) is the proportion of instructions to which a model responds harmfully, and ΔASR is the difference between the ASRs obtained with and without the trigger appended to the input (a minimal computation sketch follows this list). Llama-Guard is used to detect whether triggers jailbreak models.
  6. Greedy Coordinate Gradient (GCG): The paper uses GCG, a white-box attack method, to find adversarial triggers. GCG iteratively updates the trigger using gradient information so as to minimize the cross-entropy loss of affirmative responses to harmful instructions (a simplified GCG step is sketched after this list).
  7. Models Used: The paper includes a range of open models, including Gemma (Instruct-2B and Instruct-7B), Guanaco (7B and 13B), Llama2 (7B-Chat and 13B-Chat), MPT-7B-Chat, OpenChat-3.5-7B, Starling-7B (α and β), Vicuna (7B and 13B), and Koala-7B.
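
As a rough illustration of the metrics in item 5, the sketch below computes ASR and ΔASR from per-instruction jailbreak judgments. The helper names (generate, judge_is_harmful, e.g. a Llama-Guard-based judge) are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of ASR and ΔASR computation (assumed helper interfaces, not the paper's code).
from typing import Callable, List

def attack_success_rate(
    instructions: List[str],
    generate: Callable[[str], str],                 # model wrapper: prompt -> response (assumed)
    judge_is_harmful: Callable[[str, str], bool],   # e.g. a Llama-Guard-based judge (assumed)
    trigger: str = "",
) -> float:
    """Fraction of instructions for which the model produces a harmful response."""
    hits = 0
    for instruction in instructions:
        prompt = f"{instruction} {trigger}".strip()  # the trigger is appended to the instruction
        response = generate(prompt)
        if judge_is_harmful(instruction, response):
            hits += 1
    return hits / len(instructions)

def delta_asr(instructions, generate, judge_is_harmful, trigger) -> float:
    """ΔASR: ASR with the trigger appended minus ASR without it."""
    with_trigger = attack_success_rate(instructions, generate, judge_is_harmful, trigger)
    without_trigger = attack_success_rate(instructions, generate, judge_is_harmful)
    return with_trigger - without_trigger
```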

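The GCG step in item 6 can be summarized as: take the gradient of the target loss with respect to one-hot trigger tokens, collect the top-k promising substitutions per position, evaluate a batch of single-token swaps, and keep the best. The sketch below is a single-instruction simplification under assumed HuggingFace-style interfaces, not the paper's code; embed_matrix is assumed to be the model's input embedding weight matrix.

```python
# Simplified single-instruction GCG step (assumed HuggingFace-style interfaces; a sketch of
# the published algorithm, not the paper's implementation).
import torch
import torch.nn.functional as F

@torch.no_grad()
def candidate_losses(model, prefix_ids, cand_trigger_ids, target_ids):
    """Target cross-entropy for each candidate trigger, in one batched forward pass."""
    n = cand_trigger_ids.shape[0]
    input_ids = torch.cat(
        [prefix_ids.repeat(n, 1), cand_trigger_ids, target_ids.repeat(n, 1)], dim=1
    )
    logits = model(input_ids).logits
    tgt_len = target_ids.shape[0]
    # Loss is measured only on the affirmative target ("Sure, here is ...") positions.
    tgt_logits = logits[:, -tgt_len - 1:-1, :]
    return F.cross_entropy(
        tgt_logits.transpose(1, 2), target_ids.repeat(n, 1), reduction="none"
    ).mean(dim=1)

def gcg_step(model, embed_matrix, prefix_ids, trigger_ids, target_ids,
             top_k=256, num_candidates=128):
    """One Greedy Coordinate Gradient step: gradient-guided token proposals, then a greedy pick.

    embed_matrix is assumed to be the model's input embedding weight (vocab_size x hidden);
    prefix_ids, trigger_ids, target_ids are 1-D LongTensors on the model's device.
    """
    # 1) One-hot trigger tokens so the loss is differentiable w.r.t. token choices.
    one_hot = torch.zeros(trigger_ids.shape[0], embed_matrix.shape[0],
                          device=embed_matrix.device, dtype=embed_matrix.dtype)
    one_hot.scatter_(1, trigger_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    # 2) Cross-entropy of the affirmative target given [prefix, trigger, target].
    embeds = torch.cat([embed_matrix[prefix_ids], one_hot @ embed_matrix,
                        embed_matrix[target_ids]], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=embeds).logits[0]
    loss = F.cross_entropy(logits[-target_ids.shape[0] - 1:-1], target_ids)
    loss.backward()

    # 3) Top-k promising substitutions per trigger position (largest negative gradient).
    top_tokens = (-one_hot.grad).topk(top_k, dim=1).indices

    # 4) Sample candidates that each swap one randomly chosen position, then keep the
    #    candidate with the lowest target loss (the greedy step).
    device = trigger_ids.device
    cands = trigger_ids.repeat(num_candidates, 1)
    pos = torch.randint(0, trigger_ids.shape[0], (num_candidates,), device=device)
    pick = torch.randint(0, top_k, (num_candidates,), device=device)
    cands[torch.arange(num_candidates, device=device), pos] = top_tokens[pos, pick]

    losses = candidate_losses(model, prefix_ids, cands, target_ids)
    return cands[losses.argmin()]
```
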
The authors conduct several experiments to assess trigger transferability (a schematic transfer-evaluation loop is sketched after this list):

  • Transfer from Existing Ensembles: The paper evaluates the transferability of triggers optimized using model ensembles from a previous work. The results show inconsistent transfer across models, with APO models being particularly resistant to jailbreaking.
  • Transfer Among APO Models: Triggers optimized on APO models are tested for transferability to other APO models. The paper finds limited transferability, even within the same model family.
  • Alignment by Fine-Tuning Analysis: The paper investigates the robustness of AFT models against adversarial triggers. Trigger optimization converges faster on AFT models, and they are more susceptible to jailbreaking than APO models.
  • Instruction Generalization: The paper examines the generalization of triggers optimized on AdvBench to other safety benchmarks. The results indicate that triggers can generalize well to unseen unsafe instructions, particularly for AFT models.
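
These cross-model transfer experiments reduce to a simple grid: optimize a trigger on a source model, then measure the ΔASR it induces on every target model. A schematic loop, reusing the hypothetical delta_asr helper sketched earlier, might look like this (optimize_trigger and make_generate are placeholder names, not the paper's API):

```python
# Schematic cross-model transfer grid (placeholder helpers; not the paper's code).
from typing import Callable, Dict, List, Tuple

def transfer_matrix(
    source_models: Dict[str, object],
    target_models: Dict[str, object],
    instructions: List[str],
    optimize_trigger: Callable,   # e.g. repeated GCG steps on the source model (placeholder)
    make_generate: Callable,      # wraps a target model as prompt -> response (placeholder)
    judge_is_harmful: Callable,   # e.g. a Llama-Guard-based judge (placeholder)
) -> Dict[Tuple[str, str], float]:
    """ΔASR of a trigger optimized on each source model, evaluated on each target model."""
    results = {}
    for src_name, src_model in source_models.items():
        trigger = optimize_trigger(src_model, instructions)
        for tgt_name, tgt_model in target_models.items():
            generate = make_generate(tgt_model)
            results[(src_name, tgt_name)] = delta_asr(
                instructions, generate, judge_is_harmful, trigger
            )
    return results
```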

The paper concludes that while AFT models may appear safe due to their ability to refuse unsafe instructions, they lack the adversarial robustness of APO models. The authors suggest that more comprehensive safety evaluations, including automated red-teaming, are needed to assess model robustness.

The paper acknowledges limitations, including the focus on a single attack method (GCG), the lack of evaluation of response relevance and helpfulness, and the potential for triggers to not fully converge within the allotted optimization time.

In summary, this paper provides empirical evidence against the universal transferability of adversarial triggers, highlighting the robustness of APO models and the vulnerability of AFT models. The findings underscore the need for more rigorous safety evaluations and the development of more robust alignment techniques.
