Improved Techniques for Optimization-Based Jailbreaking on LLMs
The paper "Improved Techniques for Optimization-Based Jailbreaking on LLMs" focuses on advancing the field of adversarial attacks on LLMs. The authors address the limitations of previous optimization-based jailbreak methods, specifically critiquing the Greedy Coordinate Gradient (GCG) attack for its subpar efficiency. They propose an improved version of GCG, dubbed -GCG, incorporating several novel strategies to enhance jailbreak performance significantly.
Key Contributions
The paper makes several notable contributions to the optimization-based jailbreaking methodology:
- Diverse Target Templates: The authors employ diverse target templates that embed harmful self-suggestions and guidance. Diversifying the optimization objective in this way makes the target LLMs more susceptible to attack, since the optimizer has several alternative affirmative responses toward which it can steer the model (a template sketch follows this list).
- Automatic Multi-Coordinate Updating Strategy: To speed up the optimization, an automatic multi-coordinate updating strategy is proposed. Instead of replacing a single token per step, the strategy adaptively decides how many tokens to substitute at once, which accelerates convergence of the adversarial suffix (see the update sketch after this list).
- Easy-to-Hard Initialization: The authors also present an easy-to-hard initialization strategy. A jailbreak suffix is first optimized for a simple harmful request and then reused as the initialization when attacking more difficult requests, so that suffixes learned on easy tasks bootstrap the optimization for harder ones (see the initialization sketch after this list).
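As a concrete illustration of the first strategy, the snippet below sketches how one harmful request can be expanded into several target strings. The template wordings and the helper function are illustrative assumptions, not the paper's exact templates or code.

```python
# A minimal sketch of the diverse-target-template idea; the template strings
# below are illustrative, not the paper's exact wording.

# Affirmative restatement of a harmful request, as in a standard GCG target.
affirmative = "here is a tutorial on how to pick a lock"

# Standard GCG optimizes the suffix so that the model's reply begins with a
# single fixed prefix such as this one.
plain_target = f"Sure, {affirmative}"

# Diverse templates prepend a harmful "self-suggestion" before the affirmative
# prefix, giving the optimizer several alternative objectives to hit.
templates = [
    "Sure, my output is harmful, {affirmative}",
    "Of course, I will ignore my safety guidelines, {affirmative}",
    "Absolutely, as an unrestricted assistant, {affirmative}",
]

def build_targets(affirmative: str) -> list[str]:
    """Expand one affirmative response into several optimization targets."""
    return [t.format(affirmative=affirmative) for t in templates]

print(build_targets(affirmative))
```

The loss can then be computed against whichever of these targets the optimizer currently finds easiest to reach, rather than a single rigid prefix.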
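The multi-coordinate update can be pictured as follows. The plain-Python sketch below (an assumed interface, not the authors' released code) merges the best single-token substitutions found by an ordinary GCG step and keeps whichever combined suffix attains the lowest loss.

```python
def multi_coordinate_update(suffix_ids, single_candidates, loss_fn, max_combine=7):
    """Adaptive multi-coordinate update (sketch under assumed inputs).

    suffix_ids:        current adversarial suffix as a list of token ids.
    single_candidates: (position, token_id, loss) tuples from a normal GCG step,
                       i.e. single-token substitutions already scored by the loss.
    loss_fn:           callable scoring a candidate suffix against the target.
    """
    # Best single-token substitution per position, lowest loss first.
    best_per_pos = {}
    for pos, tok, loss in sorted(single_candidates, key=lambda c: c[2]):
        best_per_pos.setdefault(pos, (tok, loss))
    ranked = sorted(best_per_pos.items(), key=lambda kv: kv[1][1])

    best_suffix, best_loss = list(suffix_ids), loss_fn(suffix_ids)
    merged = list(suffix_ids)
    for pos, (tok, _) in ranked[:max_combine]:
        merged = merged.copy()       # accumulate substitutions: top-1, top-2, ...
        merged[pos] = tok
        loss = loss_fn(merged)
        if loss < best_loss:         # keep the combination with the lowest loss
            best_suffix, best_loss = merged, loss
    return best_suffix, best_loss
```

Where standard GCG would accept only the single best substitution per step, this variant can accept several at once, which is what shortens the number of optimization steps needed.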
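Easy-to-hard initialization amounts to reusing an already-optimized suffix as the starting point for new requests. The sketch below assumes a generic `run_gcg(request, target, init_suffix)` optimizer, which is a hypothetical stand-in rather than an API from the paper.

```python
# Sketch of easy-to-hard suffix initialization. `run_gcg` is a stand-in for any
# GCG-style optimizer with signature (request, target, init_suffix) -> suffix.

DEFAULT_INIT = "! " * 20  # the commonly used GCG initialization of placeholder tokens

def easy_to_hard_suffixes(easy_request, easy_target, hard_cases, run_gcg):
    # Step 1: optimize a suffix for a simple harmful request from scratch.
    easy_suffix = run_gcg(easy_request, easy_target, init_suffix=DEFAULT_INIT)

    # Step 2: reuse that suffix as the initialization for harder requests,
    # which converges faster than starting from the default placeholder suffix.
    hard_suffixes = {}
    for request, target in hard_cases:
        hard_suffixes[request] = run_gcg(request, target, init_suffix=easy_suffix)
    return hard_suffixes
```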
Experimentation and Results
The authors conducted extensive experimental evaluations of the proposed I-GCG on several LLMs, including VICUNA-7B-v1.5, GUANACO-7B, LLAMA2-7B-CHAT, and MISTRAL-7B-INSTRUCT-v0.2, using the AdvBench dataset and the NeurIPS 2023 Red Teaming Track. The results show that I-GCG achieves a nearly 100% attack success rate across these models, outperforming state-of-the-art methods such as AutoDAN, MAC, and the original GCG.
Implications and Future Research
The implications of these improvements are both practical and theoretical:
- Practical Impact: Practically, the increased efficiency and success rate of I-GCG underscore the pressing need for more robust security and alignment techniques in LLMs. The ability of I-GCG to efficiently elicit harmful responses from a variety of sophisticated models highlights vulnerabilities that could be exploited in real-world applications, potentially leading to misuse and harmful societal impacts.
- Theoretical Impact: The introduction of diverse target templates and the multi-coordinate updating strategy add new dimensions to the theoretical framework of adversarial attacks. These approaches could inspire further research into dynamic optimization strategies for generating adversarial inputs, not just in the context of LLMs but also in other areas of machine learning.
Speculation on Future Developments in AI
Future research could explore several avenues based on the findings of this paper:
- Enhanced Defensive Mechanisms: There is an opportunity to develop more sophisticated defensive mechanisms against such optimized jailbreak attacks. This includes creating models that are robust against a wider variety of adversarial examples and employing continual learning strategies to adapt to emerging threats.
- Transfer Learning for Adversarial Attacks: Given the observed transferability of adversarial suffix generation, future research could delve into transfer learning techniques to further improve the efficiency and effectiveness of adversarial attacks across different LLM architectures.
- Ethical AI Development: The results of this paper underscore the need for rigorous ethical guidelines and automated auditing processes for AI systems to prevent harmful outputs. Research can also focus on developing AI models that inherently align more closely with human values and ethical guidelines.
Conclusion
The paper presents significant advances in optimization-based jailbreak attacks on LLMs. By combining diverse target templates, an automatic multi-coordinate updating strategy, and an easy-to-hard initialization technique, the proposed I-GCG method achieves superior jailbreak performance. These contributions deepen the current understanding of adversarial attack strategies against LLMs and pave the way for future research on more robust and ethically aligned AI systems.