Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation (2405.13068v2)

Published 20 May 2024 in cs.CR, cs.AI, and cs.LG

Abstract: LLMs have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level jailbreaking techniques, while effective, face scalability and efficiency challenges, especially as models undergo frequent updates and incorporate advanced defensive measures. In this paper, we introduce JailMine, an innovative token-level manipulation approach that addresses these limitations effectively. JailMine employs an automated "mining" process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, we demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even in the face of evolving defensive strategies. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful LLMs.

PDF Abstract

An Expert Review of "Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation"

This paper explores the susceptibility of LLMs to jailbreaking attacks by introducing a novel method called JailMine. The authors address the limitations of existing token-level jailbreak techniques in terms of scalability and efficiency, particularly as defensive measures in LLMs continue to evolve.

Methodology

JailMine leverages an automated mining process to manipulate the token outputs of LLMs strategically. The aim is to elicit responses from LLMs that would typically be prevented by safety constraints. The paper introduces a systematic approach to tailoring input prompts for gaining affirmative responses from LLMs and minimizing rejection likelihood. This involves a combination of token-level manipulation and an understanding of LLMs' probabilistic output behavior.

Evaluation and Results

The paper presents a comprehensive evaluation of JailMine using five well-known LLMs: Llama-2-7b-chat, Llama-2-13b-chat, Mistral-7b-Instruct, Llama-3-8b-Instruct, and gemma-7b-it. The assessment is conducted on two benchmarks—AdvBench and JailbreakBench—demonstrating JailMine's superior performance compared to three other established baselines: GCG, PAIR, and GPTFuzzer. Notably, JailMine achieves an average Attack Success Rate (ASR) of 96% on AdvBench and 94% on JailbreakBench, outperforming the baselines by significant margins.

In terms of efficiency, JailMine demonstrates substantial improvements. It reduces the computational time for executing jailbreak attacks by an average of 86% compared with GCG, further illustrating its practicality in real-world scenarios.

Implications

The implications of this research are significant for both practitioners and developers of LLMs. From a practical standpoint, understanding and mitigating vulnerabilities in LLMs is crucial as they become more integrated into applications across various domains. The systematic exploration of token-level manipulation strategies highlights the necessity for more robust defensive mechanisms within LLM frameworks.

Future Directions

Future work stemming from this research could focus on enhancing the defensive capabilities of LLMs against token-level attacks. Potential areas include diversifying denial patterns to make sorting manipulations more challenging and adapting fine-tuning processes to dynamically counteract adversarial techniques like JailMine. Additionally, exploring the potential of integrating explainability and transparency in LLM decision-making could aid in identifying and rectifying vulnerabilities more effectively.

Conclusion

This paper makes a significant contribution to understanding the vulnerabilities of LLMs and presents a highly effective and efficient approach for jailbreaking these models. While highlighting a critical area of concern in AI safety and security, it paves the way for more sophisticated and secure LLM deployments in the future. This insight is invaluable for developers, researchers, and policymakers involved in the advancement of AI technologies.

PDF Markdown Bookmark Chat (Pro)

Authors (7)

Yuxi Li (45 papers)
Yi Liu (543 papers)
Yuekang Li (34 papers)
Ling Shi (119 papers)
Gelei Deng (35 papers)
Shengquan Chen (1 paper)
Kailong Wang (41 papers)

Citations (11)

View on Semantic Scholar

Related Papers

Find Related Papers

Tweets

https://twitter.com/FSFG/status/1793883037517619509