CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (2402.16717v1)

Published 26 Feb 2024 in cs.CL, cs.AI, and cs.CR

Abstract: Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for LLMs. This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
