CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (2402.16717v1)
Abstract: Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for LLMs. This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
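The encrypt-then-embed-a-decryptor idea described in the abstract can be illustrated with a minimal sketch. The word-reversal cipher, function names, and prompt template below are illustrative assumptions, not the paper's exact implementation; the paper describes a family of personalized encryption functions paired with matching decryptors embedded in a code-completion prompt.

```python
# Minimal sketch of a personalized encryption/decryption pair in the spirit
# of CodeChameleon. The word-reversal cipher, function names, and prompt
# template are assumptions for illustration only.

def encrypt_query(query: str) -> list[str]:
    """Encrypt a natural-language query by reversing its word order."""
    return query.split()[::-1]

# Source of the matching decryption function, embedded verbatim in the prompt
# so the model can recover the original query before answering it.
DECRYPTION_FUNCTION = '''
def decrypt(encrypted_query):
    """Recover the original query by reversing the word list."""
    return " ".join(encrypted_query[::-1])
'''

def build_prompt(encrypted_query: list[str]) -> str:
    """Wrap the encrypted query in a code-completion style instruction
    that carries the decryption function alongside the ciphertext."""
    return (
        "Complete the following Python task.\n"
        f"{DECRYPTION_FUNCTION}\n"
        f"encrypted_query = {encrypted_query}\n"
        "# Step 1: call decrypt(encrypted_query) to recover the task.\n"
        "# Step 2: produce a solution to the recovered task.\n"
    )

if __name__ == "__main__":
    # Benign example query used purely to show the prompt structure.
    benign_example = "summarize the plot of Hamlet"
    print(build_prompt(encrypt_query(benign_example)))
```

In this sketch the query never appears in plain text, which is what the hypothesis attributes the bypass of intent security recognition to, while the embedded decryptor preserves the model's ability to reconstruct and execute the task.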
Authors: Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang