CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (2402.16717v1)
Abstract: Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for LLMs. This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
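The encrypt-then-embed-a-decryptor idea described in the abstract can be illustrated with a minimal sketch. The word-reversal cipher, function names, and prompt template below are illustrative assumptions, not the paper's exact implementation; the paper describes a family of personalized encryption functions paired with matching decryptors embedded in a code-completion prompt.

```python
# Minimal sketch of a personalized encryption/decryption pair in the spirit
# of CodeChameleon. The word-reversal cipher, function names, and prompt
# template are assumptions for illustration only.

def encrypt_query(query: str) -> list[str]:
    """Encrypt a natural-language query by reversing its word order."""
    return query.split()[::-1]

# Source of the matching decryption function, embedded verbatim in the prompt
# so the model can recover the original query before answering it.
DECRYPTION_FUNCTION = '''
def decrypt(encrypted_query):
    """Recover the original query by reversing the word list."""
    return " ".join(encrypted_query[::-1])
'''

def build_prompt(encrypted_query: list[str]) -> str:
    """Wrap the encrypted query in a code-completion style instruction
    that carries the decryption function alongside the ciphertext."""
    return (
        "Complete the following Python task.\n"
        f"{DECRYPTION_FUNCTION}\n"
        f"encrypted_query = {encrypted_query}\n"
        "# Step 1: call decrypt(encrypted_query) to recover the task.\n"
        "# Step 2: produce a solution to the recovered task.\n"
    )

if __name__ == "__main__":
    # Benign example query used purely to show the prompt structure.
    benign_example = "summarize the plot of Hamlet"
    print(build_prompt(encrypt_query(benign_example)))
```

In this sketch the query never appears in plain text, which is what the hypothesis attributes the bypass of intent security recognition to, while the embedded decryptor preserves the model's ability to reconstruct and execute the task.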
Authors: Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang