CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (2402.16717v1)

Published 26 Feb 2024 in cs.CL, cs.AI, and cs.CR

Abstract: Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for LLMs. This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
