
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (2402.16717v1)

Published 26 Feb 2024 in cs.CL, cs.AI, and cs.CR

Abstract: Adversarial misuse, particularly through 'jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for LLMs. This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
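
As a concrete illustration of the pipeline the abstract describes (encrypt the query, wrap it in a code completion task, and embed the matching decryption function), here is a minimal sketch in Python. The reverse-word-order scheme, the names encrypt_query, build_decryption_stub, and ProblemSolver, and the benign placeholder query are illustrative assumptions, not the paper's actual encryption functions or prompt templates.

# Hypothetical sketch of a "personalized encryption" wrapper in the spirit of
# CodeChameleon; the paper's real functions and templates may differ.

def encrypt_query(query: str) -> list[str]:
    """Encrypt a plain-text query by reversing its word order."""
    return query.split()[::-1]

def build_decryption_stub(encrypted: list[str]) -> str:
    """Embed the matching decryption function inside a code-completion
    style prompt, so the model can recover and act on the query."""
    return f'''
def decrypt(words):
    # Reverse the word list back to recover the original query.
    return " ".join(words[::-1])

encrypted_query = {encrypted!r}

# Complete the class below so that solve() answers decrypt(encrypted_query).
class ProblemSolver:
    def solve(self):
        ...
'''

if __name__ == "__main__":
    # Benign placeholder query used purely for illustration.
    prompt = build_decryption_stub(encrypt_query("explain how photosynthesis works"))
    print(prompt)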

Authors (9)
  1. Huijie Lv (3 papers)
  2. Xiao Wang (507 papers)
  3. Yuansen Zhang (6 papers)
  4. Caishuang Huang (13 papers)
  5. Shihan Dou (46 papers)
  6. Junjie Ye (66 papers)
  7. Tao Gui (127 papers)
  8. Qi Zhang (785 papers)
  9. Xuanjing Huang (287 papers)
Citations (18)
