CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (2403.07865v5)
Abstract: The rapid advancement of LLMs has brought remarkable generative capabilities but also raised concerns about their potential misuse. While strategies such as supervised fine-tuning and reinforcement learning from human feedback have improved their safety, these methods focus primarily on natural language and may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, providing a novel environment for testing the safety generalization of LLMs. Comprehensive studies on state-of-the-art LLMs, including the GPT-4, Claude-2, and Llama-2 series, reveal a new and universal safety vulnerability of these models to code input: CodeAttack bypasses the safety guardrails of all models more than 80% of the time. We find that a larger distribution gap between CodeAttack and natural language weakens safety generalization, for example when the natural language input is encoded with data structures. We further hypothesize that CodeAttack succeeds because of a misaligned bias acquired during code training: models prioritize completing the code over avoiding potential safety risks. Finally, we analyze potential mitigation measures. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms that match the code capabilities of LLMs.
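To make the described transformation concrete, here is a minimal sketch of the kind of prompt construction the abstract outlines: a natural language query is encoded word by word into a Python data structure (a list used as a stack) and wrapped in a code-completion task. The template, function names, and placeholder query below are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative sketch (not the paper's exact prompt template): encode a
# natural-language query into a data structure and embed it in a
# code-completion task, as described at a high level in the abstract.

def encode_query_as_stack(query: str) -> str:
    """Encode a query word by word as pushes onto a Python list used as a stack."""
    pushes = "\n".join(f'my_stack.append("{word}")' for word in query.split())
    return f"my_stack = []\n{pushes}"

def build_code_completion_prompt(query: str) -> str:
    """Wrap the encoded query in a code-completion task for the target model."""
    return (
        "# Complete the function below.\n"
        f"{encode_query_as_stack(query)}\n\n"
        "def respond_to_stack(stack):\n"
        '    """Reconstruct the query from the stack and return a response."""\n'
        "    # ... to be completed by the model ...\n"
    )

if __name__ == "__main__":
    # Benign placeholder query, used only to illustrate the encoding step.
    print(build_code_completion_prompt("how do I plan a surprise party"))
```

The intuition the paper tests is that moving the query further from natural language (e.g., into such data-structure encodings) widens the distribution gap and makes safety training less likely to transfer.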
Authors: Qibing Ren, Chang Gao, Jing Shao, Junchi Yan, Xin Tan, Wai Lam, Lizhuang Ma