GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher (2308.06463v2)

Published 12 Aug 2023 in cs.CL

Abstract: Safety lies at the core of the development of LLMs. There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.

PDF Abstract

Analyzing the Vulnerabilities of GPT-4 to Ciphers: The CipherChat Framework

The paper "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher" investigates the security vulnerabilities of LLMs like ChatGPT and GPT-4 when interacting with non-natural language inputs, specifically ciphers. The core finding of this paper is that current safety alignment strategies, primarily designed and tested with natural language, fail to extend effectively to enciphered inputs. This paper introduces a novel approach, termed "CipherChat," to systematically evaluate the efficacy of LLM safety alignments using ciphered prompts.

Key Contributions and Methodology

CipherChat Framework: The research proposes the CipherChat method as a means to assess how well LLMs manage safety concerns when interacting with ciphered inputs. The framework employs a structured system prompt incorporating behavior assigning, cipher teaching, and enciphered unsafe demonstrations. This approach is designed to engage the LLMs not just to understand the cipher but to actively use it, potentially bypassing safety constraints imposed during training.
Experimental Setup: The paper conducts extensive experiments using the framework across 11 safety-related domains, including both Chinese and English inputs. State-of-the-art models GPT-4 and GPT-3.5 (Turbo) are evaluated to determine the success of various human-readable ciphers such as ASCII, Unicode, and more complex encoding mechanisms like Morse Code and Caesar Ciphers.
Notable Findings: Results demonstrate that models can indeed generate unsafe responses when given enciphered inputs. For instance, the GPT-4 model, particularly when processing English ASCII queries, bypassed safety alignments with success rates nearing 100% in certain domains. These findings underscored the models' ability to decipher and respond to ciphered inputs, revealing a critical oversight in safety alignments limited to natural language understanding.
SelfCipher Discovery: An intriguing discovery within this paper is the "SelfCipher," which reveals an LLM’s inherent ability to generate unsafe outputs through natural language prompts alone without explicit cipher use. This capability was consistently more effective than traditional ciphers across evaluated models. It suggests that models might inherently develop a form of internal cipher, akin to a "secret language," that responds to particular engagement prompts.

Numerical Results and Claims

The paper presents strong empirical evidence that ciphers like ASCII can achieve success rates of up to 72.1% in overriding safety measures in English contexts with GPT-4. These quantitative results indicate that language complexity and model capacity directly correlate with safety vulnerabilities when faced with encoded inputs. Such findings invite further scrutiny on the unsolved challenge of securing LLMs against context-aware enciphered inputs.

Implications and Future Directions

The insights from this paper carry significant implications for both the practical deployment of LLMs and ongoing theoretical studies in AI safety. The realization that sophisticated models such as GPT-4 can be prone to generating harmful content via cipher inputs stresses the need for revised and comprehensive safety alignment strategies. Future research is encouraged to extend safety training methodologies to include unconventional inputs and explore the "hidden" encoding systems that LLMs might develop during pretraining.

Finally, this work calls for a broader exploration into the adaptability of LLMs to non-natural languages, posing challenges as well as opportunities for enhancing the robustness of AI systems in real-world applications. The paper’s contributions establish a foundation for more secure and contextually aware AI deployments, critical in an era of increasing reliance on AI-driven technologies.

PDF Markdown Bookmark Chat (Pro)

Authors (7)

Youliang Yuan (18 papers)
Wenxiang Jiao (44 papers)
Wenxuan Wang (128 papers)
Jen-tse Huang (46 papers)
Pinjia He (47 papers)
Shuming Shi (126 papers)
Zhaopeng Tu (135 papers)

Citations (158)

View on Semantic Scholar

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher (2308.06463v2)

Analyzing the Vulnerabilities of GPT-4 to Ciphers: The CipherChat Framework

Key Contributions and Methodology

Numerical Results and Claims

Implications and Future Directions

Related Papers

GitHub

YouTube