Multilingual Jailbreak Challenges in Large Language Models (2310.06474v3)

Published 10 Oct 2023 in cs.CL

Abstract: While LLMs exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the ``jailbreak'' problem, wherein malicious instructions can manipulate LLMs to exhibit undesirable behavior. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potential risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. Specifically, low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92\% for ChatGPT and 40.71\% for GPT-4. To handle such a challenge in the multilingual context, we propose a novel \textsc{Self-Defense} framework that automatically generates multilingual training data for safety fine-tuning. Experimental results show that ChatGPT fine-tuned with such data can achieve a substantial reduction in unsafe content generation. Data is available at \url{https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs}.

PDF HTML Abstract

Multilingual Jailbreak Challenges in LLMs

The paper "Multilingual Jailbreak Challenges in LLMs" addresses crucial safety considerations within LLMs, such as ChatGPT and GPT-4, specifically focusing on multilingual contexts. This paper contributes to the ongoing discourse surrounding the ethical deployment of LLMs by revealing the vulnerabilities that arise when confronting multilingual jailbreak scenarios, offering quantitative assessments of these risks, and proposing mechanisms for safer use.

Research Context and Objectives

The proliferation of LLMs like ChatGPT, GPT-4, Claude, and Llama has facilitated their broad application across numerous domains. While their multilingual processing prowess owes much to extensive pre-training on diverse datasets, this generality also becomes a liability. Current safety measures are substantially skewed towards English, neglecting the multilingual capabilities these models inherently possess. The researchers identify two principal jailbreak scenarios: unintentional generation of unsafe outputs through non-English prompts, and deliberate manipulations using multilingual malicious instructions to extract harmful content.

Key Findings

Unintentional Jailbreaks: The paper reveals that low-resource languages are about three times more susceptible to unsafe content generation in LLMs compared to high-resource languages. For instance, the unsafe content generation probability in Bengali, a low-resource language, is significantly higher, which points towards a serious deficiency in current safety mechanisms when examined under a multilingual lens.
Intentional Jailbreaks: The paper emphasizes that combining multilingual prompts with malicious instructions remains alarmingly effective. In this scenario, the unsafe output rates escalated, with ChatGPT reaching an 80.92% success rate, while GPT-4 recorded a lesser, but still concerning, rate of 40.71%.
Adaptive Multilingual Attacks: By iteratively querying models over multiple languages, an adversary can substantially increase the likelihood of bypassing safety measures. An adaptive strategy incorporating multiple low-resource languages resulted in almost a 45% success rate in breaching ChatGPT's safety protocols.

Proposed Solution: Self-Defense Framework

To address these vulnerabilities, the authors propose the Self-Defence framework. This novel paradigm invokes LLMs to auto-generate multilingual safety training data using a self-instruct method. Thereafter, this data is used to fine-tune LLMs, thus reinforcing their safety in a multilingual setup. Initial experiments demonstrated a promising decline in unsafe generation rates post-training.

Practical and Theoretical Implications

The research paper highlights the pressing need for enhanced safety protocols that encompass the multilingual capabilities of LLMs. Practically, it argues for a more inclusive approach in AI safety paradigms where multilingual training is prioritized similar to English. Theoretically, it poses questions about the optimal balance between safety and model generality, an area ripe for future investigations.

The introduction of the Self-Defence framework represents a significant stride in automating the production of diversified training data to facilitate a broader language coverage without excessive human effort. While the paper successfully decreases unsafe output instances, it also acknowledges the trade-off between safety and usefulness, indicating further fine-tuning in instructional content is essential for ideal model performance.

Conclusion and Future Directions

In sum, this paper provides an incisive look at the shortcomings of current LLM safety measures in multilingual contexts and offers a viable solution for attenuating these deficits. Going forward, research can explore the scalability of the Self-Defence framework and investigate integrating more sophisticated translation techniques to bolster language-specific safety alignments. Additionally, the paper lays the groundwork for more expansive ethical discussions and policies adjusting model capabilities across varying linguistic landscapes.