Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Multilingual Jailbreak Challenges in Large Language Models (2310.06474v3)

Published 10 Oct 2023 in cs.CL

Abstract: While LLMs exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the ``jailbreak'' problem, wherein malicious instructions can manipulate LLMs to exhibit undesirable behavior. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potential risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. Specifically, low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92\% for ChatGPT and 40.71\% for GPT-4. To handle such a challenge in the multilingual context, we propose a novel \textsc{Self-Defense} framework that automatically generates multilingual training data for safety fine-tuning. Experimental results show that ChatGPT fine-tuned with such data can achieve a substantial reduction in unsafe content generation. Data is available at \url{https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs}.

Multilingual Jailbreak Challenges in LLMs

The paper "Multilingual Jailbreak Challenges in LLMs" addresses crucial safety considerations within LLMs, such as ChatGPT and GPT-4, specifically focusing on multilingual contexts. This paper contributes to the ongoing discourse surrounding the ethical deployment of LLMs by revealing the vulnerabilities that arise when confronting multilingual jailbreak scenarios, offering quantitative assessments of these risks, and proposing mechanisms for safer use.

Research Context and Objectives

The proliferation of LLMs like ChatGPT, GPT-4, Claude, and Llama has facilitated their broad application across numerous domains. While their multilingual processing prowess owes much to extensive pre-training on diverse datasets, this generality also becomes a liability. Current safety measures are substantially skewed towards English, neglecting the multilingual capabilities these models inherently possess. The researchers identify two principal jailbreak scenarios: unintentional generation of unsafe outputs through non-English prompts, and deliberate manipulations using multilingual malicious instructions to extract harmful content.

Key Findings

  1. Unintentional Jailbreaks: The paper reveals that low-resource languages are about three times more susceptible to unsafe content generation in LLMs compared to high-resource languages. For instance, the unsafe content generation probability in Bengali, a low-resource language, is significantly higher, which points towards a serious deficiency in current safety mechanisms when examined under a multilingual lens.
  2. Intentional Jailbreaks: The paper emphasizes that combining multilingual prompts with malicious instructions remains alarmingly effective. In this scenario, the unsafe output rates escalated, with ChatGPT reaching an 80.92% success rate, while GPT-4 recorded a lesser, but still concerning, rate of 40.71%.
  3. Adaptive Multilingual Attacks: By iteratively querying models over multiple languages, an adversary can substantially increase the likelihood of bypassing safety measures. An adaptive strategy incorporating multiple low-resource languages resulted in almost a 45% success rate in breaching ChatGPT's safety protocols.

Proposed Solution: Self-Defense Framework

To address these vulnerabilities, the authors propose the Self-Defence framework. This novel paradigm invokes LLMs to auto-generate multilingual safety training data using a self-instruct method. Thereafter, this data is used to fine-tune LLMs, thus reinforcing their safety in a multilingual setup. Initial experiments demonstrated a promising decline in unsafe generation rates post-training.

Practical and Theoretical Implications

The research paper highlights the pressing need for enhanced safety protocols that encompass the multilingual capabilities of LLMs. Practically, it argues for a more inclusive approach in AI safety paradigms where multilingual training is prioritized similar to English. Theoretically, it poses questions about the optimal balance between safety and model generality, an area ripe for future investigations.

The introduction of the Self-Defence framework represents a significant stride in automating the production of diversified training data to facilitate a broader language coverage without excessive human effort. While the paper successfully decreases unsafe output instances, it also acknowledges the trade-off between safety and usefulness, indicating further fine-tuning in instructional content is essential for ideal model performance.

Conclusion and Future Directions

In sum, this paper provides an incisive look at the shortcomings of current LLM safety measures in multilingual contexts and offers a viable solution for attenuating these deficits. Going forward, research can explore the scalability of the Self-Defence framework and investigate integrating more sophisticated translation techniques to bolster language-specific safety alignments. Additionally, the paper lays the groundwork for more expansive ethical discussions and policies adjusting model capabilities across varying linguistic landscapes.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (36)
  1. Anthropic. Model card and evaluations for claude models. 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862, 2022. URL https://doi.org/10.48550/arXiv.2204.05862.
  3. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023, 2023. URL https://doi.org/10.48550/arXiv.2302.04023.
  4. Red-teaming large language models using chain of utterances for safety-alignment. ArXiv, abs/2308.09662, 2023. URL https://doi.org/10.48550/arXiv.2308.09662.
  5. A survey on adversarial attacks and defences. CAAI Trans. Intell. Technol., 6(1):25–45, 2021. URL https://doi.org/10.1049/cit2.12028.
  6. A survey on evaluation of large language models. CoRR, abs/2307.03109, 2023. URL https://doi.org/10.48550/arXiv.2307.03109.
  7. How is chatgpt’s behavior changing over time? CoRR, abs/2307.09009, 2023. URL https://doi.org/10.48550/arXiv.2307.09009.
  8. ChatGPT Goes to Law School, 2023. URL https://papers.ssrn.com/abstract=4335905.
  9. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  4299–4307, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html.
  10. XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp.  2475–2485. Association for Computational Linguistics, 2018. URL https://doi.org/10.18653/v1/d18-1269.
  11. Jailbreaker: Automated jailbreak across multiple large language model chatbots. CoRR, abs/2307.08715, 2023. URL https://doi.org/10.48550/arXiv.2307.08715.
  12. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. CoRR, abs/2209.07858, 2022. URL https://doi.org/10.48550/arXiv.2209.07858.
  13. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  3309–3326, 2022. URL https://doi.org/10.18653/v1/2022.acl-long.234.
  14. Julian Hazell. Large language models can be used to effectively scale spear phishing campaigns. CoRR, abs/2305.06972, 2023. URL https://doi.org/10.48550/arXiv.2305.06972.
  15. Is chatgpt A good translator? A preliminary study. CoRR, abs/2301.08745, 2023. URL https://doi.org/10.48550/arXiv.2301.08745.
  16. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. CoRR, abs/2304.05613, 2023. URL https://doi.org/10.48550/arXiv.2304.05613.
  17. Multi-step jailbreaking privacy attacks on chatgpt. CoRR, abs/2304.05197, 2023. URL https://doi.org/10.48550/arXiv.2304.05197.
  18. Common sense beyond english: Evaluating and improving multilingual language models for commonsense reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp.  1274–1287. Association for Computational Linguistics, 2021. URL https://doi.org/10.18653/v1/2021.acl-long.102.
  19. Jailbreaking chatgpt via prompt engineering: An empirical study. CoRR, abs/2305.13860, 2023. URL https://doi.org/10.48550/arXiv.2305.13860.
  20. A holistic approach to undesired content detection in the real world. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pp.  15009–15018. AAAI Press, 2023. URL https://doi.org/10.1609/aaai.v37i12.26752.
  21. OpenAI. Chatgpt. 2023a. URL https://openai.com/chatgpt.
  22. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023b. URL https://doi.org/10.48550/arXiv.2303.08774.
  23. Training language models to follow instructions with human feedback. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
  24. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp.  3419–3448, 2022. URL https://doi.org/10.18653/v1/2022.emnlp-main.225.
  25. Is chatgpt a general-purpose natural language processing task solver? CoRR, abs/2302.06476, 2023. URL https://doi.org/10.48550/arXiv.2302.06476.
  26. Exploring new frontiers in agricultural NLP: investigating the potential of large language models for food applications. CoRR, abs/2306.11892, 2023. URL https://doi.org/10.48550/arXiv.2306.11892.
  27. ”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. CoRR, abs/2308.03825, 2023. URL https://doi.org/10.48550/arXiv.2308.03825.
  28. Large language models encode clinical knowledge. CoRR, abs/2212.13138, 2022. URL https://doi.org/10.48550/arXiv.2212.13138.
  29. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. URL https://doi.org/10.48550/arXiv.2307.09288.
  30. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  13484–13508, 2023. URL https://doi.org/10.18653/v1/2023.acl-long.754.
  31. Jailbroken: How does LLM safety training fail? CoRR, abs/2307.02483, 2023. URL https://doi.org/10.48550/arXiv.2307.02483.
  32. Challenges in detoxifying language models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pp.  2447–2469. Association for Computational Linguistics, 2021. URL https://doi.org/10.18653/v1/2021.findings-emnlp.210.
  33. GPT-4 is too smart to be safe: Stealthy chat with llms via cipher. CoRR, abs/2308.06463, 2023. URL https://doi.org/10.48550/arXiv.2308.06463.
  34. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. CoRR, abs/2306.05179, 2023a. URL https://doi.org/10.48550/arXiv.2306.05179.
  35. Sentiment analysis in the era of large language models: A reality check. CoRR, abs/2305.15005, 2023b. URL https://doi.org/10.48550/arXiv.2305.15005.
  36. Universal and transferable adversarial attacks on aligned language models. CoRR, abs/2307.15043, 2023. URL https://doi.org/10.48550/arXiv.2307.15043.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Yue Deng (44 papers)
  2. Wenxuan Zhang (75 papers)
  3. Sinno Jialin Pan (32 papers)
  4. Lidong Bing (144 papers)
Citations (86)
X Twitter Logo Streamline Icon: https://streamlinehq.com
Reddit Logo Streamline Icon: https://streamlinehq.com