Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Low-Resource Languages Jailbreak GPT-4 (2310.02446v2)

Published 3 Oct 2023 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: AI safety training and red-teaming of LLMs are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affects speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLMs users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for a more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

Assessment of Cross-Lingual Safety Vulnerabilities in GPT-4

The paper "Low-Resource Languages Jailbreak GPT-4" explores a critical aspect of AI safety regarding LLMs by examining vulnerabilities in GPT-4's safety mechanisms across different languages. The authors present a systematic analysis demonstrating that safety margin deficiencies caused by linguistic disparities pose significant security risks when translating unsafe inputs from English into low-resource languages.

The investigation involves translating unsafe English inputs into lesser-resourced languages using publicly available APIs like Google Translate. Evaluated on the AdvBench benchmark, these translated inputs had a 79% success rate in bypassing GPT-4's safeguards and eliciting harmful responses, rivaling even the most robust contemporary jailbreaking techniques. This suggests a pronounced vulnerability in GPT-4's cross-lingual safety measures that are inefficient in lower-resourced contexts compared to high/mid-resource language scenarios, where attack success rates were markedly lower.

The authors advance several compelling arguments and implications:

  1. Cross-Lingual AI Vulnerability: The paper highlights that GPT-4 and, likely, other LLMs exhibit significant safety lapses when interfaced in low-resource languages. Historically, insufficient training data primarily affected accessibility and utility for speakers of low-resource languages. The findings, however, indicate a broader jeopardy—expanding the potential for model misuse across all language users. The ease of accessing automated translation services exacerbates this risk, enabling attackers to exploit safety loopholes in LLMs.
  2. Imbalanced Linguistic Representation: This vulnerability underscores a persistent imbalance in AI safety and linguistic representation in model training. The research reveals that GPT-4's safety mechanisms fail to adequately generalize across languages, a shortcoming that the authors attribute to skewed priorities within AI alignment training. There is an evident need for more equitable and inclusive safety measures that ensure LLMs perform effectively across linguistic boundaries, with comprehensive coverage of low-resource languages.
  3. Necessity for Multilingual Safety Protocols: The conclusion presses for an imperative expansion of red-teaming approaches beyond monolingual and predominantly English-centric frameworks. While current models may pass English-centric safety tests, the reality is that models like GPT-4 are deployed across multilingual platforms and use-cases, necessitating robust defenses against multi-lingual threat vectors. Therefore, developing datasets and benchmarks for multilingual safety assurance is crucial for establishing comprehensive security standards in AI models.

From these perspectives, the research evidently lays ground for heightened rigor in safety protocol development across diverse linguistic landscapes, ensuring LLMs like GPT-4 remain reliable and accountable in performance across varied user demographics. Future work in the area might involve a deeper investigation into the mechanisms behind the identified vulnerabilities in translation-based attacks and exploring scalable approaches to enhance safety across different LLMs without compromising performance or accessibility to disadvantaged linguistic populations.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (49)
  1. Jigsaw/Conversation AI. Jigsaw multilingual toxic comment classification, 2020. https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification, Last accessed on 2023-09-14.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
  3. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
  4. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
  5. Building machine translation systems for the next thousand languages. arXiv preprint arXiv:2205.03983, 2022.
  6. Seamlessm4t-massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596, 2023.
  7. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.376. URL https://aclanthology.org/2022.acl-long.376.
  8. Jailbreak Chat. Aim, 2023a. https://www.jailbreakchat.com/prompt/4f37a029-9dff-4862-b323-c96a5504de5d, Last accessed on 2023-09-13.
  9. Jailbreak Chat. Translatorbot, 2023b. https://www.jailbreakchat.com/prompt/3e93895c-2542-4201-a297-aa8be2db8bd7, Last accessed on 2023-09-11.
  10. How is chatgpt’s behavior changing over time?, 2023.
  11. CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2819–2829, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1271. URL https://aclanthology.org/P19-1271.
  12. Google Cloud. Language support, 2023. https://cloud.google.com/translate/docs/languages, Last accessed on 2023-09-14.
  13. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022.
  14. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  15. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023.
  16. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
  17. Chatgpt perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across bengali and five other low-resource languages. arXiv preprint arXiv:2305.10510, 2023.
  18. How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210, 2023.
  19. Cbbq: A chinese bias benchmark dataset curated with human-ai collaboration for large language models. arXiv preprint arXiv:2306.16244, 2023.
  20. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1215. URL https://aclanthology.org/D17-1215.
  21. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.560.
  22. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613, 2023.
  23. Rain: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124, 2023.
  24. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023a.
  25. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023b.
  26. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023c.
  27. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790, 2021.
  28. OpenAI. Duolingo, 2023a. https://openai.com/customer-stories/duolingo, Last accessed on 2023-09-14.
  29. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023b.
  30. OpenAI. Government of iceland, 2023c. https://openai.com/customer-stories/government-of-iceland, Last accessed on 2023-09-14.
  31. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
  32. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.225. URL https://aclanthology.org/2022.emnlp-main.225.
  33. Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2340–2354, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.184. URL https://aclanthology.org/2022.findings-acl.184.
  34. Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149, 2023.
  35. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  36. Why so toxic? measuring and triggering toxic behavior in open-domain chatbots. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 2659–2673, 2022.
  37. Universal adversarial attacks with natural triggers for text classification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3724–3733, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.291. URL https://aclanthology.org/2021.naacl-main.291.
  38. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1164. URL https://aclanthology.org/P19-1164.
  39. ChatGPT is not a good indigenous translator. In Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), pages 163–167, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.americasnlp-1.17. URL https://aclanthology.org/2023.americasnlp-1.17.
  40. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  41. Translated. Translated unleashes full gpt-4 potential for businesses operating in languages other than english, 2023. https://translated.com/t-lm-gpt-integration, Last accessed on 2023-09-14.
  42. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
  43. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023.
  44. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  45. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
  46. Prompting large language models to generate code-mixed texts: The case of south east asian languages. arXiv preprint arXiv:2303.13592, 2023.
  47. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  48. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867, pages 12–2, 2023.
  49. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Zheng-Xin Yong (23 papers)
  2. Cristina Menghini (13 papers)
  3. Stephen H. Bach (33 papers)
Citations (130)
Youtube Logo Streamline Icon: https://streamlinehq.com

HackerNews

Reddit Logo Streamline Icon: https://streamlinehq.com

Reddit