
The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts (2401.13136v1)

Published 23 Jan 2024 in cs.CL and cs.AI

Abstract: As the influence of LLMs spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. This paper examines the variations in safety challenges faced by LLMs across different languages and discusses approaches to alleviating such concerns. By comparing how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages, we observe that (1) LLMs tend to generate unsafe responses much more often when a malicious prompt is written in a lower-resource language, and (2) LLMs tend to generate more irrelevant responses to malicious prompts in lower-resource languages. To understand where the discrepancy can be attributed, we study the effect of instruction tuning with reinforcement learning from human feedback (RLHF) or supervised finetuning (SFT) on the HH-RLHF dataset. Surprisingly, while training with high-resource languages improves model alignment, training in lower-resource languages yields minimal improvement. This suggests that the bottleneck of cross-lingual alignment is rooted in the pretraining stage. Our findings highlight the challenges in cross-lingual LLM safety, and we hope they inform future research in this direction.
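The abstract outlines a comparative evaluation protocol: send the same set of malicious prompts to a model in different languages, then measure how often the responses are unsafe or irrelevant. The sketch below is a minimal, hypothetical rendering of that loop. The function names (`evaluate_language`, `translate`, `query_model`, `classify_response`) are placeholders, not the authors' released code, and the response taxonomy is simplified to three labels.

```python
# Hypothetical sketch of the cross-lingual safety evaluation described above.
# All callables here are assumed placeholders, not the paper's actual tooling,
# and the label set {unsafe, irrelevant, refusal} simplifies the paper's taxonomy.

from collections import Counter
from typing import Callable

def evaluate_language(
    prompts_en: list[str],
    lang: str,
    translate: Callable[[str, str], str],          # (text, target_lang) -> translated text
    query_model: Callable[[str], str],             # prompt -> model response
    classify_response: Callable[[str, str], str],  # (prompt, response) -> label
) -> dict[str, float]:
    """Send the same malicious prompts in `lang` and tally response labels."""
    counts: Counter[str] = Counter()
    for prompt_en in prompts_en:
        # Keep English prompts as-is; translate for every other language.
        prompt = prompt_en if lang == "en" else translate(prompt_en, lang)
        response = query_model(prompt)
        counts[classify_response(prompt, response)] += 1
    total = len(prompts_en)
    return {label: counts[label] / total for label in ("unsafe", "irrelevant", "refusal")}

# Usage: compare a higher- vs. lower-resource language on the same prompt set.
# rates_en = evaluate_language(malicious_prompts, "en", translate, query_model, classify_response)
# rates_sw = evaluate_language(malicious_prompts, "sw", translate, query_model, classify_response)
# Finding (1) in the abstract corresponds to rates_sw["unsafe"] > rates_en["unsafe"].
```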

Authors (9)
  1. Lingfeng Shen
  2. Weiting Tan
  3. Sihao Chen
  4. Yunmo Chen
  5. Jingyu Zhang
  6. Haoran Xu
  7. Boyuan Zheng
  8. Philipp Koehn
  9. Daniel Khashabi
Citations (27)