Exploring Multilingual Concepts of Human Value in Large Language Models: Is Value Alignment Consistent, Transferable and Controllable across Languages? (2402.18120v3)

Published 28 Feb 2024 in cs.CL

Abstract: Prior research has revealed that certain abstract concepts are linearly represented as directions in the representation space of LLMs, predominantly centered on English. In this paper, we extend this investigation to a multilingual context, focusing on concepts related to human values (i.e., value concepts) because of their significance for AI safety. Through a comprehensive exploration covering 7 types of human values, 16 languages, and 3 LLM series with distinct multilinguality (monolingual, bilingual, and multilingual), we first empirically confirm the presence of value concepts within LLMs across languages. Further analysis of the cross-lingual characteristics of these concepts reveals 3 traits arising from language-resource disparities: cross-lingual inconsistency, distorted linguistic relationships, and unidirectional cross-lingual transfer between high- and low-resource languages, all in terms of value concepts. Moreover, we validate the feasibility of cross-lingual control over the value alignment of LLMs, using the dominant language as the source language. Finally, recognizing the significant impact of LLMs' multilinguality on our results, we consolidate our findings and offer prudent suggestions on the composition of multilingual pre-training data for LLMs.
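To make the underlying technique concrete, the sketch below illustrates the general recipe behind linear value-concept probing and activation steering: extract a concept direction as the difference-in-means of hidden states between value-consistent and value-violating prompts, then add that direction back into the residual stream at inference time. This is a minimal illustration, not the authors' code; the model choice (gpt2 as a stand-in for the multilingual LLMs studied), the layer index, the tiny English-only contrast set, and the steering strength ALPHA are all illustrative assumptions. In the paper's cross-lingual setting, the direction would be extracted from the dominant (high-resource) language and applied while the model processes other languages.

```python
# Hedged sketch of linear concept extraction + activation steering.
# Assumptions (not from the paper): model "gpt2", LAYER, ALPHA, prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; the paper studies multilingual LLM series
LAYER = 6        # hidden layer to probe (illustrative choice)
ALPHA = 4.0      # steering strength (illustrative choice)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def hidden_at_layer(text: str) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Tiny illustrative contrast set; the paper covers 7 value types
# and 16 languages with far larger prompt sets.
positive = ["I returned the lost wallet to its owner.",
            "She told the truth even though it was hard."]
negative = ["I kept the lost wallet and spent the money.",
            "She lied to cover up the mistake."]

# Difference-in-means yields one linear "value concept" direction.
direction = (torch.stack([hidden_at_layer(t) for t in positive]).mean(0)
             - torch.stack([hidden_at_layer(t) for t in negative]).mean(0))
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    """Add the concept direction to the residual stream at LAYER."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
prompt = tok("Finders keepers:", return_tensors="pt")
steered = model.generate(**prompt, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))
```

In the cross-lingual control experiments, the key design choice is that the contrast prompts used to compute `direction` come from the source (dominant) language, while the steered generation prompt is in a target language; the sketch keeps both in English only for brevity.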

Authors (5)
  1. Shaoyang Xu (6 papers)
  2. Weilong Dong (9 papers)
  3. Zishan Guo (5 papers)
  4. Xinwei Wu (9 papers)
  5. Deyi Xiong (103 papers)
Citations (6)