
Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense (2410.21573v2)

Published 28 Oct 2024 in cs.CL and cs.AI

Abstract: Multilingual LLMs have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends -- words that are orthographically similar but have completely different meanings in two languages -- as a possible approach to pinpoint the limitation of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German; and challenge LLMs to distinguish their use in context. In our analysis of various models, we observe that they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying the cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.

Summary

  • The paper introduces StingrayBench and metrics (cognate bias, comprehension score) to evaluate multilingual LLMs' cross-lingual word sense disambiguation, focusing on orthographically similar words.
  • Findings show LLMs handle true cognates well but struggle markedly with false friends; scaling up model size improves true-cognate performance but does not close the false-friend gap.
  • The study highlights implications for LLM development, suggesting a need for diverse linguistic data to improve semantic understanding and real-world multilingual applications.

An Analysis of Multilingual LLMs in Cross-Lingual Word Sense Disambiguation: Evaluation and Challenges

The paper "Thank You, Stingray: Multilingual LLMs Can Not (Yet) Disambiguate Cross-Lingual Word Sense" provides a thorough investigation into the limitations of multilingual LLMs in understanding and disambiguating semantic meanings across languages. This paper introduces StingrayBench, a novel benchmark specifically crafted to measure cross-lingual word sense disambiguation involving words that are orthographically similar across languages, known as false friends and true cognates.

Key Contributions and Methodology

Central to the contributions of this work is the establishment of StingrayBench, which involves four language pairs—Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German. These language pairs were selected to illustrate the intricacies and challenges related to semantic disambiguation in multilingual LLMs. Through StingrayBench, the researchers prompt LLMs with tasks categorized into "semantic appropriateness" and "usage correction," aiming to uncover biases and the extent of comprehension exhibited by these models.
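
To make the task format concrete, here is a minimal sketch of how a semantic-appropriateness probe over a false-friend item might be constructed. The data schema, prompt wording, and example item are illustrative assumptions, not the authors' released benchmark code.

```python
# Hypothetical sketch of a "semantic appropriateness" probe in the spirit of
# StingrayBench; the schema and prompt wording are assumptions for
# illustration, not the benchmark's actual format.
from dataclasses import dataclass

@dataclass
class FalseFriendItem:
    word: str       # shared surface form, e.g. "percuma"
    lang_a: str     # first language of the pair
    lang_b: str     # second language of the pair
    sentence: str   # a sentence using the word in one of the two languages
    gold: str       # language(s) in which the usage is appropriate

def build_prompt(item: FalseFriendItem) -> str:
    """Ask the model whether the word's usage fits lang_a, lang_b, or both."""
    return (
        f'The word "{item.word}" exists in both {item.lang_a} and '
        f"{item.lang_b}, but its meaning may differ between them.\n"
        f"Sentence: {item.sentence}\n"
        f"In which language is this usage semantically appropriate? "
        f"Answer with exactly one of: {item.lang_a}, {item.lang_b}, both."
    )

# "percuma" is a commonly cited Indonesian-Malay false friend: roughly
# "in vain / useless" in Indonesian vs. "free of charge" in Malay.
item = FalseFriendItem(
    word="percuma",
    lang_a="Indonesian",
    lang_b="Malay",
    sentence="Tiket masuk ke muzium itu percuma.",
    gold="Malay",
)
print(build_prompt(item))
```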

The paper also introduces the "Stingray plot," a visualization of LLM performance in cross-lingual understanding along two newly proposed metrics: cognate bias and cognate comprehension score. These metrics quantify a model's bias toward higher-resource languages and its overall proficiency in distinguishing cross-lingual word senses.
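
The paper defines these metrics formally; since the exact formulas are not reproduced in this summary, the sketch below assumes the simplest plausible readings: bias as the signed accuracy gap between the pair's two languages, and comprehension as their mean accuracy. Treat it as a placeholder for the intuition rather than the paper's exact formulation.

```python
# Assumed (not the paper's verbatim) definitions of the two StingrayBench
# metrics, capturing the intuition described above.

def cognate_bias(acc_lang_a: float, acc_lang_b: float) -> float:
    """Signed per-language accuracy gap: 0 means no bias; a positive
    value means the model favors language A (e.g. the higher-resource one)."""
    return acc_lang_a - acc_lang_b

def cognate_comprehension(acc_lang_a: float, acc_lang_b: float) -> float:
    """Mean accuracy over both languages of the pair."""
    return (acc_lang_a + acc_lang_b) / 2

# Example: a model that resolves the English side of English-German false
# friends 90% of the time but the German side only 60% of the time.
print(cognate_bias(0.90, 0.60))           # ~0.30 -> skewed toward English
print(cognate_comprehension(0.90, 0.60))  # 0.75
```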

Results and Observations

The findings reveal that while LLMs exhibit notable comprehension of true cognates, their performance degrades sharply when tasked with identifying and disambiguating false friends. This suggests that although LLMs may perform adequately in contexts where meanings align across languages, their ability to discern subtle semantic differences remains limited.

The analysis further highlights a scaling advantage for true-cognate understanding, with performance correlating with model size. However, this gain does not carry over to the disambiguation of false friends, indicating a deeper systematic issue, possibly rooted in the models' pretraining data distribution and inherently English-centric nature.

An interesting pattern emerges: certain language pairs, such as Indonesian-Malay, pose greater challenges for LLMs, likely because of their close linguistic similarity. This suggests an avenue for further research into language representation techniques that could mitigate such biases and improve semantic differentiation.

Implications and Future Directions

The paper underscores critical implications for developing more robust and equitable multilingual models. The researchers advocate for the incorporation of diverse linguistic resources and semantic frameworks during model training, suggesting adjustments that could enhance models' cross-lingual representation capacities.

From a practical standpoint, addressing these semantic limitations could significantly impact real-world applications, ranging from translation systems to multilingual NLP tools, which rely on precise word sense disambiguation to function effectively across multiple languages.

In conclusion, this paper lays a foundational framework for analyzing and improving multilingual LLMs' cross-lingual disambiguation abilities. Future research directions may involve extending StingrayBench to a broader set of languages, exploring new model architectures or training paradigms, and refining evaluation metrics for a more comprehensive picture of multilingual semantic processing. Continued work in these areas promises LLMs that serve multilingual users more inclusively and accurately.