Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense (2410.21573v2)
Abstract: Multilingual LLMs have gained prominence, but concerns remain about their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing StingrayBench, a novel benchmark for cross-lingual sense disambiguation. We demonstrate the use of false friends (words that are orthographically similar but have completely different meanings in two languages) as a possible approach to pinpointing the limitations of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German, and challenge LLMs to distinguish their use in context. In our analysis of various models, we observe that they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.