
Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense (2410.21573v2)

Published 28 Oct 2024 in cs.CL and cs.AI

Abstract: Multilingual LLMs have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends -- words that are orthographically similar but have completely different meanings in two languages -- as a possible approach to pinpoint the limitation of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German; and challenge LLMs to distinguish their use in context. In our analysis of various models, we observe that they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying the cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.

Summary

  • The paper introduces StingrayBench and metrics (cognate bias, comprehension score) to evaluate multilingual LLMs' cross-lingual word sense disambiguation, focusing on orthographically similar words.
  • Findings show LLMs handle true cognates well but struggle markedly with false friends; scaling up model size improves true-cognate performance but does not close the false-friend gap.
  • The study highlights implications for LLM development, suggesting a need for diverse linguistic data to improve semantic understanding and real-world multilingual applications.

An Analysis of Multilingual LLMs in Cross-Lingual Word Sense Disambiguation: Evaluation and Challenges

The paper "Thank You, Stingray: Multilingual LLMs Can Not (Yet) Disambiguate Cross-Lingual Word Sense" provides a thorough investigation into the limitations of multilingual LLMs in understanding and disambiguating semantic meanings across languages. This paper introduces StingrayBench, a novel benchmark specifically crafted to measure cross-lingual word sense disambiguation involving words that are orthographically similar across languages, known as false friends and true cognates.

Key Contributions and Methodology

Central to the contributions of this work is the establishment of StingrayBench, which involves four language pairs—Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German. These language pairs were selected to illustrate the intricacies and challenges related to semantic disambiguation in multilingual LLMs. Through StingrayBench, the researchers prompt LLMs with tasks categorized into "semantic appropriateness" and "usage correction," aiming to uncover biases and the extent of comprehension exhibited by these models.
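
To make the task format concrete, here is a minimal sketch of how a semantic-appropriateness probe over a false-friend item might be constructed. The data schema, prompt wording, and example item are illustrative assumptions, not the authors' released benchmark code.

```python
# Hypothetical sketch of a "semantic appropriateness" probe in the spirit of
# StingrayBench; the schema and prompt wording are assumptions for
# illustration, not the benchmark's actual format.
from dataclasses import dataclass

@dataclass
class FalseFriendItem:
    word: str       # shared surface form, e.g. "percuma"
    lang_a: str     # first language of the pair
    lang_b: str     # second language of the pair
    sentence: str   # a sentence using the word in one of the two languages
    gold: str       # language(s) in which the usage is appropriate

def build_prompt(item: FalseFriendItem) -> str:
    """Ask the model whether the word's usage fits lang_a, lang_b, or both."""
    return (
        f'The word "{item.word}" exists in both {item.lang_a} and '
        f"{item.lang_b}, but its meaning may differ between them.\n"
        f"Sentence: {item.sentence}\n"
        f"In which language is this usage semantically appropriate? "
        f"Answer with exactly one of: {item.lang_a}, {item.lang_b}, both."
    )

# "percuma" is a commonly cited Indonesian-Malay false friend: roughly
# "in vain / useless" in Indonesian vs. "free of charge" in Malay.
item = FalseFriendItem(
    word="percuma",
    lang_a="Indonesian",
    lang_b="Malay",
    sentence="Tiket masuk ke muzium itu percuma.",
    gold="Malay",
)
print(build_prompt(item))
```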

The paper also introduces the "Stingray plot," a visualization of LLM performance in cross-lingual understanding along two newly proposed metrics: cognate bias and cognate comprehension score. These metrics quantify a model's bias toward higher-resource languages and its overall proficiency in distinguishing cross-lingual word senses.
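
The paper defines these metrics formally; since the exact formulas are not reproduced in this summary, the sketch below assumes the simplest plausible readings: bias as the signed accuracy gap between the pair's two languages, and comprehension as their mean accuracy. Treat it as a placeholder for the intuition rather than the paper's exact formulation.

```python
# Assumed (not the paper's verbatim) definitions of the two StingrayBench
# metrics, capturing the intuition described above.

def cognate_bias(acc_lang_a: float, acc_lang_b: float) -> float:
    """Signed per-language accuracy gap: 0 means no bias; a positive
    value means the model favors language A (e.g. the higher-resource one)."""
    return acc_lang_a - acc_lang_b

def cognate_comprehension(acc_lang_a: float, acc_lang_b: float) -> float:
    """Mean accuracy over both languages of the pair."""
    return (acc_lang_a + acc_lang_b) / 2

# Example: a model that resolves the English side of English-German false
# friends 90% of the time but the German side only 60% of the time.
print(cognate_bias(0.90, 0.60))           # ~0.30 -> skewed toward English
print(cognate_comprehension(0.90, 0.60))  # 0.75
```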

Results and Observations

The findings reveal that while LLMs exhibit notable comprehension of true cognates, their performance degrades sharply when tasked with identifying and disambiguating false friends. This suggests that although LLMs may perform adequately in contexts where meanings align across languages, their ability to discern subtle semantic differences remains limited.

The analysis further highlights a scaling advantage for true-cognate understanding, with performance correlating with model size. However, this gain does not carry over to the disambiguation of false friends, indicating a deeper systematic issue, possibly rooted in the models' pretraining data distribution and inherently English-centric nature.

An interesting pattern emerges: certain language pairs, such as Indonesian-Malay, pose greater challenges for LLMs, likely because of their close linguistic similarity. This suggests an avenue for further research into language representation techniques that could mitigate such biases and improve semantic differentiation.

Implications and Future Directions

The paper underscores critical implications for developing more robust and equitable multilingual models. The researchers advocate for the incorporation of diverse linguistic resources and semantic frameworks during model training, suggesting adjustments that could enhance models' cross-lingual representation capacities.

From a practical standpoint, addressing these semantic limitations could significantly impact real-world applications, ranging from translation systems to multilingual NLP tools, which rely on precise word sense disambiguation to function effectively across multiple languages.

In conclusion, this paper lays a foundational framework for analyzing and improving multilingual LLMs' cross-lingual disambiguation abilities. Future research directions may involve extending StingrayBench to a broader set of languages, exploring new model architectures or training paradigms, and refining evaluation metrics for a more comprehensive picture of multilingual semantic processing. Continued work in these areas promises LLMs that serve multilingual users more inclusively and accurately.