
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment (2410.05873v1)

Published 8 Oct 2024 in cs.CL and cs.AI

Abstract: English-centric LLMs often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://huggingface.co/spaces/cis-lmu/Mexa, Code: https://github.com/cisnlp/Mexa.

Authors (6)
  1. Amir Hossein Kargaran (16 papers)
  2. Ali Modarressi (16 papers)
  3. Nafiseh Nikeghbal (4 papers)
  4. Jana Diesner (21 papers)
  5. François Yvon (49 papers)
  6. Hinrich Schütze (250 papers)
Citations (1)

Summary

Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

The paper presents a novel method for evaluating the multilingual capabilities of English-centric LLMs. This approach addresses the gap in comprehensive multilingual performance assessments by leveraging cross-lingual alignment through parallel sentences.

Overview of the Method

The authors introduce a method called MEXA, which assesses the alignment between English and other languages within LLMs by comparing embeddings of parallel sentences. This alignment serves as a proxy for the models' multilingual understanding, enabling more accurate predictions of their performance across diverse languages.
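The core idea can be sketched in a few lines. The snippet below is a simplified illustration rather than the paper's exact procedure: it treats alignment as the fraction of English sentences whose true translation is their nearest neighbour by cosine similarity among all candidate sentences (the function name and toy data are invented for illustration):

```python
import numpy as np

def alignment_score(eng_embs, other_embs):
    """Fraction of sentences whose true translation is the nearest
    neighbour among all candidates (a simplified proxy for MEXA's
    parallel-sentence alignment). Both inputs are (n_sentences, dim)."""
    # Normalise rows so that dot products equal cosine similarities.
    eng = eng_embs / np.linalg.norm(eng_embs, axis=1, keepdims=True)
    oth = other_embs / np.linalg.norm(other_embs, axis=1, keepdims=True)
    sims = eng @ oth.T                       # sims[i, j] = cos(e_i, f_j)
    hits = sims.argmax(axis=1) == np.arange(sims.shape[0])
    return float(hits.mean())

# Toy demo: 3 "parallel" pairs whose embeddings are near-copies,
# so each sentence's nearest neighbour is its own translation.
rng = np.random.default_rng(0)
eng = rng.normal(size=(3, 8))
oth = eng + 0.01 * rng.normal(size=(3, 8))
print(alignment_score(eng, oth))
```

A score near 1.0 indicates that parallel sentences land close together in the model's representation space; a score near chance level indicates little cross-lingual alignment.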

Experimental Setup

The paper utilizes various parallel datasets, such as FLORES-200 and the Bible, along with a selection of LLMs including the Llama, Gemma, Mistral, and OLMo families. The authors also incorporate downstream tasks such as Belebele, m-MMLU, and m-ARC to establish benchmarks for evaluation.

Results and Findings

The results indicate a high average Pearson correlation of 0.90 between the proposed alignment scores and performance on established downstream tasks. This suggests that MEXA scores are reliable indicators of multilingual potential. Additionally, the analysis reveals clear distinctions in model performance, highlighting the stronger multilingual abilities of models like Gemma 2 and Llama 3.1-70B compared to others such as OLMo.
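As a worked example, the correlation between per-language alignment scores and downstream accuracies can be computed with the standard Pearson estimator. The numbers below are invented for illustration and are not results from the paper:

```python
import numpy as np

# Hypothetical per-language scores for five languages (not from the
# paper): MEXA-style alignment vs. accuracy on a downstream task.
mexa = np.array([0.95, 0.80, 0.60, 0.40, 0.20])
task = np.array([0.78, 0.70, 0.55, 0.41, 0.30])

# Pearson correlation coefficient between the two score vectors.
r = np.corrcoef(mexa, task)[0, 1]
print(round(r, 3))
```

A correlation close to 1 means alignment scores rank languages in nearly the same order as actual task performance, which is what licenses using MEXA as a cheap stand-in for benchmarks that do not exist in most languages.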

Analysis of Sentence Embeddings

Two main strategies for computing sentence embeddings in decoder-only models are explored: a weighted average over token hidden states and the last-token embedding. The paper finds that weighted average embeddings, combined with mean pooling of the per-layer alignment scores, consistently yield the most accurate alignment estimates across languages and model layers.
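The two pooling strategies can be sketched as follows. This is a minimal illustration with invented hidden states; the exact token-weighting scheme used in the paper may differ, and uniform weights reduce the weighted average to plain mean pooling:

```python
import numpy as np

def pooled_embeddings(hidden, weights=None):
    """Collapse per-token hidden states (seq_len, dim) into a single
    sentence embedding two ways: a weighted average over tokens
    (uniform weights = mean pooling) and the last-token state."""
    if weights is None:
        weights = np.ones(hidden.shape[0])
    weights = weights / weights.sum()      # normalise to sum to 1
    weighted_avg = weights @ hidden        # (dim,)
    last_token = hidden[-1]                # (dim,)
    return weighted_avg, last_token

# Toy hidden states for a 4-token sentence in a 3-dim "model".
h = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 1.]])
avg, last = pooled_embeddings(h)
print(avg)   # column-wise mean over the four tokens
print(last)  # hidden state of the final token
```

In a real setting, `hidden` would be a layer's activations from a decoder-only LLM for one sentence; the resulting vectors then feed into the alignment computation described above.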

Implications and Future Directions

This method provides substantial insight into the cross-lingual alignment capabilities within LLMs, shedding light on their structural multilingualism through a detailed examination of sentence embeddings. The insights gained from this paper are crucial for further model development and enhancement of multilingual performance, paving the way for more equitable language understanding across underrepresented languages.

Future research could expand on these findings by exploring additional language script combinations and probing deeper into the inner workings of LLM layers. The approach also highlights the need for developing more extensive and diverse multilingual benchmarks, potentially incorporating cultural and language-specific nuances to further refine the understanding of LLM capabilities.

Overall, this research contributes significantly to the field of multilingual NLP by providing a robust framework for evaluating and understanding the multilingual potential of English-centric LLMs through innovative cross-lingual alignment methods.
