- The paper presents a novel evaluation method that leverages cross-lingual alignment through parallel sentences to assess the multilingual potential of English-centric LLMs with a high Pearson correlation of 0.90.
- It computes sentence representations as weighted averages of token embeddings and, combined with mean pooling, consistently yields precise alignment scores across diverse languages and model layers.
- Experimental results using datasets like FLORES-200 and the Bible validate the approach, highlighting superior multilingual performance in models such as Gemma 2 and Llama 3.1-70B.
Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
The paper presents a novel method for evaluating the multilingual capabilities of English-centric LLMs. By measuring cross-lingual alignment on parallel sentences, the approach fills the gap left by the lack of comprehensive multilingual performance assessments for such models.
Overview of the Method
The authors introduce a method that assesses the alignment between English and other languages within an LLM by comparing embeddings of parallel sentences. This alignment serves as a proxy for the model's multilingual understanding, enabling more accurate prediction of its performance across diverse languages.
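The alignment idea can be sketched as retrieval over parallel sentence embeddings. The paper's exact scoring may differ; the function name `alignment_score`, the mutual-nearest-neighbour criterion, and the toy data below are assumptions for illustration.

```python
import numpy as np

def alignment_score(en_emb, tgt_emb):
    """Fraction of parallel pairs whose embeddings are mutual nearest
    neighbours under cosine similarity (a retrieval-style proxy).

    en_emb, tgt_emb: (n_sentences, dim) arrays, row i of each being a
    translation pair.
    """
    # L2-normalise rows so dot products equal cosine similarities.
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = en @ tgt.T  # sim[i, j]: English sentence i vs target sentence j
    idx = np.arange(sim.shape[0])
    # Pair i counts as aligned when its true translation is the top match
    # in both retrieval directions.
    hits = (sim.argmax(axis=1) == idx) & (sim.argmax(axis=0) == idx)
    return float(hits.mean())
```

With well-aligned representations the score approaches 1.0; shuffling the target side destroys the pairing and drives it toward 0.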
Experimental Setup
The evaluation draws on parallel datasets such as FLORES-200 and the Bible, and covers LLMs from the Llama, Gemma, Mistral, and OLMo families. Downstream tasks (Belebele, m-MMLU, and m-ARC) serve as the benchmarks against which the alignment scores are validated.
Results and Findings
The proposed alignment scores achieve a high average Pearson correlation of 0.90 with established downstream tasks, indicating that they are reliable proxies for multilingual potential. The analysis also differentiates models, highlighting the advanced multilingual abilities of Gemma 2 and Llama 3.1-70B relative to others such as OLMo.
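The reported correlation can be reproduced in miniature: given per-language alignment scores and downstream accuracies, the Pearson correlation is a single `np.corrcoef` call. The numbers below are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical per-language scores: alignment score from the parallel-sentence
# method vs. accuracy on a downstream benchmark (illustrative values only).
alignment = np.array([0.95, 0.88, 0.72, 0.60, 0.41])
downstream = np.array([0.81, 0.76, 0.65, 0.55, 0.40])

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson r.
r = np.corrcoef(alignment, downstream)[0, 1]
```

A value of `r` near 1.0 means a language's alignment score closely tracks its downstream accuracy, which is the property the paper's 0.90 average correlation establishes.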
Analysis of Sentence Embeddings
The paper explores two main strategies for computing sentence embeddings: a weighted average of token embeddings and the last-token embedding. Weighted average embeddings, combined with mean pooling, consistently yield the most accurate alignment scores across languages and model layers.
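The two strategies can be sketched over a matrix of per-token hidden states. The paper's exact weighting scheme is not reproduced here; uniform weights (which reduce the weighted average to mean pooling over tokens) are an assumption, and both function names are hypothetical.

```python
import numpy as np

def last_token_embedding(token_embs):
    """Use the final token's hidden state as the sentence embedding.
    token_embs: (n_tokens, dim) array of per-token hidden states."""
    return token_embs[-1]

def weighted_average_embedding(token_embs, weights=None):
    """Weighted average of token hidden states. With uniform weights
    (the default) this is mean pooling over tokens; the paper's actual
    weighting scheme may differ."""
    n = token_embs.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else weights / weights.sum()
    return w @ token_embs  # (dim,) sentence embedding
```

In practice `token_embs` would be one layer's hidden states for one sentence; computing these per layer is what allows alignment scores to be compared across model depth.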
Implications and Future Directions
This method offers substantial insight into cross-lingual alignment within LLMs, shedding light on their internal multilingual structure through a detailed examination of sentence embeddings. These insights inform further model development and the improvement of multilingual performance, paving the way for more equitable language understanding across underrepresented languages.
Future research could expand on these findings by exploring additional language script combinations and probing deeper into the inner workings of LLM layers. The approach also highlights the need for developing more extensive and diverse multilingual benchmarks, potentially incorporating cultural and language-specific nuances to further refine the understanding of LLM capabilities.
Overall, this research contributes significantly to the field of multilingual NLP by providing a robust framework for evaluating and understanding the multilingual potential of English-centric LLMs through innovative cross-lingual alignment methods.