- The paper introduces the mLongRR dataset to benchmark retrieval and reasoning in multilingual LLMs across both high- and low-resource languages.
- It employs realistic tasks, 'needle in a haystack' retrieval and multiple-needle reasoning, that reveal performance declines as context length and task complexity increase.
- Results highlight the impact of tokenization efficiency and show that advanced models such as Gemini-1.5 and GPT-4o achieve comparatively robust outcomes despite inherent multilingual challenges.
Evaluating Multilingual Long-Context Models for Retrieval and Reasoning
This paper provides an empirical investigation into the performance of multilingual long-context LLMs on retrieval and reasoning tasks. The research addresses an existing gap in multilingual evaluation of LLMs, offering a comprehensive analysis built on a newly proposed dataset, mLongRR.
Contribution and Methodology
The paper introduces the mLongRR dataset, purpose-built to assess multilingual LLM performance. The dataset spans five languages (English, Vietnamese, Indonesian, Swahili, and Somali) drawn from different language families and resource levels, yet all written in the Latin script. This selection allows an examination of retrieval and reasoning disparities between high-resource languages (e.g., English) and low-resource languages (e.g., Somali). The framework serves as the experimental bed on which LLMs such as Gemini-1.5, GPT-4o, Claude-3, YaRN-Llama-2-7b, and Llama-3 are evaluated.
The research defines tasks that simulate retrieval and reasoning scenarios in which models must locate and manipulate textual information, mirroring realistic multilingual use cases. The retrieval task (a "needle in a haystack") and the more demanding reasoning task (multiple needles planted in the same context) are designed to stress the models' long-context capabilities, as sketched below.
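As a concrete illustration, the following minimal Python sketch shows how such prompts can be assembled. The filler text, needle sentences, and function name are hypothetical stand-ins for illustration only, not the paper's actual evaluation harness.

```python
import random

# Illustrative stand-in for a long distractor corpus in the target language.
HAYSTACK = ("Hali ya hewa ni nzuri leo. " * 500).strip()

def build_prompt(haystack: str, needles: list[str], question: str,
                 seed: int = 0) -> str:
    """Plant 'needle' sentences at random depths in a long distractor
    text, then append the question the model must answer."""
    rng = random.Random(seed)
    sentences = haystack.split(". ")
    for needle in needles:
        # Insert each needle at an arbitrary depth in the context window.
        sentences.insert(rng.randrange(len(sentences) + 1), needle)
    return ". ".join(sentences) + f"\n\nQuestion: {question}\nAnswer:"

# Single-needle retrieval: the model must find one planted fact.
retrieval = build_prompt(HAYSTACK,
                         ["The special magic number is 8231."],
                         "What is the special magic number?")

# Multi-needle reasoning: the model must combine several planted facts.
reasoning = build_prompt(HAYSTACK,
                         ["Amina's number is 12.", "Juma's number is 30."],
                         "What is the sum of Amina's and Juma's numbers?")
```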
Results and Analysis
Results from the evaluation highlight several insights and challenges confronting multilingual LLMs in long-context scenarios:
- Performance Decline with Complexity: As task complexity increases from single-needle retrieval to multi-needle reasoning, accuracy declines sharply for all models. The drop is particularly stark for reasoning tasks in low-resource languages, underscoring how much current LLM architectures struggle with multi-step reasoning and recall over extended contexts.
- Impact of Language Resource Level: Performance differs markedly across languages. LLMs perform best on English and Vietnamese, with significant declines on lower-resource languages such as Swahili and Somali. This disparity underscores a continuing challenge in NLP: the handling of low-resource languages.
- Tokenization Influence: Another discernible trend is the impact of tokenization efficiency on model performance. Languages tokenized more compactly (fewer tokens per word, so a fixed context window covers more content) typically showed better LLM performance, illustrating the critical role tokenizer design plays in multilingual LLM effectiveness; a short sketch of this measurement follows the list.
- Variation Among Models: Among the tested models, Gemini-1.5 and GPT-4o generally exhibit more robust performance across scenarios and languages, likely due to their advanced architectures and large context windows. However, even these models demonstrated notable limitations in multilingual reasoning capabilities as context length and complexity increased.
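To make the tokenization-rate point concrete, the sketch below compares tokens-per-word ratios across the five languages using an off-the-shelf BPE vocabulary (tiktoken's cl100k_base). The parallel sentences are rough, hypothetical translations chosen for illustration; the paper's exact tokenizers and metric may differ.

```python
# pip install tiktoken
import tiktoken

# Hypothetical parallel sentences (rough translations, for illustration only).
parallel = {
    "English":    "The weather is very nice today.",
    "Vietnamese": "Hôm nay thời tiết rất đẹp.",
    "Indonesian": "Cuaca hari ini sangat bagus.",
    "Swahili":    "Hali ya hewa ni nzuri sana leo.",
    "Somali":     "Cimiladu maanta aad bay u fiican tahay.",
}

enc = tiktoken.get_encoding("cl100k_base")  # BPE vocabulary used by GPT-4-class models

for lang, sentence in parallel.items():
    n_tokens = len(enc.encode(sentence))
    n_words = len(sentence.split())
    # More tokens per word means the text fragments into more subword
    # pieces, so it consumes the context window faster.
    print(f"{lang:>10}: {n_tokens:2d} tokens / {n_words} words "
          f"= {n_tokens / n_words:.2f} tokens per word")
```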
Implications and Future Directions
The implications of this paper are multifaceted. Practically, the observed limitations suggest that existing LLM technology may not yet be ready for complex multilingual tasks in low-resource settings without substantial customization or supplementary data. Theoretically, the work points to needed advances in model architecture, such as improved reasoning mechanisms and deeper contextual understanding, particularly for diverse and resource-scarce languages.
Future work stimulated by this research might include more sophisticated tokenization algorithms, dedicated large-scale multilingual training datasets, and architectural innovations that better handle long contexts and low-resource applications.
By outlining these findings, the paper positions itself as a timely exploratory step, signaling the dual need for broad multilingual benchmarking and for the continued refinement of LLMs so that they handle extensive contexts more gracefully and safely.