
Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages

Published 28 Apr 2025 in cs.CL and cs.LG | (arXiv:2504.20022v1)

Abstract: Multilingual LLMs have demonstrated significant effectiveness across various languages, particularly in high-resource languages such as English. However, their performance in terms of factual accuracy across other low-resource languages, especially Indic languages, remains an area of investigation. In this study, we assess the factual accuracy of LLMs - GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B - by comparing their performance in English and Indic languages using the IndicQuest dataset, which contains question-answer pairs in English and 19 Indic languages. By asking the same questions in English and their respective Indic translations, we analyze whether the models are more reliable for regional context questions in Indic languages or when operating in English. Our findings reveal that LLMs often perform better in English, even for questions rooted in Indic contexts. Notably, we observe a higher tendency for hallucination in responses generated in low-resource Indic languages, highlighting challenges in the multilingual understanding capabilities of current LLMs.

Summary

  • The paper evaluates multilingual LLM factual accuracy using the IndicQuest dataset across 20 languages, comparing state-of-the-art models like GPT-4o and Gemma-2 across multiple domains and metrics.
  • Findings show significant factual accuracy disparities, with English outperforming all Indic languages, highlighting challenges like hallucinations in low-resource settings.
  • The study finds that larger models perform better, and calls for improved data enrichment and tailored optimization for low-resource languages to achieve equitable performance across languages.

Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages

The research paper "Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages" offers a detailed analysis of the performance of multilingual LLMs when tasked with generating factually accurate responses. The authors investigate the disparity between the efficacy of LLMs in high-resource languages like English and low-resource languages, such as various Indic languages. They employ the IndicQuest dataset across 20 linguistic settings, including English and 19 Indic languages, thereby providing a robust framework for evaluation.

The study leverages multiple state-of-the-art LLMs, namely GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B, analyzing their responses across multiple domains including geography, history, politics, economics, and literature. The assessment focuses on six key metrics: factual accuracy, relevance, clarity, language consistency, conciseness, and an overall performance score. This multi-faceted evaluation reveals stark contrasts in factual accuracy, particularly highlighting English as consistently superior across all metrics and domains.
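The overall score described above can be sketched as a simple aggregation of the five component metrics. This is an illustrative reconstruction, not the paper's actual scoring code: the metric names follow the summary, but the 0-5 scale, the equal weighting, and the example values are assumptions.

```python
# Hypothetical sketch of the six-metric evaluation: each response receives
# five component scores, and the overall score is taken here as their mean.
# Scale (0-5) and equal weighting are assumptions, not from the paper.

COMPONENTS = ["factual_accuracy", "relevance", "clarity",
              "language_consistency", "conciseness"]

def overall_score(metrics: dict) -> float:
    """Average the five component metrics into an overall performance score."""
    return sum(metrics[m] for m in COMPONENTS) / len(COMPONENTS)

# Example: made-up metric scores for a single model response
scores = {"factual_accuracy": 3.0, "relevance": 4.0, "clarity": 4.5,
          "language_consistency": 5.0, "conciseness": 4.0}
print(overall_score(scores))  # 4.1
```

In practice one would compute such a score per response, then average per language and per domain to produce the comparison tables the paper reports.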

The paper identifies significant challenges faced by LLMs in low-resource languages, illustrated by frequent hallucinations and inaccuracies in response generation. Notably, Hindi, Marathi, Maithili, and Nepali, among others, fall into a moderate performance band, while languages such as Odia and Urdu score at the extreme low end. These findings underscore the complexities of language modeling in less digitized linguistic landscapes.

A critical insight from the paper is the evident relationship between model size and performance. Larger models such as GPT-4o consistently achieve higher accuracy scores, supporting the hypothesis that increased model capacity facilitates better knowledge retention. Despite these advances, however, the study highlights a persistent performance gap in domains reliant on region-specific knowledge, such as history and geography. The economics and politics domains, by contrast, show relatively better accuracy, perhaps because they are better represented in training data.

The analysis further explores language consistency, uncovering a positive correlation with factual accuracy. Models exhibiting higher linguistic fluency generally achieve better grounding in factual content, suggesting areas for potential optimization. The paper also presents concrete examples of factual inaccuracies, reinforcing the need for more refined approaches to model training and evaluation in low-resource settings.
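The positive correlation between language consistency and factual accuracy can be checked with a standard Pearson correlation. The sketch below uses made-up per-language scores purely for illustration; the paper's actual values and correlation method are not reproduced here.

```python
# Illustrative Pearson correlation between per-language language-consistency
# and factual-accuracy scores. All values are placeholders, not paper data.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

consistency = [4.8, 3.9, 3.5, 2.7, 2.1]  # hypothetical per-language scores
accuracy    = [4.5, 3.6, 3.2, 2.5, 1.9]
print(round(pearson_r(consistency, accuracy), 3))
```

A strongly positive r on data like this would mirror the paper's observation that models grounded more fluently in a language also tend to answer more accurately in it.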

Implications from this research extend into practical and theoretical domains. Practically, it stresses the necessity for improved dataset enrichment and domain-specific optimization tailored to Indic languages. Theoretically, it calls for expanded methodologies and metrics to better encapsulate the nuances and demands of multilingual understanding.

In conclusion, the study offers substantial evidence of the challenges inherent in multilingual LLM development, especially for low-resource languages. It advocates targeted efforts to enhance data resources, model architectures, and evaluation frameworks, fostering advances in factual reliability across diverse linguistic settings. Future research should aim to bridge this gap, delivering equitable LLM performance irrespective of a language's resource availability.
