Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages
The research paper "Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages" offers a detailed analysis of how reliably multilingual LLMs generate factually accurate responses. The authors investigate the disparity between LLM performance in high-resource languages such as English and in low-resource languages, here represented by the Indic languages. They employ the IndicQuest dataset across 20 linguistic settings, English plus 19 Indic languages, which provides a robust framework for evaluation.
The paper evaluates multiple state-of-the-art LLMs, namely GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B, analyzing their responses across several domains, including geography, history, politics, economics, and literature. The assessment focuses on six key metrics: factual accuracy, relevance, clarity, language consistency, conciseness, and an overall performance score. This multi-faceted evaluation reveals stark contrasts in factual accuracy, with English consistently scoring highest across all metrics and domains.
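To make this setup concrete, the sketch below shows one way a rubric-based scoring harness over these six metrics could be organized in Python. It is a minimal sketch under stated assumptions: the 1-to-5 scale, the `judge_response` stub, and all identifiers are hypothetical, and the paper's actual prompting and scoring pipeline is not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean

# The six evaluation dimensions described above; the names are our own.
METRICS = ("factual_accuracy", "relevance", "clarity",
           "language_consistency", "conciseness", "overall")

@dataclass
class Judgement:
    language: str
    domain: str
    scores: dict  # metric name -> score on an assumed 1-5 scale

def judge_response(question: str, reference: str, response: str) -> dict:
    """Placeholder for an LLM-as-judge call that would rate `response`
    against `reference` on each metric. Returns neutral scores so the
    sketch runs as-is; a real harness would prompt a judge model here."""
    return {m: 3.0 for m in METRICS}

def aggregate(judgements: list) -> dict:
    """Mean score per metric across all judged responses."""
    return {m: mean(j.scores[m] for j in judgements) for m in METRICS}

# Usage: score one (question, reference answer, model response) triple
# for a given language/domain cell, then aggregate.
j = Judgement(language="Hindi", domain="geography",
              scores=judge_response("…", "…", "…"))
print(aggregate([j]))
```

Grouping judgements by language and domain before calling `aggregate` would yield per-cell averages like those the paper tabulates; the flat aggregation shown here is only for brevity.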
The paper identifies significant challenges that LLMs face in low-resource languages, evidenced by frequent hallucinations and factual inaccuracies in generated responses. Notably, Hindi, Marathi, Maithili, Nepali, and several other languages fall into a moderate performance band, while languages such as Odia and Urdu score markedly lower. These findings underscore the complexity of language modeling in less digitized linguistic landscapes.
A critical insight from the paper is the evident relationship between model size and performance. Larger models such as GPT-4o consistently achieve higher accuracy scores, supporting the hypothesis that increased model capacity facilitates better knowledge retention. Despite these advances, however, the paper highlights a persistent performance gap in domains that rely on region-specific knowledge, such as history and geography. The economics and politics domains, conversely, exhibit relatively higher accuracy, perhaps because they are better represented in training data.
The analysis further explores language consistency, uncovering a positive correlation with factual accuracy: models that respond more fluently in the target language generally produce answers that are better grounded in fact, suggesting that improving target-language fluency is a promising direction for optimization. The paper also presents concrete examples of factual inaccuracies, reinforcing the need for more refined approaches to model training and evaluation in low-resource settings.
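As a minimal illustration of how such a consistency-accuracy relationship could be quantified, the snippet below computes Pearson's r over per-language mean scores. The numbers are placeholders invented for the example, not values reported in the paper.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Placeholder per-language mean scores (illustrative only, NOT the
# paper's reported numbers), with one entry per language.
language_consistency = [4.6, 3.8, 3.5, 3.1, 2.4]
factual_accuracy = [4.4, 3.6, 3.3, 2.9, 2.2]

# An r near +1 would mirror the positive relationship the paper
# reports between target-language fluency and factual grounding.
r = correlation(language_consistency, factual_accuracy)
print(f"Pearson r = {r:.3f}")
```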
The implications of this research extend into both practical and theoretical domains. Practically, it stresses the need for enriched datasets and domain-specific optimization tailored to Indic languages. Theoretically, it calls for expanded methodologies and metrics that better capture the nuances and demands of multilingual understanding.
In conclusion, the paper offers substantial evidence of the challenges inherent in multilingual LLM development, especially for low-resource languages. It advocates targeted efforts to improve data resources, model architectures, and evaluation frameworks, fostering advances in factual reliability across diverse languages. Future research should aim to close this gap, delivering equitable LLM performance irrespective of resource availability.