IndicGenBench: A Comprehensive Benchmark for Evaluating LLMs on Multilingual Generation Tasks in 29 Indic Languages
Overview of IndicGenBench
IndicGenBench is a newly introduced benchmark designed to assess the generation capabilities of LLMs for 29 Indic languages spanning 13 scripts and four language families. The benchmark covers user-facing generation tasks, including machine translation, cross-lingual summarization, and question answering in both multilingual and cross-lingual settings.
Dataset Composition
IndicGenBench comprises five evaluation sets spanning four task types that are central to real-world LLM use (the sketch after this list illustrates each task's input/output shape):
- Cross-Lingual Summarization (CrossSum-In): Summarizing an article written in one language (English) in a target Indic language.
- Machine Translation (Flores-In): Translating both from and into English for each target Indic language.
- Multilingual Question Answering (XQuAD-In): Answering a question about a passage, with the passage, question, and answer all in the same Indic language.
- Cross-Lingual Question Answering (XorQA-In-Xx and XorQA-In-En): Answering a question asked in an Indic language from an evidence passage in English, with the answer given in the Indic language (XorQA-In-Xx) or in English (XorQA-In-En).
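To make each task's input/output contract concrete, the sketch below shows what a single evaluation instance might look like. The field names and sample values are illustrative assumptions, not the released schema.

```python
# Hypothetical single-instance structures for each IndicGenBench task.
# Field names are assumptions for illustration only.

crosssum_in_example = {
    "article": "An English news article ...",       # input: English article
    "summary": "हिंदी में सारांश ...",                 # output: summary in the Indic language
    "language": "hi",
}

flores_in_example = {
    "source": "A sentence in English.",             # en -> xx direction shown;
    "target": "अंग्रेज़ी में एक वाक्य।",                 # the xx -> en direction also exists
    "direction": "en-hi",
}

xquad_in_example = {
    "context": "हिंदी में एक अनुच्छेद ...",             # passage, question, and answer
    "question": "अनुच्छेद के बारे में प्रश्न?",           # all share the same language
    "answer": "अनुच्छेद से लिया गया उत्तर",
}

xorqa_in_example = {
    "question": "हिंदी में पूछा गया प्रश्न?",            # question in the Indic language
    "context": "An English evidence passage ...",   # evidence passage in English
    "answer_xx": "हिंदी में उत्तर",                     # XorQA-In-Xx target
    "answer_en": "The answer in English",           # XorQA-In-En target
}
```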
Each task comes with parallel evaluation data, and for several of the languages this is the first evaluation data available for any generation task.
Methodology
The datasets for IndicGenBench were created by extending existing datasets through professional human translation into underrepresented Indic languages, which keeps data quality consistent across languages. The languages are grouped into high-, medium-, and low-resource categories according to the amount of web text available for them, enabling meaningful comparison across diverse linguistic settings.
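As a rough illustration of how the released data might be consumed and then sliced by language, here is a minimal loading sketch. It assumes a JSON Lines layout with a per-record language field named "lang"; both the layout and the field names are assumptions to adapt to the actual public release.

```python
import json
from collections import defaultdict

def load_by_language(path):
    """Group JSONL records by their language code (assumed field: "lang")."""
    examples = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples[record["lang"]].append(record)
    return dict(examples)

# Hypothetical usage with an assumed file name:
# flores = load_by_language("flores_in_test.jsonl")
# print({lang: len(rows) for lang, rows in flores.items()})
```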
Performance Evaluation
The benchmark evaluates several state-of-the-art LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA. Models are tested in a one-shot setting, and performance is measured with task-appropriate metrics: character-level ChrF for translation and summarization, and Token F1 for question answering. Results show a significant performance gap between English and the Indic languages, with PaLM-2 models generally outperforming the others on most tasks.
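As a concrete reference for these metrics, the sketch below computes both: ChrF via the sacrebleu package, and a SQuAD-style whitespace-token F1. The token F1 here follows the standard definition and may differ in normalization details from the paper's exact implementation.

```python
from collections import Counter
from sacrebleu.metrics import CHRF

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# ChrF: character n-gram F-score, used for translation and summarization.
chrf = CHRF()
hypotheses = ["मॉडल द्वारा उत्पन्न अनुवाद"]
references = [["संदर्भ अनुवाद"]]  # one reference stream
print(chrf.corpus_score(hypotheses, references).score)

# Token F1: used for the question-answering tasks.
print(token_f1("उत्तर दिल्ली में है", "दिल्ली"))
```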
Implications and Future Directions
The evaluation shows a distinct performance drop for languages with lower resource availability, highlighting a pressing need for model improvements in these areas. The results also suggest that current LLMs are better at understanding low-resource languages than at generating fluent text in them. Future research could therefore focus on improving text generation in low-resource languages and on further exploring the impact of training data size and quality on model performance. Additionally, the variance in tokenizer efficiency across languages suggests room for improvement in how non-European scripts are tokenized; the sketch below illustrates the effect.
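One rough way to observe the tokenizer effect is to compare subword "fertility" (tokens per word) across languages. The sketch uses the Hugging Face transformers library with the English-centric GPT-2 tokenizer as an illustrative stand-in; the model choice and sample sentences are assumptions, not from the paper.

```python
from transformers import AutoTokenizer

# Compare tokens-per-word for an English and a Hindi sentence under an
# English-centric byte-level BPE tokenizer; Indic-script text typically
# fragments into many more pieces.
tok = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "en": "The weather is very pleasant today.",
    "hi": "आज मौसम बहुत सुहावना है।",
}

for lang, text in samples.items():
    n_tokens = len(tok.tokenize(text))
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens over {n_words} words "
          f"({n_tokens / n_words:.1f} tokens/word)")
```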
IndicGenBench provides a crucial platform for advancing LLM technology for Indic languages, with the potential to improve linguistic representation in digital communication for over a billion people. It could also serve as a template for building similar resources for other language groups, promoting broader inclusivity in language technology development.