IndicGenBench: A Comprehensive Benchmark for Evaluating LLMs on Multilingual Generation Tasks in 29 Indic Languages
Overview of IndicGenBench
IndicGenBench is a newly introduced benchmark designed to assess the generation capabilities of LLMs for 29 Indic languages spanning 13 scripts and four language families. The benchmark covers user-facing generation tasks, including machine translation, cross-lingual summarization, and question answering in both multilingual and cross-lingual settings.
Dataset Composition
IndicGenBench comprises five evaluation sets spanning four task types that are central to real-world LLM use (the sketch after this list illustrates each task's input/output shape):
- Cross-Lingual Summarization (CrossSum-In): Summarizing an article written in one language (English) in a target Indic language.
- Machine Translation (Flores-In): Translating both from and into English for each target Indic language.
- Multilingual Question Answering (XQuAD-In): Answering a question about a passage, with the passage, question, and answer all in the same Indic language.
- Cross-Lingual Question Answering (XorQA-In-Xx and XorQA-In-En): Answering a question asked in an Indic language from an evidence passage in English, with the answer given in the Indic language (XorQA-In-Xx) or in English (XorQA-In-En).
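To make each task's input/output contract concrete, the sketch below shows what a single evaluation instance might look like. The field names and sample values are illustrative assumptions, not the released schema.

```python
# Hypothetical single-instance structures for each IndicGenBench task.
# Field names are assumptions for illustration only.

crosssum_in_example = {
    "article": "An English news article ...",       # input: English article
    "summary": "हिंदी में सारांश ...",                 # output: summary in the Indic language
    "language": "hi",
}

flores_in_example = {
    "source": "A sentence in English.",             # en -> xx direction shown;
    "target": "अंग्रेज़ी में एक वाक्य।",                 # the xx -> en direction also exists
    "direction": "en-hi",
}

xquad_in_example = {
    "context": "हिंदी में एक अनुच्छेद ...",             # passage, question, and answer
    "question": "अनुच्छेद के बारे में प्रश्न?",           # all share the same language
    "answer": "अनुच्छेद से लिया गया उत्तर",
}

xorqa_in_example = {
    "question": "हिंदी में पूछा गया प्रश्न?",            # question in the Indic language
    "context": "An English evidence passage ...",   # evidence passage in English
    "answer_xx": "हिंदी में उत्तर",                     # XorQA-In-Xx target
    "answer_en": "The answer in English",           # XorQA-In-En target
}
```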
Each task comes with parallel evaluation data, and for several of the languages this is the first evaluation data available for any generation task.
Methodology
The datasets for IndicGenBench were created by extending existing datasets through professional human translation into underrepresented Indic languages, which keeps data quality consistent across languages. The languages are grouped into high-, medium-, and low-resource categories according to the amount of web text available for them, enabling meaningful comparison across diverse linguistic settings.
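As a rough illustration of how the released data might be consumed and then sliced by language, here is a minimal loading sketch. It assumes a JSON Lines layout with a per-record language field named "lang"; both the layout and the field names are assumptions to adapt to the actual public release.

```python
import json
from collections import defaultdict

def load_by_language(path):
    """Group JSONL records by their language code (assumed field: "lang")."""
    examples = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples[record["lang"]].append(record)
    return dict(examples)

# Hypothetical usage with an assumed file name:
# flores = load_by_language("flores_in_test.jsonl")
# print({lang: len(rows) for lang, rows in flores.items()})
```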
Performance Evaluation
The benchmark evaluates several state-of-the-art LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA. Models are tested in a one-shot setting, and performance is measured with task-appropriate metrics: character-level ChrF for translation and summarization, and Token F1 for question answering. Results show a significant performance gap between English and the Indic languages, with PaLM-2 models generally outperforming the others on most tasks.
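As a concrete reference for these metrics, the sketch below computes both: ChrF via the sacrebleu package, and a SQuAD-style whitespace-token F1. The token F1 here follows the standard definition and may differ in normalization details from the paper's exact implementation.

```python
from collections import Counter
from sacrebleu.metrics import CHRF

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# ChrF: character n-gram F-score, used for translation and summarization.
chrf = CHRF()
hypotheses = ["मॉडल द्वारा उत्पन्न अनुवाद"]
references = [["संदर्भ अनुवाद"]]  # one reference stream
print(chrf.corpus_score(hypotheses, references).score)

# Token F1: used for the question-answering tasks.
print(token_f1("उत्तर दिल्ली में है", "दिल्ली"))
```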
Implications and Future Directions
The evaluation shows a distinct performance drop for languages with lower resource availability, highlighting a pressing need for model improvements in these areas. The results also suggest that current LLMs are better at understanding low-resource languages than at generating fluent text in them. Future research could therefore focus on improving text generation in low-resource languages and on further exploring the impact of training data size and quality on model performance. Additionally, the variance in tokenizer efficiency across languages suggests room for improvement in how non-European scripts are tokenized; the sketch below illustrates the effect.
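One rough way to observe the tokenizer effect is to compare subword "fertility" (tokens per word) across languages. The sketch uses the Hugging Face transformers library with the English-centric GPT-2 tokenizer as an illustrative stand-in; the model choice and sample sentences are assumptions, not from the paper.

```python
from transformers import AutoTokenizer

# Compare tokens-per-word for an English and a Hindi sentence under an
# English-centric byte-level BPE tokenizer; Indic-script text typically
# fragments into many more pieces.
tok = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "en": "The weather is very pleasant today.",
    "hi": "आज मौसम बहुत सुहावना है।",
}

for lang, text in samples.items():
    n_tokens = len(tok.tokenize(text))
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens over {n_words} words "
          f"({n_tokens / n_words:.1f} tokens/word)")
```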
IndicGenBench provides a crucial platform for advancing LLM technology for Indic languages, with the potential to improve linguistic representation in digital communication for over a billion people. It could also serve as a template for building similar resources for other language groups, promoting broader inclusivity in language technology development.