- The paper presents a novel benchmark creation process by translating English datasets into Bengali using the gpt-4o-mini model and rigorous post-processing.
- It evaluates 10 state-of-the-art LLMs on eight tasks, revealing significant performance gaps and tokenization inefficiencies in Bengali compared to English.
- Findings underscore the need for tailored multilingual evaluation metrics and optimized tokenization strategies for underrepresented languages.
Introduction
The paper under review assesses the multilingual abilities of LLMs with a specific focus on Bengali, a language that poses unique challenges due to its complex linguistic structure and the scarcity of computational resources. It highlights the absence of standardized evaluation benchmarks for Bengali NLP tasks, which impedes the development of robust LLMs for the language. Through its comprehensive evaluation, the paper aims to bridge these gaps by providing high-quality benchmark datasets, assessing tokenization behavior, and analyzing model performance.
Methodology
The authors adopted a methodical approach to translate existing English NLP benchmark datasets into Bengali using a specialized translation pipeline. This process involved dataset selection, translation using the gpt-4o-mini model, and an extensive post-processing phase to ensure quality and consistency.
Figure 1: Methodology Overview.
Dataset Selection and Translation
Dataset selection followed a rigorous process to ensure the inclusion of representative text corpora aligned with the research objectives, allowing a comprehensive assessment of model capabilities across multiple tasks. The chosen datasets cover tasks such as logical reasoning and commonsense understanding, providing a broad spectrum for analysis.
Translation Pipeline
Using the gpt-4o-mini model, the authors produced accurate translations of the English datasets into Bengali, overcoming challenges such as repetitive translations and decoding errors. The translation process, depicted in Figure 1, was further refined through human-assisted review to improve cultural and contextual appropriateness.
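The paper does not reproduce its exact prompts, but a minimal sketch of such a translation call, assuming the standard OpenAI Python client, a hypothetical system prompt, and a crude guard against repetitive output, might look like this:

```python
# Minimal sketch of an English-to-Bengali translation call with gpt-4o-mini.
# The system prompt and retry logic are illustrative assumptions, not the
# authors' exact pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Translate the user's text from English to Bengali. "
    "Preserve numbers, named entities, and answer-option labels exactly."
)

def translate_to_bengali(text: str, retries: int = 3) -> str:
    """Translate one dataset field, retrying on empty or repetitive output."""
    for _ in range(retries):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.0,  # deterministic output for reproducibility
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
            ],
        )
        translation = response.choices[0].message.content.strip()
        # crude guard against the repetitive-translation failure mode
        if translation and len(set(translation.split())) > 1:
            return translation
    raise RuntimeError("translation failed after retries")
```

In a pipeline like the one described, a call of this kind would be followed by the post-processing and human-assisted review steps to catch decoding errors and culturally inappropriate renderings.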
Experimental Setup
The research evaluated 10 state-of-the-art open-source LLMs, drawn from architecture families including LLaMA, Qwen, Mistral, and DeepSeek, on eight English-to-Bengali translated datasets.
Evaluation Metrics
Several key metrics were employed to assess model performance; an illustrative sketch of how they might be computed follows the list:
- Accuracy: the proportion of questions answered correctly.
- Response Error Rate (RER) and Response Adherence Rate (RAR): complementary metrics capturing whether model responses conform to a valid answer format (RAR counts adherent responses; RER counts those that violate the format).
- LLM-Judge: A novel metric utilizing an LLM-based judge to evaluate the semantic similarity of model answers to ground truth.
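As an illustration only, the snippet below shows how Accuracy, RAR, and RER might be computed for multiple-choice outputs; the A-D label set and the parsing regex are assumptions, not the paper's exact definitions:

```python
# Illustrative computation of Accuracy, RAR, and RER for multiple-choice
# answers; the label set and parsing regex are assumptions.
import re

VALID_LABELS = {"A", "B", "C", "D"}

def parse_answer(response: str) -> str | None:
    """Extract the first standalone option label, if any."""
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None

def score(responses: list[str], gold: list[str]) -> dict[str, float]:
    parsed = [parse_answer(r) for r in responses]
    adherent = [p for p in parsed if p in VALID_LABELS]
    correct = sum(p == g for p, g in zip(parsed, gold) if p is not None)
    n = len(responses)
    return {
        "accuracy": correct / n,        # correct answers over all items
        "RAR": len(adherent) / n,       # responses in a valid format
        "RER": 1 - len(adherent) / n,   # responses violating the format
    }

print(score(["Answer: B", "I think the answer is C.", "blah"], ["B", "C", "A"]))
# -> accuracy ~0.67, RAR ~0.67, RER ~0.33
```

An LLM-Judge score would instead send each (prediction, reference) pair to a judge model and ask whether the two are semantically equivalent, which tolerates paraphrased but correct answers that strict parsing would reject.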

Figure 2: Average of Accuracy, LLM-Judge, and RAR scores across datasets grouped by language.
Result Analysis
The analysis revealed consistent performance discrepancies between Bengali and English datasets, with models often underperforming in the former. Notably, smaller models and certain families like Mistral exhibited significant gaps, whereas models like DeepSeek demonstrated stability across languages.

Figure 3: LLM-Judge score distributions across different model architecture families.
Tokenization Analysis
The paper scrutinized the tokenization efficiency of the models, revealing that Bengali text incurs higher token counts per instance and word compared to English. This inefficiency negatively impacts model performance, drawing attention to the need for optimized tokenization strategies.
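As a rough illustration of this kind of analysis, the snippet below compares tokens per word for a parallel English/Bengali sentence pair using a Hugging Face tokenizer; the checkpoint and the example sentences are placeholders, not the paper's evaluation data:

```python
# Rough sketch of a tokens-per-word comparison between parallel English and
# Bengali sentences; the checkpoint and sentences are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

parallel = {
    "en": "The cat sat on the mat.",
    "bn": "বিড়ালটি মাদুরের উপর বসে ছিল।",
}

for lang, text in parallel.items():
    tokens = tokenizer.tokenize(text)
    words = text.split()
    print(f"{lang}: {len(tokens)} tokens over {len(words)} words "
          f"-> {len(tokens) / len(words):.2f} tokens/word")
# Tokenizers trained mostly on Latin-script text typically yield more tokens
# per word for Bengali, mirroring the inefficiency the paper reports.
```

Higher tokens-per-word ratios mean Bengali inputs consume more of the context window and more compute per instance, which is one plausible mechanism behind the performance gap.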

Figure 4: ABPT.
Conclusion
The research presents critical insights into the multilingual capabilities of LLMs, particularly for underrepresented languages like Bengali. The authors have successfully developed benchmark datasets and methodologies that pave the way for future research in this domain. The paper underscores the necessity of improved dataset quality and appraisal techniques tailored to multilingual contexts, with the aim of democratizing NLP technologies.
Future Directions
To build on this foundation, future research should prioritize the development of linguistically nuanced datasets and explore more flexible evaluation criteria. Additionally, advancing tokenization techniques specific to complex scripts like Bengali could enhance model efficiency and accuracy.
In summary, this paper provides a robust framework for evaluating multilingual LLMs and highlights the pathways for advancing NLP research in underrepresented languages. The comprehensive benchmarks and methodologies introduced serve as crucial stepping stones towards more inclusive and effective language technologies.