BanglaMMedBench: Dual Bengali Benchmarks
- BanglaMMedBench is a name shared by two distinct Bengali benchmarks: a translated multilingual LLM evaluation suite and a scenario-based biomedical MCQ dataset.
- Both are adapted from English benchmarks through machine translation (GPT-4o-mini in one case, Gemini-1.5-Flash in the other) followed by human review and cleanup.
- Evaluation findings indicate challenges with tokenization efficiency and response-format adherence, highlighting language-specific performance gaps relative to English.
BanglaMMedBench is a name used for two distinct Bangla benchmark constructions in recent arXiv literature. In one usage, it denotes a Bengali benchmark suite for evaluating multilingual LLMs on translated, task-diverse NLP datasets, with emphasis on English-to-Bengali transfer, response-format adherence, and tokenization effects (Bhowmik et al., 31 Jul 2025). In another usage, it denotes a 1,000-question Bangla biomedical multiple-choice benchmark derived from the English MMedBench and designed for scenario-based, clinically oriented reasoning and retrieval-augmented generation analysis (Sultana et al., 6 Nov 2025). The shared label reflects a common concern, standardized evaluation for an underrepresented language, but the two resources differ in domain, task design, and intended failure analysis. Precise citation by paper, rather than by benchmark name alone, is therefore necessary in technical discussion.
1. Terminological scope and disambiguation
In the current literature, “BanglaMMedBench” does not identify a single canonical dataset. One line of work uses the name for a translated Bengali evaluation suite spanning general reasoning and knowledge tasks; another uses it for a biomedical MCQ benchmark focused on clinical scenarios.
| Usage | Paper | Scope |
|---|---|---|
| BanglaMMedBench as multilingual Bengali benchmark suite | (Bhowmik et al., 31 Jul 2025) | Translated general LLM evaluation datasets for Bengali |
| BanglaMMedBench as biomedical MCQ benchmark | (Sultana et al., 6 Nov 2025) | Scenario-based biomedical reasoning in Bangla |
This terminological overlap is important because the two benchmarks support different research questions. The first is concerned with multilingual transfer from English to Bengali, benchmark construction, and failure modes such as excessive tokenization, response-format instability, and translation artifacts. The second is concerned with clinically oriented biomedical reasoning, retrieval strategy evaluation, and the extent to which Bangla medical question answering can benefit from zero-shot generation or retrieval augmentation.
A common misconception is that BanglaMMedBench is a single monolithic resource. That characterization is accurate for neither usage. In the multilingual-transfer paper, BanglaMMedBench is explicitly “not introduced as a single monolithic dataset, but as a benchmark collection built by translating major English LLM evaluation datasets into Bengali, cleaning them, and then using a subset of them for model evaluation” (Bhowmik et al., 31 Jul 2025). In the biomedical paper, BanglaMMedBench is one of two datasets in a broader Bangla biomedical QA study and is specifically paired with BanglaMedQA (Sultana et al., 6 Nov 2025).
2. BanglaMMedBench as a Bengali multilingual LLM benchmark suite
In the multilingual-transfer usage, BanglaMMedBench was created to address the absence of a standardized, high-quality Bengali evaluation benchmark comparable to English benchmarks such as MMLU, ARC, HellaSwag, and GSM8K (Bhowmik et al., 31 Jul 2025). The broader translated set comprises “twenty major LLM benchmark datasets,” while the evaluation experiments use eight cleaned Bengali datasets spanning four broad categories:
- Commonsense: HellaSwag, Winogrande, CommonsenseQA, BoolQ, OpenBookQA
- Science: ARC
- Math: GSM8K-Main
- Multidomain: MMLU
The benchmark is therefore task-diverse rather than domain-specific. It probes commonsense reasoning, science reasoning, math reasoning, multi-domain knowledge and reasoning, instruction following or generation-adjacent behavior in translated corpora, and response-format adherence. The answer formats include multiple-choice, binary QA, and exact-match style answers. This breadth is central to the benchmark’s design because the paper’s objective is not only to compare Bengali and English performance, but also to identify which capability classes are most fragile under translation and multilingual transfer.
The construction pipeline relies on machine translation plus human judgment. The authors used OpenAI GPT-4o-mini-2024-07-18 as the final translation model, after a blind review of outputs from Google Translate, Azure Translation Endpoint, and OpenAI GPT-4o-mini. Human reviewers evaluated translations without knowing which system produced them, and GPT-4o-mini was selected as the best overall translator. The subsequent cleanup process included fixing repeated or unnatural phrasing, reprocessing missing items from multithreading failures, correcting decoding and JSON parsing errors, retranslating incomplete outputs, and preserving labels, options, and answer keys. The reported total translation cost was about $200 (Bhowmik et al., 31 Jul 2025).
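The cleanup steps described above (catching JSON parsing errors, finding items lost to multithreading failures, and checking that labels, options, and answer keys survive translation) can be sketched as a validation pass. This is a minimal illustration, not the authors' code: the record schema (`question`, `options`, `answer`) and the function names are assumptions.

```python
import json

# Hypothetical record schema; the paper does not specify field names.
REQUIRED_FIELDS = ("question", "options", "answer")


def check_translation(source: dict, raw_output: str) -> list[str]:
    """Return the problems found in one translated record (empty if clean)."""
    try:
        translated = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(translated, dict):
        return ["output is not a JSON object"]

    problems = []
    for field in REQUIRED_FIELDS:
        if not translated.get(field):
            problems.append(f"missing or empty field: {field}")
    # Structure and labels must survive translation unchanged.
    options = translated.get("options")
    if isinstance(options, list) and len(options) != len(source["options"]):
        problems.append("option count changed")
    answer = translated.get("answer")
    if answer is not None and answer != source["answer"]:
        problems.append("answer key changed")
    return problems


def items_to_retranslate(sources: dict[str, dict], outputs: dict[str, str]) -> dict[str, list[str]]:
    """Map item id -> problems; ids absent from `outputs` count as missing."""
    report = {}
    for item_id, source in sources.items():
        if item_id not in outputs:
            report[item_id] = ["missing from translation output"]
            continue
        problems = check_translation(source, outputs[item_id])
        if problems:
            report[item_id] = problems
    return report
```

Items that appear in the report would be queued for retranslation or manual repair; clean items pass through untouched.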
The resulting benchmark is best understood as a standardized Bengali evaluation layer over major English benchmark families rather than as a newly authored native Bangla corpus. This suggests that its principal methodological contribution lies in reproducible cross-lingual benchmarking and error diagnosis, not in native data collection.
3. Evaluation methodology, tokenization analysis, and findings in the multilingual-transfer benchmark
The multilingual-transfer paper evaluates 10 recent open-source multilingual/instruct-tuned LLMs with no fine-tuning: LLaMA 3.1 8B, LLaMA 3.1 70B, LLaMA 3.2 3B, LLaMA 3.3 70B, Qwen 2.5 7B, Qwen 2.5 72B, Mistral 7B, Mistral Small 24B, DeepSeek-R1 14B, and DeepSeek-R1 70B (Bhowmik et al., 31 Jul 2025). The evaluation protocol combines strict exact matching, format-adherence analysis, and semantic judging. For closed-form tasks, the paper uses accuracy:
$$\mathrm{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$$
where $N_{\text{correct}}$ is the number of correctly answered items and $N_{\text{total}}$ is the number of evaluated items.
Because many tasks require specific output prefixes or labels, the paper also introduces Response Error Rate (RER), defined as the fraction of responses that do not begin with any valid answer prefix, and Response Adherence Rate (RAR), defined as its complement, $\mathrm{RAR} = 1 - \mathrm{RER}$. For cases where exact string matching is too strict, it adds an LLM-Judge based on GPT-style few-shot prompting to determine whether a model answer conveys the same meaning as the gold answer.
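The three closed-form metrics can be computed in a few lines. The sketch below assumes that a response counts as correct when it begins with the gold label and that RAR is the complement of RER; the prefix set and the function name `score` are illustrative, not taken from the paper.

```python
def score(responses: list[str], gold: list[str], valid_prefixes: tuple[str, ...]) -> dict[str, float]:
    """Compute accuracy, RER, and RAR for prefix-labelled answers."""
    n = len(responses)
    if n == 0 or n != len(gold):
        raise ValueError("responses and gold must be non-empty and equally long")

    cleaned = [r.strip() for r in responses]
    # RER: share of responses that do not start with any valid answer prefix.
    malformed = sum(not r.startswith(valid_prefixes) for r in cleaned)
    # Accuracy (assumed matching rule): response starts with the gold label.
    correct = sum(r.startswith(g) for r, g in zip(cleaned, gold))

    rer = malformed / n
    return {"accuracy": correct / n, "rer": rer, "rar": 1 - rer}


metrics = score(
    responses=["A) Dhaka", "B", "I think A", "C."],
    gold=["A", "B", "A", "D"],
    valid_prefixes=("A", "B", "C", "D"),
)
```

A plain prefix test is deliberately naive: a free-text reply such as "Also, ..." would pass as a valid "A" prefix, which is one reason a stricter pattern or a semantic judge may be needed in practice.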
The reported empirical pattern is consistent across almost all models and tasks: English performance is higher than Bengali, the gap is larger for smaller models, and family-level robustness is uneven. Mistral models underperform consistently across both languages, while DeepSeek models are more robust across languages. Earlier LLaMA models show larger English–Bengali drops, and Qwen 72B is strong overall but not uniformly stable in Bengali. At the task level, the language gap is most pronounced in math (GSM8K) and in commonsense reasoning (HellaSwag and OpenBookQA). The paper also highlights severe response-format instability in Mistral 7B, which exhibits extremely poor accuracy on many tasks and very high RER.
A central contribution of this benchmark is its tokenization study. The paper defines ATPR (average tokens per row), ATPW (average tokens per word), ABPT (average bytes per token), and ANSL (average normalized sequence length). The core findings are threefold: Bengali inputs produce substantially more tokens than English; datasets such as BoolQ and HellaSwag can exceed 1,000 tokens per row, especially in Bengali; and Bengali often requires 2–7 tokens per word, whereas English is consistently more token-efficient. The authors argue for an inverse relationship between excessive tokenization and model quality: higher tokens per row tend to correlate with lower scores, and more compact tokenization per word tends to be associated with better performance. This establishes tokenization efficiency as a benchmark variable rather than merely an implementation detail.
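To make the first three metrics concrete, the following sketch computes ATPR, ATPW, and ABPT for any tokenizer passed in as a function. It assumes corpus-level ratios (total tokens over total words, total bytes over total tokens) and whitespace-delimited words; the paper may aggregate differently. ANSL is omitted because its normalization baseline is not described here. The byte-level tokenizer is only a stand-in for a real model tokenizer.

```python
from typing import Callable, Sequence


def tokenization_stats(rows: list[str], tokenize: Callable[[str], Sequence]) -> dict[str, float]:
    """Corpus-level ATPR, ATPW, and ABPT for a given tokenizer function."""
    total_tokens = sum(len(tokenize(row)) for row in rows)
    total_words = sum(len(row.split()) for row in rows)
    total_bytes = sum(len(row.encode("utf-8")) for row in rows)
    return {
        "ATPR": total_tokens / len(rows),   # average tokens per row
        "ATPW": total_tokens / total_words,  # average tokens per word
        "ABPT": total_bytes / total_tokens,  # average bytes per token
    }


def byte_tokenizer(text: str) -> list[int]:
    """Stand-in tokenizer: one token per UTF-8 byte (not a real model tokenizer)."""
    return list(text.encode("utf-8"))


english = tokenization_stats(["good morning"], byte_tokenizer)
bengali = tokenization_stats(["শুভ সকাল"], byte_tokenizer)
```

Even this crude byte-level stand-in produces more tokens per word for the Bengali greeting than for the English one, because Bengali code points occupy three bytes each in UTF-8. Swapping in a model's actual tokenizer (for example, a function wrapping its `encode` method) gives the figures relevant to that model.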
The failure analysis separates translation-side and model-side problems. Translation cleanup addressed repetitive translations, missing entries, JSON decoding failures, malformed outputs such as missing commas or unclosed quotes, and incomplete translations with missing answer keys or options. Model-side errors include failure to follow the expected response format, brittle performance on Bengali prompts, poor answer-choice selection reliability, and degradation on math and commonsense reasoning. Taken together, these findings position BanglaMMedBench, in this usage, as both an evaluation suite and a diagnostic instrument for multilingual Bengali LLM behavior (Bhowmik et al., 31 Jul 2025).
4. BanglaMMedBench as a biomedical MCQ reasoning benchmark
In the biomedical usage, BanglaMMedBench is one of two Bangla biomedical multiple-choice QA datasets introduced alongside BanglaMedQA (Sultana et al., 6 Nov 2025). Its stated purpose is to evaluate scenario-based / clinically oriented biomedical reasoning in Bangla, rather than simple factual recall. The paper argues that existing Bangla medical exam questions are mostly factual recall, while English biomedical QA benchmarks more often test patient-specific, context-dependent reasoning. BanglaMMedBench was therefore created by extending the English MMedBench of Qiu et al. (2024) into Bangla.
The distinction between the two datasets in the paper is explicit. BanglaMedQA covers foundational, admission-test-style medical knowledge in Bangla. BanglaMMedBench covers more complex, situational, reasoning-heavy biomedical MCQs with rationales. Each BanglaMMedBench item includes a question, four options, a correct answer, and a rationale/explanation. The questions are described as situational and clinically oriented, inspired by USMLE-style reasoning in the original MMedBench.
The construction workflow begins with translation-model selection on a 50-question sample, manually reviewed by a medical expert. The compared systems were Google Translate, Mistral Saba, LLaMA 3 70B, Gemini, and ChatGPT. The reported qualitative findings were that Mistral Saba struggled with medical terminology, LLaMA 3 70B often misread options, Google Translate introduced spelling and syntactic problems, and Gemini and ChatGPT preserved terminology and grammar better. On the basis of expert review, quality, and efficiency, the authors selected Gemini-1.5-Flash for batch translation. Post-processing then removed incorrectly translated or incomplete items, fixed formatting issues in options and rationales, and verified alignment between translated answers, rationales, and the original source. Preprocessing further applied Unicode normalization, whitespace normalization, parsing of embedded options from question text, and enforcement of exactly four options per record (Sultana et al., 6 Nov 2025).
The final reported size is 1,000 questions. The paper describes BanglaMMedBench as a Bangla translation of 1,000 English MMedBench items, and the total across BanglaMedQA and BanglaMMedBench is 2,000 MCQs. A plausible implication is that the authors intended the pair to function as a two-level biomedical evaluation suite: one component anchored in foundational exam-style knowledge, the other in scenario-based clinical reasoning.
5. Retrieval setting and reported results on the biomedical benchmark
The biomedical BanglaMMedBench is evaluated primarily in two settings: zero-shot evaluation and web search with zero-shot fallback (Sultana et al., 6 Nov 2025). For each MCQ, the system constructs a structured prompt from the question and four options, the model predicts the correct option, rationales are generated when applicable, and each run is subject to a 300-second timeout. The paper reports accuracy as the primary quantitative metric for BanglaMMedBench:
$$\mathrm{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\%$$
where $N_{\text{correct}}$ is the number of questions answered correctly and $N_{\text{total}}$ is the number of questions evaluated.
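A per-question evaluation step of this kind might look like the sketch below. The prompt wording, the A to D labels, and the function names are assumptions for illustration; `model` stands for any callable that maps a prompt string to a reply string. Only the 300-second limit comes from the paper.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional

LABELS = ("A", "B", "C", "D")


def build_prompt(question: str, options: list[str]) -> str:
    """Assemble a structured MCQ prompt (wording is hypothetical)."""
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in zip(LABELS, options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)


def parse_choice(reply: str) -> Optional[str]:
    """Return the leading option letter of a reply, or None if absent."""
    reply = reply.strip().upper()
    return reply[0] if reply[:1] in LABELS else None


def answer_with_timeout(model: Callable[[str], str], prompt: str,
                        timeout_s: float = 300.0) -> Optional[str]:
    """Query the model; treat a run exceeding the timeout as unanswered."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(model, prompt)
    try:
        return parse_choice(future.result(timeout=timeout_s))
    except FutureTimeout:
        return None
    finally:
        # Does not kill a still-running call; it only stops waiting for it.
        pool.shutdown(wait=False)
```

Accuracy then follows by comparing each returned letter with the gold answer and counting `None` as incorrect, which is one reasonable convention rather than a documented one.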
Although the broader paper benchmarks Traditional RAG, Zero-Shot Fallback, Agentic RAG, Iterative Feedback RAG, and Aggregate k-values RAG, the direct BanglaMMedBench comparison table focuses on zero-shot versus web search. The reported BanglaMMedBench accuracies are:
| Setting | Model | Accuracy |
|---|---|---|
| Zero-shot | llama-3.3-70b-versatile | 62.08% |
| Zero-shot | openai/gpt-oss-120b | 90.59% |
| Web search | llama-3.3-70b-versatile | 60.00% |
| Web search | openai/gpt-oss-120b | 82.97% |
The same table reports English MMedBench comparison scores of 88.90% and 92.47% in zero-shot, and 83.86% and 89.90% in web search, for the same two models. The best reported BanglaMMedBench result is therefore 90.59% accuracy from openai/gpt-oss-120b in the zero-shot setting (Sultana et al., 6 Nov 2025).
These numbers support two conclusions stated by the paper. First, strong models can perform well on Bangla biomedical reasoning, but Bangla remains more difficult than English. Second, web search does not consistently improve BanglaMMedBench and, for the listed models, in fact reduces accuracy relative to zero-shot. The paper interprets this as evidence that scenario-based clinical questions are not easily solved by naive retrieval from the open web. Retrieval quality and source–task match matter more than retrieval by itself.
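The arithmetic behind both conclusions can be checked directly from the reported figures. The sketch below assumes that the English scores are listed in the same model order as the Bangla ones (llama-3.3-70b-versatile first, openai/gpt-oss-120b second); the dictionary keys are illustrative names.

```python
# Reported accuracies in percent. The pairing of English scores with models
# assumes the same listing order as the Bangla scores.
REPORTED = {
    "llama-3.3-70b-versatile": {"bn_zero": 62.08, "bn_web": 60.00, "en_zero": 88.90, "en_web": 83.86},
    "openai/gpt-oss-120b": {"bn_zero": 90.59, "bn_web": 82.97, "en_zero": 92.47, "en_web": 89.90},
}


def gaps(scores: dict[str, float]) -> dict[str, float]:
    """Percentage-point differences for one model."""
    return {
        # Negative values mean web search hurt Bangla accuracy.
        "web_minus_zero_bn": round(scores["bn_web"] - scores["bn_zero"], 2),
        # Positive values mean English is easier than Bangla.
        "en_minus_bn_zero": round(scores["en_zero"] - scores["bn_zero"], 2),
        "en_minus_bn_web": round(scores["en_web"] - scores["bn_web"], 2),
    }


summary = {model: gaps(scores) for model, scores in REPORTED.items()}
```

Web search lowers Bangla accuracy by 2.08 and 7.62 percentage points for the two models. Under the assumed pairing, the zero-shot English advantage is 26.82 points for the LLaMA model and 1.88 points for gpt-oss-120b.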
This point is methodologically important because the surrounding study is framed around RAG. The most advanced pipeline, Agentic RAG, dynamically chooses among local textbook retrieval, web retrieval, and zero-shot fallback according to a router policy with character-count thresholds for textbook retrieval and for web summaries. Yet the BanglaMMedBench comparison indicates that, for clinically oriented MCQs, strong parametric reasoning can outperform naive web retrieval. A common misconception is therefore that any retrieval augmentation will improve low-resource biomedical QA; the reported results do not support that generalization.
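A threshold-based router of this general shape can be sketched as follows. Everything beyond the three routes is an assumption: the thresholds are treated as minimum context lengths, textbook retrieval is tried before web retrieval, and the threshold values are left as required arguments because they are not given here.

```python
from typing import Optional


def route(textbook_context: Optional[str], web_summary: Optional[str],
          min_textbook_chars: int, min_web_chars: int) -> str:
    """Pick a route; thresholds are caller-supplied, hypothetical minimum lengths."""
    # Prefer local textbook retrieval when it returned enough text.
    if textbook_context and len(textbook_context) >= min_textbook_chars:
        return "textbook"
    # Otherwise use the web summary if it is long enough to be useful.
    if web_summary and len(web_summary) >= min_web_chars:
        return "web"
    # Fall back to answering from the model's parametric knowledge.
    return "zero_shot"
```

Length is only a proxy for usefulness, which is consistent with the observation that retrieved web text can be long yet poorly matched to a clinical scenario.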
6. Position within Bangla benchmarking research
BanglaMMedBench sits within a rapidly expanding but heterogeneous Bangla benchmarking landscape. BLUB, introduced with BanglaBERT, is the first unified Bangla Language Understanding Benchmark and covers sentiment classification, natural language inference, named entity recognition, and extractive question answering (Bhattacharjee et al., 2021). BenLLM-Eval is a separate zero-shot LLM evaluation suite covering 7 Bengali NLP tasks across 8 benchmark datasets, including summarization, QA, paraphrasing, NLI, transliteration, text classification, and sentiment analysis (Kabir et al., 2023). TituLLMs contributes a five-dataset Bangla benchmark suite oriented toward world knowledge, commonsense reasoning, and reading comprehension, including Bangla MMLU, BoolQ Bangla, CommonsenseQA Bangla, OpenBookQA Bangla, and PIQA Bangla (Nahin et al., 16 Feb 2025).
Within medical AI, BanglaMedVQA is distinct again: it is introduced as the first Bangla medical visual question answering benchmark, with 2,000 unique Bangla medical VQA instances, a 500-sample validation subset reviewed by two compensated medical specialists, an average acceptance rate of 97%, and inter-annotator agreement reported as Cohen’s $\kappa$ (Ahmed et al., 18 May 2026). That resource targets multimodal clinical image understanding, anatomical localization, and diagnostic reasoning, whereas BanglaMMedBench in (Sultana et al., 6 Nov 2025) targets text-based biomedical MCQ reasoning, and BanglaMMedBench in (Bhowmik et al., 31 Jul 2025) targets multilingual transfer across translated general NLP benchmarks.
The comparison clarifies what BanglaMMedBench contributes and what it does not. It is not the first Bangla benchmark in general, because BLUB and BenLLM-Eval predate it in broader Bangla NLP. It is also not the only Bangla medical benchmark, because BanglaMedVQA addresses medical visual QA. Rather, the name marks two specialized developments: one focused on general multilingual Bengali LLM evaluation through translated benchmark suites, and one focused on scenario-based biomedical MCQ reasoning and retrieval evaluation. This suggests that Bangla benchmark design is fragmenting productively by capability type—general NLU, zero-shot generative NLP, broad LLM reasoning, biomedical MCQ reasoning, and multimodal medical QA—while still sharing a common underlying concern with low-resource language evaluation.
A final technical implication follows from the two BanglaMMedBench usages taken together. Both emphasize that Bengali evaluation quality depends not only on model scale, but also on benchmark construction choices: translation fidelity, normalization, response-format constraints, and language-specific representational issues such as tokenization efficiency. In that sense, BanglaMMedBench is less a single benchmark artifact than a recurring design response to the same research problem: how to measure Bengali model capability in settings where English benchmarks and English-centric infrastructure are insufficient.