BLUB: Bangla Language Understanding Benchmark
- BLUB is a comprehensive benchmark that standardizes evaluation for factual, procedural, and reasoning tasks in Bangla NLP.
- It consolidates native and translated datasets from educational, competitive-exam, and web-sourced materials using manual curation, OCR, and Expressive Semantic Translation (EST) pipelines.
- Empirical analyses reveal clear performance differences between proprietary and open-source models, guiding future enhancements in Bangla language research.
The Bangla Language Understanding Benchmark (BLUB) is a comprehensive suite of datasets, tasks, and evaluation protocols constructed to rigorously measure the performance of LLMs in Bengali. BLUB was developed in response to the scarcity of structured, high-coverage benchmarks for Bengali, the seventh most spoken language globally. As LLMs proliferate across multilingual domains, BLUB plays a pivotal role in Bengali NLP research by systematizing evaluation of factual recall, procedural application, reasoning, world knowledge, commonsense inference, and reading comprehension. BLUB consolidates native and translated datasets, mirroring the rigor of English benchmarks such as MMLU, and serves as a public standard for zero-shot and few-shot benchmarking of both proprietary and open-source Bangla LLMs (Joy, 25 May 2025, Nahin et al., 16 Feb 2025, Bhattacharjee et al., 2021).
1. Dataset Composition and Taxonomy
BLUB unifies datasets across major cognitive and academic categories, constructed via both manual curation and translation pipelines. Three primary BLUB variants are documented:
- BLUB for MMLU-style Evaluation (Joy, 25 May 2025):
- Comprises 138,949 question-option pairs across 23 domains reflective of undergraduate curricula and competitive exams in Bangladesh.
- Domains are subdivided into four super-categories: STEM (Advanced Mathematics, Advanced Physics, Chemistry, Biology, ICT, etc.), Humanities (Bengali Language and Literature, Logic, Religion), Social Sciences (Economics, Accounting, Finance, Civics, etc.), and General Knowledge.
- Data sources include NCTB textbooks, web scraping, and OCR from physical exam materials.
- TituLLMs BLUB Suite (Nahin et al., 16 Feb 2025):
- Consists of five core datasets totaling roughly 124,000 examples across train, validation, and test splits (see the table below):
- Bangla MMLU (87,694 MCQs)—manually constructed world-knowledge questions.
- BoolQ BN—GPT-4-generated yes/no reading comprehension examples from Wikipedia and news.
- CommonsenseQA BN, OpenBookQA BN, PIQA BN—translations of English benchmarks using Expressive Semantic Translation (EST) pipelines.
- Table summarizing dataset scales and sources:
| Dataset | Method | Train | Validation | Test | Total |
|---|---|---|---|---|---|
| Bangla MMLU | Manual | - | 72,944 | 14,750 | 87,694 |
| BoolQ BN | GPT-4 generation | 815 | 432 | 729 | 1,976 |
| CommonsenseQA BN | EST translation | 9,741 | 1,221 | - | 10,962 |
| OpenBookQA BN | EST translation | 4,947 | 500 | 497 | 5,944 |
| PIQA BN | EST translation | 15,339 | 1,838 | - | 17,177 |

- BanglaBERT BLUB (Bhattacharjee et al., 2021):
- Four core NLP tasks: Sentiment Classification (SentNoB), Natural Language Inference (BNLI), Named Entity Recognition (MultiCoNER Bangla), and Question Answering (Bangla QA + TyDiQA Bangla).
- Datasets range from tens of thousands to hundreds of thousands of examples, constructed via native annotation and machine/human translation.
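To make the unified composition concrete, the following minimal sketch shows one way a BLUB-style multiple-choice record could be represented; the field names and example values are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative record layout for a BLUB-style multiple-choice item.
# Field names are assumptions for this sketch, not the released schema.
@dataclass
class BLUBItem:
    question: str                   # question text in Bangla
    options: List[str]              # answer choices (typically four)
    answer_index: int               # index of the gold option
    domain: str                     # e.g. "Chemistry", "Bengali Language and Literature"
    super_category: str             # "STEM", "Humanities", "Social Sciences", "General Knowledge"
    cognitive_label: Optional[str] = None  # "factual" | "procedural" | "reasoning" (test set only)

# Example instance (placeholder content, not a real benchmark item)
item = BLUBItem(
    question="...",
    options=["A", "B", "C", "D"],
    answer_index=2,
    domain="ICT",
    super_category="STEM",
    cognitive_label="factual",
)
```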
2. Cognitive Categorization and Annotation Protocols
BLUB advances beyond surface-level evaluation by explicitly labeling the cognitive demands of each question (Joy, 25 May 2025):
- Cognitive Categories:
- Factual Knowledge—direct lookup or recall.
- Procedural/Application—single-step calculation or algorithmic application.
- Reasoning—multi-step inference, logical deduction, scenario analysis.
Test set questions are triple-annotated by trained undergraduates using a "decision ladder": direct fact lookup ⇒ factual; direct procedural application ⇒ procedural; otherwise ⇒ reasoning. Inter-annotator agreement, measured with Fleiss' κ, is substantial, and the final category proportions on the test data are 60% factual, 18% procedural, and 22% reasoning.
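To make the agreement statistic concrete, the sketch below computes Fleiss' κ over decision-ladder labels from three annotators; the function and toy labels are illustrative, not the authors' released annotation tooling.

```python
from collections import Counter
from typing import List

CATEGORIES = ["factual", "procedural", "reasoning"]

def fleiss_kappa(annotations: List[List[str]]) -> float:
    """Fleiss' kappa for N items, each labeled by the same number of raters.

    annotations[i] holds the labels assigned to item i (here, three raters
    applying the decision ladder: lookup -> factual, direct application ->
    procedural, otherwise -> reasoning).
    """
    n_items = len(annotations)
    n_raters = len(annotations[0])
    # counts[i][j]: how many raters put item i into category j
    counts = [[Counter(item)[c] for c in CATEGORIES] for item in annotations]

    # Per-item agreement P_i and overall observed agreement P_bar
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items

    # Expected agreement from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(CATEGORIES))]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Toy usage: three raters, four items
print(fleiss_kappa([
    ["factual", "factual", "factual"],
    ["procedural", "procedural", "reasoning"],
    ["reasoning", "reasoning", "reasoning"],
    ["factual", "factual", "procedural"],
]))
```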
3. Evaluation Protocols and Metrics
All BLUB tasks are designed for standardized, zero-shot (and optionally few-shot) multiple-choice or classification evaluation:
- Prompting: A uniform system prompt requires models to output only the correct answer choice (a minimal prompt-construction-and-parsing sketch appears at the end of this section).
- Metric: Accuracy is the primary metric, defined as

  $$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[\hat{y}_i = y_i\right],$$

  where $N$ is the number of test examples, $\hat{y}_i$ the predicted class, and $y_i$ the gold label (a minimal scoring sketch follows this list).
- Additional Task Metrics (Bhattacharjee et al., 2021):
- Macro-F1 for sentiment classification, Micro-F1 for NER, Exact-Match (EM) and token-level F1 for span-based QA.
- Metric formulas strictly follow standard conventions from machine learning and NLP literature.
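For reference, the sketch below implements the scoring functions in their standard form: accuracy for multiple-choice items, plus exact match and token-level F1 for span-based QA. Whitespace tokenization of Bangla answer spans is a simplifying assumption.

```python
from collections import Counter
from typing import Sequence

def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of items where the predicted choice equals the gold label."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the predicted answer span matches the gold span exactly."""
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answer spans."""
    pred_toks, gold_toks = pred.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(accuracy(["ক", "খ", "গ"], ["ক", "খ", "ঘ"]))  # 0.666...
```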
Model selection spans proprietary LLMs (Gemini 2.0 Flash, GPT-4o, Claude 3.5 Haiku/Sonnet) and open-source releases (Llama 3.1/3.3, Gemma 2-9b/27b, BanglaBERT, TituLLM 1B/3B).
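The uniform zero-shot setup described at the start of this section can be sketched as prompt construction plus a parser that maps the model's reply back to an option index. The system-prompt wording, Bangla option labels, and parsing rule below are assumptions for illustration, not BLUB's released prompts.

```python
from typing import List

# Hypothetical zero-shot system prompt ("Write only the letter of the correct answer").
SYSTEM_PROMPT = "নিচের প্রশ্নের সঠিক উত্তরের অক্ষরটি (ক/খ/গ/ঘ) শুধুমাত্র লিখুন।"

OPTION_LABELS = ["ক", "খ", "গ", "ঘ"]  # assumed Bangla option labels

def build_prompt(question: str, options: List[str]) -> str:
    """Format a question and its labeled options as the user message."""
    lines = [question] + [f"{lab}. {opt}" for lab, opt in zip(OPTION_LABELS, options)]
    return "\n".join(lines)

def parse_choice(reply: str) -> int:
    """Map the model's reply to an option index; -1 counts as incorrect."""
    reply = reply.strip()
    for idx, lab in enumerate(OPTION_LABELS):
        if reply.startswith(lab):
            return idx
    return -1
```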
4. Empirical Findings and Analysis
Recent benchmarking reveals distinct performance strata between proprietary and open-source systems and across cognitive/subject domains (Joy, 25 May 2025, Nahin et al., 16 Feb 2025):
- Model Performance: Proprietary models lead, with Gemini 2.0 Flash reaching up to 0.758 overall accuracy; the strongest open-source model (Llama 3.3-70b) reaches 0.593.
- Cognitive Breakdown: Factual tasks are easiest (top model accuracy ~77%), procedural tasks moderately challenging, and reasoning tasks hardest (open-source models ~58–59%).
- Domain Performance: STEM is most tractable for proprietary models (Gemini 2.0 Flash: 78.9%), but open-source LLMs lag here. Humanities and General Knowledge domains exhibit narrower performance gaps.
- Error Trends: Error rates rise with question length, especially for smaller models (a minimal length-binning sketch follows this list).
- Subject Consistency: Certain domains (ICT, Religion & Moral Education) are consistently easy, while Advanced/General Mathematics are both difficult and inconsistent across models.
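The question-length effect noted in the error-trends bullet can be checked with a simple binning analysis, sketched below; the per-item record layout, bin edges, and column names are illustrative assumptions.

```python
import pandas as pd

# Illustrative per-item results: question text plus a correctness flag.
# Column names and bin edges are assumptions for this sketch.
results = pd.DataFrame({
    "question": ["...", "...", "..."],   # placeholder Bangla questions
    "correct":  [True, False, True],
})

results["length"] = results["question"].str.split().str.len()
results["length_bin"] = pd.cut(
    results["length"],
    bins=[0, 15, 30, 60, 10_000],
    labels=["short", "medium", "long", "very long"],
)

# Error rate per length bin; longer bins are expected to show higher error.
error_by_length = 1 - results.groupby("length_bin", observed=True)["correct"].mean()
print(error_by_length)
```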
TituLLM BLUB results indicate that an extended tokenizer and scaled training can improve commonsense QA performance, though the models plateau on world knowledge (Bangla MMLU) and reading comprehension (BoolQ BN), suggesting that tokenization and corpus size influence performance ceilings (Nahin et al., 16 Feb 2025).
5. Data Sources and Corpus Construction
BLUB's corpus quality is ensured via:
- OCR and manual annotation of exam materials, competitive exam guides, NCTB textbooks, and curated web content.
- Machine translation pipelines (EST) for English-to-Bangla translation.
- Human annotation for reading comprehension (BoolQ BN) and critical review for translated QA datasets.
- Deduplication, language filtering, and tokenization adapted to Bengali’s morphological richness and code-mixed text (a minimal deduplication-and-filtering sketch appears at the end of this section).
For pretraining, the "Bangla2B+" corpus aggregates 27.5 GB of Bangla text crawled from 110 websites, yielding 2.18 billion tokens for BanglaBERT and related models (Bhattacharjee et al., 2021).
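The sketch below shows the kind of deduplication and language-filtering pass described above, using exact hashing and a Bengali Unicode-range heuristic; the threshold and helper names are assumptions rather than the actual Bangla2B+ pipeline.

```python
import hashlib
import re
from typing import Iterable, Iterator

BENGALI_CHAR = re.compile(r"[\u0980-\u09FF]")  # Bengali Unicode block

def bengali_ratio(text: str) -> float:
    """Fraction of non-space characters drawn from the Bengali block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(BENGALI_CHAR.match(c)) for c in chars) / len(chars)

def dedup_and_filter(docs: Iterable[str], min_ratio: float = 0.6) -> Iterator[str]:
    """Drop exact duplicates and documents that are not predominantly Bangla.

    min_ratio is an illustrative threshold; it tolerates some code-mixed
    (Bangla-English) text rather than requiring pure Bangla.
    """
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(doc.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if bengali_ratio(doc) >= min_ratio:
            yield doc
```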
6. Model Architectures and Training Regimes
BLUB facilitates fair evaluation of Bengali-centric models:
- BanglaBERT (Bhattacharjee et al., 2021): ELECTRA-Base transformer, 12 layers, ~110M parameters, trained on "Bangla2B+" corpus with replaced-token detection and generator-discriminator objectives.
- TituLLMs (Nahin et al., 16 Feb 2025): 1B/3B-parameter LLMs trained on ~37B tokens, with an extended Llama-3.2 tokenizer for broader linguistic and cultural coverage (a minimal tokenizer-extension sketch appears at the end of this section).
- Benchmarking Protocols: All models are evaluated in zero- and few-shot setups; hyperparameters for ELECTRA and Llama-based models adhere to established standards (e.g., learning rate, batch size, warmup steps).
Efficiency analyses show that BanglaBERT surpasses larger multilingual models such as XLM-R (large) in sample efficiency while incurring lower computational cost.
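Vocabulary extension of the kind attributed to TituLLMs can be sketched with the Hugging Face transformers API: add new Bangla tokens and resize the embedding matrix before continued pretraining. The model ID and token list below are placeholders, and the released models may extend the vocabulary differently.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.2-1B"  # assumed base checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# New Bangla token candidates mined from a corpus (placeholder list).
new_tokens = ["বাংলাদেশ", "বিশ্ববিদ্যালয়", "সংবিধান"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new IDs have trainable vectors;
# continued pretraining on Bangla text then learns their representations.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```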
7. Recommendations and Future Expansion
BLUB authors propose several strategies for improving Bengali LLM performance:
- Expand pretraining datasets with domain-specific (STEM-focused) Bengali corpora to enhance factual and procedural coverage.
- Augment underrepresented domains with synthetic MCQs via back-translation or model-aided generation.
- Employ chain-of-thought prompting and contrastive fine-tuning to boost reasoning skills, drawing on their demonstrated efficacy in prior BEnQA and CMMLU studies (Joy, 25 May 2025).
- Future BLUB evolutions may incorporate open-ended QA, coreference, sentiment, dialogue, and adversarial splits to probe robustness and generalization (Nahin et al., 16 Feb 2025).
BLUB, with its open-source datasets, code, and leaderboards, serves as the nucleus for advancing authentic, high-quality Bengali LLM research and community evaluation.