BengaliMoralBench: Ethics Benchmark for LLMs
- BengaliMoralBench is a large-scale ethics benchmark that evaluates LLMs using 3,000 culturally contextualized Bengali moral scenarios.
- It organizes assessments across five socio-cultural domains with balanced ethical and unethical instances to capture local moral norms.
- The evaluation employs a triadic framework—Virtue, Commonsense, and Justice—revealing performance gaps among models in zero-shot settings.
BengaliMoralBench is a large-scale ethics benchmark explicitly designed to evaluate and audit the moral reasoning capabilities of multilingual LLMs within the context of Bengali language and culture. Addressing a major gap in AI ethics research, where most benchmarks rely on English-language and Western-centric moral frameworks, BengaliMoralBench introduces culturally specific scenarios and ethical lenses to assess model alignment with the nuanced socio-cultural realities of over 285 million Bengali speakers (Ridoy et al., 5 Nov 2025).
1. Corpus Design and Scope
BengaliMoralBench comprises 3,000 curated single-sentence scenarios representing moral judgments, with a balanced dataset of 1,500 ethical and 1,500 unethical instances. The corpus is systematically structured across five core domains of Bengali socio-cultural life—Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities—each containing 10 culturally grounded subtopics, for a total of 50 subdomains.
Domain and Instance Distribution
| Domain | Subtopics | Instances per domain |
|---|---|---|
| Daily Activities | 10 | 600 |
| Habits | 10 | 600 |
| Parenting | 10 | 600 |
| Family Relationships | 10 | 600 |
| Religious Activities | 10 | 600 |
Each subtopic (e.g., Bazar run, Rickshaw commute etiquette, Right-vs-left-hand use, Dowry negotiations, Daily salat in workplace) is represented by 20 items (10 ethical, 10 unethical); the totals of 600 instances per domain and 1,000 items per lens imply that this count applies per ethical lens. The scenarios are chosen to be salient to Bengali communal, familial, and religious experience, giving broad coverage of everyday moral decision-making as shaped by regional customs and religious diversity.
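The corpus arithmetic can be checked with a short sketch. It assumes that the 20 items per subtopic are counted once per ethical lens (three lenses in total), which is an inference from the reported totals rather than something the source states outright; the constant and function names are illustrative only.

```python
# Sketch of the corpus layout arithmetic. The "per lens" reading of the
# 20 items per subtopic is an assumption inferred from the reported totals.
DOMAINS = [
    "Daily Activities",
    "Habits",
    "Parenting",
    "Family Relationships",
    "Religious Activities",
]
LENSES = ["Virtue", "Commonsense", "Justice"]
SUBTOPICS_PER_DOMAIN = 10
ITEMS_PER_SUBTOPIC_PER_LENS = 20  # 10 ethical + 10 unethical


def corpus_counts():
    """Return the instance counts implied by the layout constants above."""
    per_lens = len(DOMAINS) * SUBTOPICS_PER_DOMAIN * ITEMS_PER_SUBTOPIC_PER_LENS
    per_domain = SUBTOPICS_PER_DOMAIN * ITEMS_PER_SUBTOPIC_PER_LENS * len(LENSES)
    total = per_lens * len(LENSES)
    return {
        "subdomains": len(DOMAINS) * SUBTOPICS_PER_DOMAIN,
        "per_lens": per_lens,
        "per_domain": per_domain,
        "total": total,
        "ethical": total // 2,
        "unethical": total // 2,
    }


if __name__ == "__main__":
    for name, value in corpus_counts().items():
        print(f"{name}: {value}")
```

Under this reading the figures line up with the text: 50 subdomains, 1,000 items per lens, 600 instances per domain, and 3,000 scenarios split evenly between ethical and unethical.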
2. Annotation Methodology and Ethical Lenses
Annotation is conducted by 30 native-level Bengali speakers, each with at least 10 years of continuous residence in Bangladesh. Every scenario is labeled through consensus according to one of three distinct ethical frameworks—the triadic lens approach—adapted to the local context:
- Virtue Ethics: Assessment focused on moral character traits (satyata: honesty, daya: compassion, shraddha: respect), with annotation indicating whether the behavior expresses a culturally valued virtue.
- Commonsense Ethics: Labeling based on intuitive community norms ("samajik gyan"), emphasizing pragmatism, harmony, hospitality, and respect for hierarchical relationships.
- Justice Ethics: Application of fairness, equity, and rights (nyāy, samatā, adhikār), measuring impartiality and protection of vulnerable groups.
Inter-annotator agreement rose from a pilot kappa of 0.61 to a final kappa of 0.87 through iterative calibration workshops. Each annotator contributed 1,000 items per lens, with peer review and senior adjudication phases; 3.1% of ambiguous instances were excluded after adjudication.
3. Prompting and Evaluation Procedures
BengaliMoralBench employs a rigorous zero-shot evaluation protocol, leveraging unified prompt templates in Bengali with parallel English versions to standardize LLM input. Models are instructed to output a binary classification—'1' (ethical) or '0' (unethical)—for a single-sentence scenario grounded in one ethical lens. Experimental temperature values of 0.3 and 0.7 are used to probe generative consistency.
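A minimal sketch of the zero-shot protocol described above: build a lens-specific prompt, then map the model's free-text reply to a binary label. The prompt wording, the `LENS_HINTS` descriptions, and both function names are hypothetical and not the benchmark's actual templates; only the '1'/'0' output convention comes from the text.

```python
import re
from typing import Optional

# Hypothetical one-line lens descriptions; not the benchmark's own wording.
LENS_HINTS = {
    "Virtue": "does the act express honesty, compassion, or respect?",
    "Commonsense": "would the community intuitively accept the act?",
    "Justice": "is the act fair, and does it respect others' rights?",
}

# Models prompted in Bengali may answer with Bengali digits, so map
# Bengali zero and one to their ASCII equivalents before parsing.
_BENGALI_DIGITS = str.maketrans("০১", "01")


def build_prompt(scenario: str, lens: str) -> str:
    """Assemble an illustrative zero-shot prompt for one scenario and lens."""
    if lens not in LENS_HINTS:
        raise ValueError(f"unknown lens: {lens!r}")
    return (
        f"Judge the scenario under the {lens} ethics lens "
        f"({LENS_HINTS[lens]})\n"
        "Answer with a single digit: 1 if ethical, 0 if unethical.\n"
        f"Scenario: {scenario}\n"
        "Answer:"
    )


def parse_label(response: str) -> Optional[int]:
    """Return 1 or 0 for the first standalone 0/1 digit, else None."""
    text = response.translate(_BENGALI_DIGITS)
    # Lookarounds reject digits embedded in longer numbers such as "10".
    match = re.search(r"(?<!\d)[01](?!\d)", text)
    return int(match.group()) if match else None
```

Returning `None` for unparsable replies keeps refusals and off-format outputs visible, so an evaluation script can decide explicitly whether to count them as errors or exclude them.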
Evaluation Metrics
Performance is quantified using standard classification metrics:
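- Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- Precision, Recall, F1-score: $\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$, $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Cohen’s Kappa: $\kappa = \frac{p_o - p_e}{1 - p_e}$
where $p_o$ is observed agreement, $p_e$ is expected agreement by chance, and $TP$, $TN$, $FP$, $FN$ are the counts of true positives, true negatives, false positives, and false negatives.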
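These metrics can be computed from the four confusion-matrix counts, as in the sketch below. It assumes that 'ethical' (label 1) is the positive class and that unparsable model outputs have already been handled, neither of which the source specifies. It also includes the Matthews correlation coefficient (MCC), which the results section reports, using its standard definition. The function name is illustrative.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, Cohen's kappa, and MCC.

    Both inputs are equal-length sequences of 0/1 labels; label 1
    ('ethical') is treated as the positive class (an assumption).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # MCC: correlation between predicted and true labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }
```

On a balanced dataset such as this one, a constant or random predictor yields kappa and MCC near 0 while accuracy sits near 50%, which is why the chance-corrected scores are informative alongside accuracy.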
4. Model Performance and Quantitative Findings
BengaliMoralBench reveals wide variance in zero-shot performance across models, with accuracy ranging from 50% (the random baseline for this balanced binary task) to 91%. Larger and more recent multilingual LLMs outperform smaller ones:
- Gemma 2 (9B) achieves 91.2% accuracy and MCC 0.8242 on Commonsense, 80.36% and 0.6513 on Justice, and 89.7% and 0.7947 on Virtue; Cohen's κ ≈ 0.82 (Commonsense).
- Qwen 2.5 (14B) reaches 89.3% (Commonsense), 86.29% (Justice, MCC 0.7391), 89.4% (Virtue).
- Weaker models (Llama 3.2 1B, Gemma 3 1B) perform only marginally above chance (50–62% accuracy, MCC ≈ 0.03–0.29).
Domain-specific breakdown (Gemma 2 9B average):
| Domain | Accuracy (%) | F1-score (%) |
|---|---|---|
| Daily Activities | 90.6 | 89.8 |
| Family Relationships | 85.9 | 83.6 |
| Habits | 90.6 | 89.8 |
| Parenting | 90.2 | 89.6 |
| Religious Activities | 92.3 | 91.7 |
Gemma 2 9B demonstrates high temperature robustness (ΔF1 < 0.5), while Llama variants show greater output instability, particularly on Virtue items (up to ΔF1 ≈ 1.7).
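One way to quantify the temperature robustness discussed above is the spread in F1 between runs at the two sampling temperatures. The sketch below assumes that the reported F1 shift is the absolute difference in F1, in percentage points, between the runs at 0.3 and 0.7; the source does not define the quantity precisely, and the function names are illustrative.

```python
def f1_percent(y_true, y_pred):
    """F1 for the positive class (label 1), expressed in percent."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


def delta_f1(y_true, preds_by_temperature):
    """Return (spread, per-temperature F1) across sampling temperatures.

    `preds_by_temperature` maps a temperature (e.g. 0.3, 0.7) to the
    list of 0/1 predictions produced at that temperature. The spread is
    max F1 minus min F1, an assumed reading of the reported F1 shift.
    """
    if not preds_by_temperature:
        raise ValueError("need predictions for at least one temperature")
    scores = {
        temp: f1_percent(y_true, preds)
        for temp, preds in preds_by_temperature.items()
    }
    return max(scores.values()) - min(scores.values()), scores
```

For example, `delta_f1(labels, {0.3: run_a, 0.7: run_b})` returns the spread together with the per-temperature scores, so a small spread indicates that the model's judgments are stable under sampling noise.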
5. Qualitative Failure Analysis and Error Taxonomy
Systematic qualitative evaluation identifies persistent LLM shortcomings:
- Commonsense Reasoning: Models incorrectly label altruistic acts (e.g., a child enduring hardship for elders during Ramadan) as morally neutral.
- Justice: Models tend to validate gender-biased practices (e.g., prioritizing sons’ weddings) as ethical, consistent with inherited social hierarchies.
- Virtue: Models fail to recognize context-sensitive virtue signals, such as the cultural importance of removing shoes before entering a relative’s home.
- Religious Sensitivity: Models judge acts like Qurbani meat distribution superficially, without recognizing their charitable and spiritual connotations.
Underlying factors include reliance on Western-centric pretraining, overfitting to lexical cues, limited exposure to Bengali religious/ritual texts, propagation of entrenched social biases, and poor transfer of moral abstraction across domains. Proposed mitigations focus on culturally grounded pretraining, advanced moral prompting, bias counteraction, modular reasoning submodules, and human-in-the-loop oversight.
6. Societal Impact and Future Trajectories
BengaliMoralBench delivers a reproducible, native-speaker-validated framework for aligning LLMs with contextually appropriate Bengali moral norms, supporting ethical AI deployment in multilingual and low-resource settings. Strategic priorities for the research community include:
- Expanding pretraining with South Asian moral narratives from folklore, religious texts, and oral histories.
- Incorporating moral context (family roles, religio-social hierarchy) into prompt design.
- Data augmentation and fairness-aware fine-tuning to counteract cultural and gender biases.
- Advancing LLM architecture toward modular "Commonsense," "Justice," and "Virtue" submodules (Editor's term).
- Continuous, human-in-the-loop ethical monitoring to adapt benchmarks as societal values evolve.
- Benchmark expansion to West Bengal dialects, inclusion of workplace/civic ethics domains, and multi-label moral annotation schemes.
The benchmark establishes a comprehensive diagnostic tool for evaluating LLMs’ capacity for culturally nuanced moral reasoning, surfacing limitations of current models and enabling targeted improvement for ethically robust, regionally adapted AI systems (Ridoy et al., 5 Nov 2025).