BengaliMoralBench: Ethics Benchmark for LLMs

Updated 5 April 2026

BengaliMoralBench is a large-scale ethics benchmark that evaluates LLMs using 3,000 culturally contextualized Bengali moral scenarios.
It organizes assessments across five socio-cultural domains with balanced ethical and unethical instances to capture local moral norms.
The evaluation employs a triadic framework—Virtue, Commonsense, and Justice—revealing performance gaps among models in zero-shot settings.

BengaliMoralBench is a large-scale ethics benchmark explicitly designed to evaluate and audit the moral reasoning capabilities of multilingual LLMs within the context of Bengali language and culture. Addressing a major gap in AI ethics research, where most benchmarks rely on English-language and Western-centric moral frameworks, BengaliMoralBench introduces culturally specific scenarios and ethical lenses to assess model alignment with the nuanced socio-cultural realities of over 285 million Bengali speakers (Ridoy et al., 5 Nov 2025).

1. Corpus Design and Scope

BengaliMoralBench comprises 3,000 curated single-sentence scenarios representing moral judgments, with a balanced dataset of 1,500 ethical and 1,500 unethical instances. The corpus is systematically structured across five core domains of Bengali socio-cultural life—Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities—each containing 10 culturally grounded subtopics, for a total of 50 subdomains.

Domain and Instance Distribution

Domain	Subtopics	Instances per domain
Daily Activities	10	600
Habits	10	600
Parenting	10	600
Family Relationships	10	600
Religious Activities	10	600

Each subtopic (e.g., bazar run, rickshaw commute etiquette, right-vs-left-hand use, dowry negotiations, daily salat in the workplace) is represented by 20 items (10 ethical, 10 unethical). Read as 20 items per ethical lens, this gives 60 items per subtopic across the three lenses, which matches the 600 instances per domain in the table above. The subtopics were chosen as salient to Bengali communal, familial, and religious experience, giving broad coverage of everyday moral decision-making as shaped by regional customs and religious diversity.
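The counts in this section can be cross-checked with a short sketch. It assumes that the 20 items per subtopic are counted per ethical lens (Virtue, Commonsense, Justice); that reading is an inference from the totals, not something the source states outright, and the names `build_slots` and `ITEMS_PER_LABEL` are hypothetical.

```python
from itertools import product

DOMAINS = ["Daily Activities", "Habits", "Parenting",
           "Family Relationships", "Religious Activities"]
LENSES = ["Virtue", "Commonsense", "Justice"]
SUBTOPICS_PER_DOMAIN = 10
ITEMS_PER_LABEL = 10  # 10 ethical + 10 unethical per subtopic (assumed: per lens)

def build_slots():
    """Enumerate one (domain, subtopic index, lens, label) slot per item."""
    slots = []
    combos = product(DOMAINS, range(SUBTOPICS_PER_DOMAIN), LENSES, (1, 0))
    for domain, subtopic, lens, label in combos:
        slots.extend([(domain, subtopic, lens, label)] * ITEMS_PER_LABEL)
    return slots

slots = build_slots()
total = len(slots)
per_domain = sum(1 for s in slots if s[0] == "Habits")
ethical = sum(1 for s in slots if s[3] == 1)
print(total, per_domain, ethical)  # 3000 600 1500
```

Under that assumption the layout reproduces the stated totals: 3,000 items, 600 per domain, and a 1,500/1,500 label split.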

2. Annotation Methodology and Ethical Lenses

Annotation is conducted by 30 native-level Bengali speakers, each with at least 10 years of continuous residence in Bangladesh. Every scenario is labeled through consensus according to one of three distinct ethical frameworks—the triadic lens approach—adapted to the local context:

Virtue Ethics: Assessment focused on moral character traits (satyata: honesty, daya: compassion, shraddha: respect), with annotation indicating whether the behavior expresses a culturally valued virtue.
Commonsense Ethics: Labeling based on intuitive community norms ("samajik gyan"), emphasizing pragmatism, harmony, hospitality, and respect for hierarchical relationships.
Justice Ethics: Application of fairness, equity, and rights (nyāy, samatā, adhikār), measuring impartiality and protection of vulnerable groups.

Inter-annotator agreement started at a pilot kappa of 0.61 and rose to a final kappa of 0.87 after iterative calibration workshops. Annotators contributed 1,000 items per lens (3,000 in total), followed by peer review and senior adjudication phases; 3.1% of ambiguous instances were excluded after adjudication.

3. Prompting and Evaluation Procedures

BengaliMoralBench employs a rigorous zero-shot evaluation protocol, leveraging unified prompt templates in Bengali with parallel English versions to standardize LLM input. Models are instructed to output a binary classification—'1' (ethical) or '0' (unethical)—for a single-sentence scenario grounded in one ethical lens. Experimental temperature values of 0.3 and 0.7 are used to probe generative consistency.
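A minimal sketch of the prompt-and-parse loop described above. The template wording is hypothetical (the benchmark's actual Bengali and English templates are not reproduced here), and `build_prompt` and `parse_label` are illustrative names rather than part of any released tooling.

```python
import re

# Hypothetical English-side template; the benchmark's real wording may differ.
TEMPLATE = (
    "You are judging a scenario through the lens of {lens} ethics.\n"
    "Scenario: {scenario}\n"
    "Answer with a single digit: 1 if the action is ethical, 0 if it is unethical."
)

# Map Bengali digits zero and one to ASCII so replies in either script parse.
BENGALI_TO_ASCII = str.maketrans("০১", "01")

def build_prompt(scenario: str, lens: str) -> str:
    """Fill the zero-shot template for one scenario and one ethical lens."""
    return TEMPLATE.format(lens=lens, scenario=scenario.strip())

def parse_label(reply: str):
    """Return 1 or 0 from a model reply, or None if no standalone 0/1 is found."""
    normalized = reply.translate(BENGALI_TO_ASCII)
    match = re.search(r"(?<!\d)[01](?!\d)", normalized)
    return int(match.group()) if match else None
```

Returning `None` for unparseable replies keeps refusals and free-text answers visible instead of silently counting them as one class; how the benchmark itself treats such replies is not stated in this summary.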

Evaluation Metrics

Performance is quantified using standard classification metrics:

Accuracy:

$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat y_i = y_i]$

Precision, Recall, F1-score:

$\mathrm{Precision} = \frac{TP}{TP+FP} \quad \mathrm{Recall} = \frac{TP}{TP+FN}$

$F_1 = 2\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$

Matthews Correlation Coefficient (MCC):

$\mathrm{MCC} = \frac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

Cohen’s Kappa:

$\kappa = \frac{p_o - p_e}{1-p_e}$

where $p_o$ is observed agreement and $p_e$ is expected agreement by chance.
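The formulas above can be computed directly from the four confusion-matrix counts. The sketch below is one plain-Python way to do so, treating '1' (ethical) as the positive class; returning 0.0 when a denominator is zero is a convention chosen here, not one taken from the source.

```python
from math import sqrt

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, MCC and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Chance agreement: product of the marginals for each class, summed.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mcc": mcc, "kappa": kappa}

# Toy example: 4 ethical and 4 unethical items, one error in each class.
scores = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
print(scores["accuracy"], scores["mcc"], scores["kappa"])  # 0.75 0.5 0.5
```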

4. Model Performance and Quantitative Findings

Zero-shot accuracy varies widely across models, from about 50% (the chance level on this balanced binary task) to about 91%. The larger multilingual LLMs clearly outperform the 1B-parameter models:

Gemma 2 (9B) achieves 91.2% accuracy and MCC 0.8242 on Commonsense, 80.36% and 0.6513 on Justice, 89.7% and 0.7947 on Virtue; Cohen's $\kappa$ ≈ 0.82 (Commonsense).
Qwen 2.5 (14B) reaches 89.3% (Commonsense), 86.29% (Justice, MCC 0.7391), 89.4% (Virtue).
Weaker models (Llama 3.2 1B, Gemma 3 1B) perform only marginally above chance (50–62% accuracy, MCC ≈ 0.03–0.29).
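As a consistency check on the reported agreement figure, assume each lens subset is label-balanced like the full corpus (the source states the 1,500/1,500 split only for the whole dataset) and read $\kappa$ as agreement between model predictions and gold labels. If half of the gold labels are ethical and the model predicts "ethical" at any rate $q$, chance agreement is

$p_e = \tfrac{1}{2}\,q + \tfrac{1}{2}\,(1-q) = \tfrac{1}{2}$

so Cohen's kappa reduces to a rescaled accuracy:

$\kappa = \frac{p_o - \tfrac{1}{2}}{1 - \tfrac{1}{2}} = 2\,\mathrm{Accuracy} - 1$

For Gemma 2 (9B) on Commonsense this gives $2 \times 0.912 - 1 = 0.824$, in line with the reported $\kappa \approx 0.82$. When the predictions are themselves close to balanced, MCC takes the same value, which fits the reported MCC of 0.8242.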

Domain-specific breakdown (Gemma 2 9B average):

Domain	Accuracy (%)	F1-score (%)
Daily Activities	90.6	89.8
Family Relationships	85.9	83.6
Habits	90.6	89.8
Parenting	90.2	89.6
Religious Activities	92.3	91.7

Gemma 2 9B demonstrates high temperature robustness ($\sigma(F_1) < 0.5$), while Llama variants show greater output instability, particularly on Virtue items (up to $\sigma(F_1) \approx 1.7$).
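One way to quantify this kind of robustness is the standard deviation of F1 across temperature settings. The sketch below assumes a population standard deviation over per-temperature F1 scores in percentage points; the source does not say exactly how $\sigma$ was computed, and the scores used are illustrative placeholders, not benchmark results.

```python
from statistics import pstdev

def f1_spread(f1_by_temperature):
    """Population standard deviation of F1 (percentage points) across temperatures."""
    return pstdev(f1_by_temperature.values())

# Illustrative placeholder scores only; these are NOT results from the benchmark.
stable_model = {0.3: 90.0, 0.7: 90.6}
unstable_model = {0.3: 70.0, 0.7: 73.4}

print(round(f1_spread(stable_model), 2))    # 0.3
print(round(f1_spread(unstable_model), 2))  # 1.7
```

With only two temperature settings the spread is simply half the gap between the two scores, so repeated runs per temperature would give a more informative estimate.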

5. Qualitative Failure Analysis and Error Taxonomy

Systematic qualitative evaluation identifies persistent LLM shortcomings:

Commonsense Reasoning: Models incorrectly label altruistic acts (e.g., a child enduring hardship for elders during Ramadan) as morally neutral.
Justice: Models tend to validate gender-biased practices (e.g., prioritizing sons’ weddings) as ethical, reproducing inherited social hierarchies.
Virtue: Models fail to map context-sensitive virtue signals, such as the cultural importance of removing shoes before entering a relative’s home.
Religious Sensitivity: Models make superficial judgments of acts like Qurbani meat distribution, missing their charitable and spiritual connotations.

Underlying factors include reliance on Western-centric pretraining, overfitting to lexical cues, limited exposure to Bengali religious/ritual texts, propagation of entrenched social biases, and poor transfer of moral abstraction across domains. Proposed mitigations focus on culturally grounded pretraining, advanced moral prompting, bias counteraction, modular reasoning submodules, and human-in-the-loop oversight.

6. Societal Impact and Future Trajectories

BengaliMoralBench delivers a reproducible, native-speaker-validated framework for aligning LLMs with contextually appropriate Bengali moral norms, supporting ethical AI deployment in multilingual and low-resource settings. Strategic priorities for the research community include:

Expanding pretraining with South Asian moral narratives from folklore, religious texts, and oral histories.
Incorporating moral context (family roles, religio-social hierarchy) into prompt design.
Data augmentation and fairness-aware fine-tuning to counteract cultural and gender biases.
Advancing LLM architecture toward modular "Commonsense," "Justice," and "Virtue" submodules (Editor's term).
Continuous, human-in-the-loop ethical monitoring to adapt benchmarks as societal values evolve.
Benchmark expansion to West Bengal dialects, inclusion of workplace/civic ethics domains, and multi-label moral annotation schemes.

The benchmark establishes a comprehensive diagnostic tool for evaluating LLMs’ capacity for culturally nuanced moral reasoning, surfacing limitations of current models and enabling targeted improvement for ethically robust, regionally adapted AI systems (Ridoy et al., 5 Nov 2025).

References

Ridoy et al. (2025). BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture.