MoReBench: AI Moral Reasoning Benchmark
- MoReBench is a specialized benchmark for evaluating language models’ moral reasoning with process-focused, expert-defined rubrics.
- The benchmark uses 1,000 scenarios annotated with over 23,000 weighted criteria to assess reasoning processes rather than just outcomes.
- It includes a theory-stratified subset evaluating five normative ethical frameworks to reveal model strengths and potential biases.
MoReBench is a specialized benchmark for evaluating the procedural and pluralistic moral reasoning of LLMs. Its central innovation is a shift from outcome-centric evaluation to process-focused rubrics that judge the reasoning trace produced by AI systems, dissecting how decisions are made rather than merely tallying correct answers. MoReBench comprises 1,000 diverse moral scenarios, each paired with a tailored set of atomic, expert-written rubric criteria capturing essential aspects of ethical deliberation, plus a theory-structured subset (“MoReBench-Theory”) for evaluating proficiency under major normative ethical frameworks.
1. Conceptual Motivation and Benchmark Formulation
MoReBench was developed in recognition of fundamental limitations in conventional benchmarks for LLMs, which largely emphasize outcome accuracy (as in math or code tasks) and neglect the transparency and pluralism inherent to moral decision-making. Moral dilemmas—distinct from mathematical queries—do not admit single objectively correct answers but require models to surface all relevant considerations, weigh complex trade-offs, and demonstrate reasoning aligned with diverse human values. By pairing each scenario with a unique set of rubric criteria, MoReBench enables process-focused, granular evaluation of how well intermediate reasoning traces accord with ethical standards.
MoReBench’s dataset is sourced from three principal domains: everyday dilemmas (Moral Advisor), safety-critical or autonomy-relevant scenarios (Moral Agent), and canonical exemplars from the philosophical ethics literature. Experts annotate each scenario with multiple criteria, spanning identification of morally relevant considerations, logical integration of competing factors, actionable recommendation, and harmlessness, so that all aspects deemed essential to sound moral reasoning are covered. Criteria are written to be objective and atomic, reducing subjectivity and facilitating automated evaluation via LLM-based judges.
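To make the annotation format concrete, the sketch below shows one plausible way to represent a scenario and its weighted, atomic criteria in Python. The class and field names, as well as the example content, are illustrative assumptions and not the released data schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricCriterion:
    """One atomic, expert-written criterion with an importance weight in [-3, +3]."""
    text: str    # e.g. "Identifies the conflict between honesty and loyalty"
    weight: int  # -3..+3; negative weights penalize undesirable content

@dataclass
class MoralScenario:
    """A single MoReBench-style instance: a dilemma plus its tailored rubric."""
    scenario_id: str
    source: str   # e.g. "moral_advisor", "moral_agent", or "philosophy"
    prompt: str   # the dilemma presented to the model
    criteria: List[RubricCriterion] = field(default_factory=list)

# Illustrative instance (invented content, not taken from the dataset)
example = MoralScenario(
    scenario_id="demo-001",
    source="moral_advisor",
    prompt="A friend asks whether to report a colleague's minor expense fraud.",
    criteria=[
        RubricCriterion("Identifies the competing duties of loyalty and honesty", 3),
        RubricCriterion("Recommends an actionable next step", 2),
        RubricCriterion("Encourages concealing the fraud", -3),
    ],
)
```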
2. Structure and Scoring Methodology
The full MoReBench dataset contains 1,000 scenarios and over 23,000 rubric criteria. For each instance, criteria are weighted (−3 to +3) to indicate their relative importance. The scoring function for a model’s response to scenario $s$ is given by

$$S_s = \frac{\sum_{i=1}^{n_s} w_i \, c_i}{\sum_{i=1}^{n_s} \max(w_i, 0)},$$

where $n_s$ is the number of criteria for scenario $s$, $w_i$ is the assigned weight, and $c_i \in \{0, 1\}$ is a binary indicator of criterion fulfillment. This enables a normalized, criterion-weighted score per scenario, promoting granular and differentiated evaluation.
To control for verbosity bias (where longer responses might trivially satisfy more criteria), MoReBench introduces a “Length-Controlled Score.” This metric normalizes the scenario score with respect to a reference response length of 1,000 characters, as detailed in Eq. (2) of the source paper, ensuring that scores reflect the efficiency and clarity of moral reasoning, not mere exhaustiveness.
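As a rough illustration of how such scores might be computed, consider the following Python sketch. The normalization by positive weight mass and the specific length penalty are assumptions made for exposition; the paper’s Eq. (2) is not reproduced here and may differ.

```python
from typing import List

def scenario_score(weights: List[int], fulfilled: List[bool]) -> float:
    """Criterion-weighted score for one scenario, normalized by the maximum
    attainable positive weight mass (assumed normalization)."""
    earned = sum(w for w, c in zip(weights, fulfilled) if c)
    max_positive = sum(w for w in weights if w > 0)
    return earned / max_positive if max_positive else 0.0

def length_controlled_score(score: float, response_chars: int,
                            reference_chars: int = 1000) -> float:
    """Illustrative length control: damp the score of responses that are much
    longer than the 1,000-character reference."""
    penalty = min(1.0, reference_chars / max(response_chars, 1))
    return score * penalty

# Example: criteria weighted +3, +2, -3; the model satisfies the first two
print(scenario_score([3, 2, -3], [True, True, False]))    # 1.0
print(length_controlled_score(1.0, response_chars=2500))  # 0.4
```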
Meta-evaluation includes stress tests of the rubric’s power to discriminate between differing LLM outputs, checks of robustness to LLM self-evaluation, and analysis of how reliably LLM-based judges assess criterion fulfillment.
3. MoReBench-Theory: Evaluation Across Ethical Frameworks
To probe the pluralistic capabilities of LLMs, MoReBench includes a dedicated subset, MoReBench-Theory, comprising 150 scenarios, each annotated with explicit guidance to apply one of five major normative ethical frameworks:
- Kantian Deontology
- Benthamite Act Utilitarianism
- Aristotelian Virtue Ethics
- Scanlonian Contractualism
- Gauthierian Contractarianism
Each scenario in this subset is annotated and scored on its alignment with the designated ethical theory, facilitating granular analysis of framework-specific reasoning. This reveals model strengths and weaknesses in adapting reasoning to diverse normative paradigms and exposes potential bias or overfitting towards frameworks prevalent in training or RLHF signals.
4. Major Findings and Technical Implications
Empirical results from MoReBench indicate that traditional predictors—such as scaling laws and metrics from math, code, or scientific reasoning tasks—fail to forecast LLM performance in procedural moral reasoning. Model scores vary widely: most excel at “harmless outcome” criteria (avoiding illegal or harmful recommendations), but comparatively underperform on “logical process” criteria that require systematic integration of competing moral factors.
On MoReBench-Theory, models are best at reasoning in utilitarian and deontological paradigms, with marked deficits in virtue ethics and contractarian reasoning—suggesting side effects from dominant training paradigms (notably RLHF with utilitarian/deontological alignment signals). This partiality underscores the importance of explicit pluralism and diagnostic evaluation for balanced model deployment in ethical domains.
A plausible implication is that LLMs, as currently trained, may inadvertently reinforce narrow normative assumptions unless process-focused pluralistic evaluation and alignment protocols are incorporated during development.
5. Rubric Design, Evaluation Dynamics, and Result Interpretation
Rubric construction is central to MoReBench’s methodology. Each scenario is assigned a tailored rubric covering all morally relevant aspects, with expert-defined weights and criteria written to avoid redundancy. The rubric’s atomicity allows each criterion to be fulfilled independently and assessed objectively by LLM-based judges.
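A minimal sketch of per-criterion judging is shown below, assuming one judge call per atomic criterion and a YES/NO verdict format. The prompt wording and the `call_judge` hook are hypothetical and do not reproduce the paper’s actual judging protocol.

```python
from typing import Callable, List

JUDGE_PROMPT = (
    "You are grading a model's moral-reasoning response.\n"
    "Criterion: {criterion}\n"
    "Response: {response}\n"
    "Answer strictly YES if the response satisfies the criterion, otherwise NO."
)

def judge_criteria(response: str, criteria: List[str],
                   call_judge: Callable[[str], str]) -> List[bool]:
    """Query a judge model once per atomic criterion and parse a binary verdict.
    `call_judge` is a placeholder for whatever LLM API is actually used."""
    verdicts = []
    for criterion in criteria:
        prompt = JUDGE_PROMPT.format(criterion=criterion, response=response)
        reply = call_judge(prompt).strip().upper()
        verdicts.append(reply.startswith("YES"))
    return verdicts
```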
MoReBench employs iterative meta-evaluation: the performance of LLM-based judges, the robustness of the rubric, and its discriminatory power under stress tests are all tracked. These layers of validation address common pitfalls in automated evaluation, such as verbosity bias or spurious criterion matching.
Score distributions across criteria, scenario types, and frameworks are visualized to identify systematic deficits, outlier cases, and domains where models fail to generalize. Weight normalization, length control, and fine-grained criterion scoring establish a nuanced benchmark landscape for diagnosing moral reasoning capacity, rather than merely tracking absolute performance.
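As a simple illustration of this kind of analysis, the snippet below aggregates hypothetical per-scenario scores by framework and criterion type. All column names and values are invented for exposition and do not come from the benchmark’s reported results.

```python
import pandas as pd

# Hypothetical per-scenario results; columns and values are illustrative only.
results = pd.DataFrame([
    {"model": "model_a", "framework": "Deontology",    "criterion_type": "logical_process",  "score": 0.62},
    {"model": "model_a", "framework": "Virtue Ethics", "criterion_type": "harmless_outcome", "score": 0.91},
    {"model": "model_b", "framework": "Deontology",    "criterion_type": "logical_process",  "score": 0.55},
])

# Mean score per model, broken down by framework and criterion type
summary = results.pivot_table(values="score",
                              index="model",
                              columns=["framework", "criterion_type"],
                              aggfunc="mean")
print(summary)
```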
6. Applications, Impact, and Prospective Directions
MoReBench opens a pathway for rigorous, transparent evaluation of the reasoning processes underpinning LLM-generated ethical advice or autonomous decision making. Its rubric-driven, pluralistic architecture supports finer analysis of alignment with human values and reveals domain-specific deficiencies that are opaque to traditional outcome-based benchmarks.
Applications span model diagnostics, safety-oriented alignment, curriculum design for RLHF or other training modalities, and the development of interpretability tooling for high-stakes deployments. In view of observed framework partiality, robust pluralistic evaluation and theory-diverse training paradigms are essential for avoiding monocultural drift in model ethical reasoning.
Proposed future directions include expanding rubric coverage and scenario complexity, refining length and verbosity normalization, incorporating alternative evaluation metrics, and continuing to develop process-focused scoring. The authors advocate for improved training and alignment methods, specifically tuned for moral reasoning, to better integrate diverse ethical considerations and produce models with comprehensive, balanced, and transparent moral deliberation capabilities.
7. Comparative Benchmarks and Unique Contributions
Unlike benchmarks for math, code, or general scientific reasoning, MoReBench is explicitly procedural and pluralistic in orientation. The framework incorporates criteria authored by domain experts, expressed as weighted, atomic rubric annotations, and pairs these with technical controls that ensure fair, efficient, and differentiated model assessment.
No other extant benchmark offers side-by-side evaluation of open- and closed-source models across 1,000 moral dilemmas and over 23,000 criteria, with stratified testing under five distinct ethical frameworks. This positions MoReBench as a pivotal resource for next-generation research in AI safety, moral reasoning, and normative alignment (Chiu et al., 18 Oct 2025).
In summary, MoReBench advances the field of moral reasoning evaluation for LLMs by prioritizing process-centric, pluralism-sensitive assessment. Its rubric-based framework, theory-stratified evaluation, and precise scoring architecture illuminate critical dimensions of model moral proficiency—establishing both methodological precedents and practical tools for the transparent, safe, and balanced deployment of AI in ethically consequential domains.