Moral Dilemma Dataset (MDD) Overview
- MDD is a distributional, pluralism-oriented benchmark of 1,618 authentic moral dilemmas annotated with diverse, real-world human judgments.
- Its methodology decontextualizes Reddit scenarios, pairs binary acceptability labels with free-text rationales, and maps rationales onto a 60-value taxonomy.
- Results indicate that LLMs often default to mode-seeking and narrow value repertoires, highlighting gaps in capturing human ethical pluralism.
The Moral Dilemma Dataset (MDD) is a distributional, pluralism-oriented benchmark designed to evaluate the alignment between LLMs and the spectrum of human moral judgment in authentic, context-rich scenarios. Unlike prior datasets focused on stylized, synthetic dilemmas or single-label crowd judgments, MDD combines real-world moral ambiguity, granular distributional human annotations, and a rigorously engineered value taxonomy to probe both the outcomes and rationales of LLM moral reasoning.
1. Composition, Source, and Construction
The MDD comprises 1,618 true-to-life moral dilemmas sourced from the r/AmITheAsshole (AITA) subreddit over a six-month period, representing diverse domains such as family conflict, workplace ethics, care obligations, and interpersonal disputes. Each scenario is fully decontextualized—rewritten (by GPT-4o-mini) to remove distinguishing AITA formatting or signals—thereby capturing an ecologically valid and hard-to-spoof distribution of real dilemmas.
For each scenario, both a succinct title and a detailed narrative body are provided, along with manually assigned topic labels (family, work, health, etc.) and available demographic cues. Scenarios were selected for inclusion only if they could not be trivially identified as originating from AITA, ensuring general-purpose utility.
2. Human Annotation and Judgment Distribution
Each dilemma is annotated with the complete set of direct community moral judgments: 51,776 binary human evaluations (Acceptable/Unacceptable) and their accompanying free-text rationales, extracted from Reddit comments. To enforce comparability, only comments expressing an explicit binary verdict ("NTA" or "YTA") were retained; ambiguous or non-binary replies (e.g., ESH, NAH, INFO) were excluded.
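A minimal sketch of this filtering step, using an illustrative regex and label mapping rather than the paper's exact extraction rules:

```python
import re

# Binary verdict tags kept for MDD labels; ambiguous tags are dropped.
BINARY = {"NTA": 1, "YTA": 0}  # 1 = Acceptable, 0 = Unacceptable

def extract_label(comment: str) -> int | None:
    """Return a binary label if the comment opens with an explicit verdict tag."""
    m = re.match(r"\s*(NTA|YTA|ESH|NAH|INFO)\b", comment, flags=re.IGNORECASE)
    if m is None:
        return None
    return BINARY.get(m.group(1).upper())  # None for ESH/NAH/INFO (excluded)

print(extract_label("NTA, you set a clear boundary."))    # -> 1
print(extract_label("ESH, everyone behaved badly here."))  # -> None (excluded)
```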
Unlike prior “consensus-first” datasets, MDD preserves the entire pluralistic distribution of judgments for every scenario, not just the modal outcome:
| Dilemma | # Human Votes | % Acceptable | % Unacceptable | Consensus Score |
|---|---|---|---|---|
| Example A | 90 | 0.74 | 0.26 | 0.74 |
| Example B | 18 | 0.56 | 0.44 | 0.56 |
Consensus score (modal class proportion, 0.5 = maximal disagreement, 1.0 = unanimous) is recorded per dilemma. This allows for detailed stratification of “clean-cut” vs. “contentious” cases during model evaluation.
Each human judgment consists of:
- A binary label: 1 (Acceptable) or 0 (Unacceptable).
- A free-text rationale (mean length: 28.9 tokens, stdev: 11.7), providing the explicit moral reasoning behind the choice.
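As a concrete illustration of the quantities above, here is a minimal sketch (hypothetical helper, not from the paper) that computes the judgment distribution and consensus score for one dilemma:

```python
from collections import Counter

def summarize_judgments(labels: list[int]) -> dict:
    """Summarize one dilemma's binary judgments (1 = Acceptable, 0 = Unacceptable)."""
    n = len(labels)
    p_acceptable = Counter(labels)[1] / n
    # Consensus score = modal class proportion (0.5 = maximal disagreement, 1.0 = unanimous).
    consensus = max(p_acceptable, 1.0 - p_acceptable)
    return {"n_votes": n, "p_acceptable": p_acceptable,
            "p_unacceptable": 1.0 - p_acceptable, "consensus": consensus}

# Roughly Example A from the table above: 90 votes, ~74% Acceptable.
print(summarize_judgments([1] * 67 + [0] * 23))
```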
3. Taxonomy of Moral Values and Value Extraction
All rationales (human and model-generated) are processed with the Value Kaleidoscope system, which applies LLM-powered classification (validated by prior work) to extract explicit value terms from moral rationales. The full procedure yields:
- 3,783 unique value expressions, mapped via semantic embedding (OpenAI text-embedding-3-large) and agglomerative clustering (a sketch of this step appears after this list).
- Manual expert curation (5 reviewers) produces a 60-value taxonomy encompassing Autonomy, Beneficence, Care, Compassion, Justice, Inclusivity, Freedom, and 53 additional fine-grained categories.
- Each rationale yields a value profile—the set of values referenced, frequency-normalized.
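A minimal sketch of the embedding-and-clustering step, assuming the OpenAI Python SDK and scikit-learn; the distance threshold and linkage settings are illustrative, not the paper's reported configuration:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import AgglomerativeClustering

client = OpenAI()  # requires OPENAI_API_KEY in the environment

value_terms = ["autonomy", "personal freedom", "compassion", "empathy", "fairness"]

# Embed value expressions with the same embedding model named above.
resp = client.embeddings.create(model="text-embedding-3-large", input=value_terms)
X = np.array([d.embedding for d in resp.data])

# Group semantically similar expressions; threshold/linkage are illustrative choices.
clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.4, metric="cosine", linkage="average"
)
labels = clusterer.fit_predict(X)
for term, label in zip(value_terms, labels):
    print(label, term)
```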
For entropy analysis, value diversity per response set is calculated as the Shannon entropy

$$H = -\sum_{v} p_v \log p_v,$$

where $p_v$ is the frequency of value $v$ in either human or model rationales.
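A minimal sketch of this entropy computation over raw value mentions (helper name and example values are illustrative):

```python
import math
from collections import Counter

def value_entropy(value_mentions: list[str]) -> float:
    """Shannon entropy of the frequency-normalized value distribution."""
    counts = Counter(value_mentions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(value_entropy(["Care", "Autonomy", "Care", "Justice", "Honesty"]))
```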
4. Benchmarking LLMs: Protocol
MDD is structured for benchmarking in a distributional and rationale-aware fashion. For every dilemma, each LLM generates as many binary evaluation+rationale pairs as there are human judgments for that case, replicating the human sample size and diversity. Generated rationales are then evaluated both for surface-level label agreement and for value diversity and distributional alignment (a protocol sketch follows the prompting regimes below).
Prompting Regimes:
- Zero-shot: LLMs respond without demographic or persona cues.
- Persona-based: Demographic sampling to match community diversity.
- Model council: Aggregated responses from multiple LLMs acting as a panel.
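A minimal sketch of the protocol under the zero-shot and persona-based regimes; `query_llm`, the persona list, and the prompt wording are hypothetical stand-ins rather than the paper's actual prompts:

```python
import random

PERSONAS = ["a retired teacher", "a young parent", "an ER nurse", "a software engineer"]

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for an actual model call."""
    return random.choice(["Acceptable: ...", "Unacceptable: ..."])

def build_prompt(dilemma: str, regime: str = "zero-shot") -> str:
    instruction = ("Judge whether the narrator's behaviour is Acceptable or "
                   "Unacceptable and give a one-sentence rationale.")
    if regime == "persona":
        persona = random.choice(PERSONAS)  # demographic/persona sampling (illustrative)
        return f"You are {persona}. {instruction}\n\nDilemma: {dilemma}"
    return f"{instruction}\n\nDilemma: {dilemma}"

def benchmark_dilemma(dilemma: str, n_human_votes: int, regime: str = "zero-shot") -> list[str]:
    # Replicate the human sample size: one model judgment per human judgment.
    return [query_llm(build_prompt(dilemma, regime)) for _ in range(n_human_votes)]

print(benchmark_dilemma("My sister asked me to cancel my trip to babysit.", 3, "persona"))
```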
Distributional Alignment Metric:
For each dilemma $d$ with human judgments, the alignment error is

$$\epsilon_d = \left| p_d^{\text{human}} - p_d^{\text{model}} \right|,$$

where $p_d^{\text{human}}$ is the empirical proportion of "Acceptable" in human judgments and $p_d^{\text{model}}$ is the corresponding model proportion. The mean of $\epsilon_d$ is reported across all dilemmas; lower is better.
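A minimal sketch of this metric (hypothetical helper name):

```python
def mean_alignment_error(human_props: list[float], model_props: list[float]) -> float:
    """Mean over dilemmas of |p_d^human - p_d^model|; lower is better."""
    assert len(human_props) == len(model_props)
    return sum(abs(h - m) for h, m in zip(human_props, model_props)) / len(human_props)

# Example: two dilemmas with human Acceptable proportions 0.74 and 0.56.
print(mean_alignment_error([0.74, 0.56], [0.95, 0.90]))  # -> 0.275
```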
5. Major Findings on Model-Human Alignment
MDD establishes two orthogonal axes of evaluation:
- Distributional Judgment Alignment: LLMs reproduce human consensus well only when it is high; in ambiguous scenarios with discordant human judgments, LLMs default to mode-seeking, resulting in poor distributional alignment even for best-in-class models and the model-council setting. Standard prompting does not capture humanlike pluralism in ambiguous cases.
- Value Diversity Gap: Model rationales rely on a much narrower palette of moral values. Human rationales exhibit high value entropy, with the top 10 values accounting for only 35.2% of value mentions. LLM rationales under the standard regime show far lower entropy, with the top 10 values covering 81.6% of mentions, concentrating on Autonomy, Care, and similarly generic values while rarely reflecting the marginalized or situational values prominent in human justifications.
A key intervention, Dynamic Moral Profiling (DMP), which conditions model responses on human value profiles sampled from a Dirichlet prior, improves both distributional alignment (lower mean error, especially in low-consensus cases) and value entropy, yielding greater coverage of mid- and low-frequency values.
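A minimal sketch of the idea behind DMP, sampling a value profile from a Dirichlet prior and conditioning the prompt on it; the value list, concentration parameters, and prompt wording are illustrative, not the paper's exact procedure:

```python
import numpy as np

VALUES = ["Autonomy", "Care", "Justice", "Compassion", "Loyalty", "Honesty"]
# Dirichlet pseudo-counts proportional to observed human value frequencies (illustrative numbers).
alpha = np.array([12.0, 10.0, 6.0, 4.0, 2.0, 1.0])

rng = np.random.default_rng(0)
profile = rng.dirichlet(alpha)                         # one sampled human-like value profile
top_values = [VALUES[i] for i in np.argsort(profile)[::-1][:3]]

# Condition the judgment prompt on the sampled profile.
prompt_prefix = ("When judging this dilemma, reason primarily from the values of "
                 + ", ".join(top_values) + ".")
print(prompt_prefix)
```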
6. Methodological Advantages and Applications
MDD enables pluralism-sensitive, real-world grounded benchmarking for LLMs, providing:
- Direct distributional comparison metrics for assessing models' ability to reflect genuine human heterogeneity.
- Rationale-level value-profile assessment, enabling analysis of which values LLMs invoke or neglect in their justifications.
- An ecologically valid test bed for interventions such as DMP or theory-driven prompting (e.g., Moral Foundations Theory filtering).
MDD is well-positioned for diagnosing risks of monist/majoritarian alignment, for testing LLMs intended for advisory roles, and for evaluating the effectiveness of context- or value-conditioning techniques at surfacing humanlike diversity in automated ethical advice.
7. Comparison and Positioning within Moral Judgment Datasets
Compared to prior datasets:
| Dataset | Scale | Dilemma Source | Label Distribution | Value Taxonomy | Free-text Rationales |
|---|---|---|---|---|---|
| MDD (this work) | 1.6k | Decontextualized real-world (AITA) | Full, per-dilemma | 60 values | Yes |
| Scruples | 32k+ | Reddit-AITA | Full, multiclass | No | No |
| Moral Machine | 40M | Synthetic AV | Aggregate AMCE | None | No |
| ETHICS | 130k | Synthetic/fictional | Binary | Theory-mapped | No |
| UniMoral | 194–294 per scenario × 6 languages | Psych + Reddit | Per-annotator | 4 principle types | Yes |
MDD’s design is unique for capturing both the judgment distribution and the diversity of rationales as produced by naturalistic communities, operationalizing value pluralism as a concrete evaluation axis (Russo et al., 23 Jul 2025).
8. Implications and Future Directions
The MDD reveals that distributional human alignment and value pluralism are not realized by default even in the strongest LLMs, underscoring the need for explicit, data-driven conditioning if LLMs are to play credible roles in sensitive, ethically charged contexts. Pluralistic distributional benchmarks such as MDD provide a more stringent and informative test of moral alignment than single-label or stylized datasets.
The data and methodology demonstrate the inadequacy of majority-vote metrics for capturing human-moral complexity and provide a scalable, extensible foundation for future research on value alignment, model-centric pluralism interventions, and pluralistic explainability in AI moral reasoning.
References:
- "The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and LLMs" (Russo et al., 23 Jul 2025)
- "Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes" (Lourie et al., 2020)
- "Mapping Topics in 100,000 Real-life Moral Dilemmas" (Nguyen et al., 2022)