Moral Foundations Questionnaire Overview

Updated 12 April 2026

The MFQ is a psychometric tool designed to quantify moral reasoning using structured Likert-scale items across fundamental dimensions.
Versions such as MFQ-30 and MFQ-2 measure dimensions such as Care, Fairness, Loyalty, Authority, and Purity, scoring each foundation as the mean of its item responses.
Applied in both human and large language model studies, the MFQ aids in assessing moral robustness, cross-cultural differences, and alignment with human ethical benchmarks.

The Moral Foundations Questionnaire (MFQ) is a psychometric instrument developed to operationalize Moral Foundations Theory (MFT), which posits that human moral reasoning is structured around a small set of evolutionarily rooted “foundations.” The MFQ and its variants have enabled large-scale, quantitative measurement of individuals’ endorsement of these distinct foundational dimensions of morality. The MFQ underlies recent benchmarks for evaluating the moral judgment characteristics of both humans and LLMs, supporting comparative studies of reliability, cross-cultural bias, alignment, and susceptibility to contextual factors such as role-play or persona prompting (Costa et al., 11 Nov 2025, Aksoy, 2024, Münker, 14 Jul 2025, Münker, 2024, Nunes et al., 2024).

1. Instrument Structure and Foundation Taxonomies

The original MFQ measures five foundational dimensions: Harm/Care, Fairness/Cheating (sometimes labeled Fairness/Reciprocity), In-group/Loyalty, Authority/Respect, and Purity/Sanctity (also termed Sanctity/Degradation). Extended versions (e.g., MFQ-2) split fairness into Equality and Proportionality as distinct foundations, and some variants add Liberty/Oppression as a further foundation (Aksoy, 2024, Münker, 14 Jul 2025).

MFQ Version	Item Count	Foundations (f)	Items per f
MFQ (Classic)	30 scored (32 with 2 catch items)	Harm/Care, Fairness, Loyalty, Authority, Purity	6
MFQ-2	36	Care, Equality, Proportionality, Loyalty, Authority, Purity	6

Items are grouped into “relevance” (how much a consideration factors into right/wrong judgments) and “agreement” (degree of endorsement for value-laden statements) sections; each is rated on a bounded Likert-type scale. MFQ-30 and MFQ-32 use a 0–5 integer scale; MFQ-2 uses a 1–5 scale with labeled anchors from “Does not describe me at all” to “Describes me extremely well” (Costa et al., 11 Nov 2025, Aksoy, 2024, Münker, 2024, Nunes et al., 2024).

2. Scoring Methodology and Statistical Formalism

Each moral foundation’s subscore is calculated as a mean over its corresponding items. For MFQ-30:

$\text{Score}_{f} = \frac{1}{6} \sum_{i=1}^{6} r_{i}^{(f)}$

where $r_{i}^{(f)}$ is the integer response to item $i$ for foundation $f$ (Costa et al., 11 Nov 2025, Münker, 2024, Nunes et al., 2024). For MFQ-2, let $Q_f$ be the six items for foundation $f$ and $r_q \in \{1,\dots,5\}$:

$S_f = \frac{1}{6} \sum_{q \in Q_f} r_q$
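The per-foundation mean above can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring key: the item-to-foundation mapping `ITEM_KEY` is hypothetical (it assumes items are stored contiguously, six per foundation), and the function name and the `lo`/`hi` bounds are placeholders to adjust for the version in use (0–5 for MFQ-30, 1–5 for MFQ-2).

```python
from statistics import mean

FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "purity"]
ITEMS_PER_FOUNDATION = 6

# Hypothetical key: positions 0-5 -> care, 6-11 -> fairness, and so on.
# The real questionnaire interleaves items, so substitute the published key.
ITEM_KEY = {
    name: list(range(k * ITEMS_PER_FOUNDATION, (k + 1) * ITEMS_PER_FOUNDATION))
    for k, name in enumerate(FOUNDATIONS)
}

def foundation_scores(responses, key=ITEM_KEY, lo=0, hi=5):
    """Return Score_f, the mean response over the items of each foundation."""
    for r in responses:
        if not (isinstance(r, int) and lo <= r <= hi):
            raise ValueError(f"response {r!r} is not an integer in {lo}..{hi}")
    return {name: mean([responses[i] for i in idx]) for name, idx in key.items()}

# One respondent's 30 answers, grouped by foundation for readability.
example = [5, 4, 5, 4, 5, 4] + [3] * 6 + [2, 2, 3, 3, 2, 3] + [1] * 6 + [0, 1, 0, 1, 0, 1]
scores = foundation_scores(example)
```

Validating the range before averaging keeps an out-of-scale answer from silently shifting a subscore.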

Group-level scores are averages over all group members. Cross-group comparison metrics include mean absolute difference (MAD) and Jensen–Shannon divergence, while Cronbach’s $\alpha$ assesses internal reliability within a group (Münker, 2024, Münker, 14 Jul 2025, Nunes et al., 2024).
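As a rough sketch of how these three quantities are commonly computed, the following uses their textbook definitions; the cited studies may differ in details such as the logarithm base or the variance estimator. All function names are illustrative.

```python
import math
from statistics import variance

def mean_abs_diff(profile_a, profile_b):
    """Mean absolute difference between two foundation profiles (same order)."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b)) / len(profile_a)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Uses log base 2 (an assumption), so the value lies in [0, 1].
    """
    def kl(u, v):
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table of one subscale.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score)),
    using sample variances (an assumption).
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

MAD and Jensen–Shannon divergence compare two groups, whereas `cronbach_alpha` is computed within a single group's responses.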

Benchmarking in LLM studies often involves repeated sampling, persona-conditioned runs, and bootstrapped estimators for within- and across-condition variability (Costa et al., 11 Nov 2025). For example, in the “moral robustness” setting, with personas $\mathcal{P}$, items $\mathcal{Q}$, and $n$ repeats per $(p, q)$ pair, the repeated ratings of each pair are summarized by their mean and standard deviation:

$\mu_{p,q} = \frac{1}{n} \sum_{k=1}^{n} r_{p,q}^{(k)}, \qquad \sigma_{p,q} = \operatorname{sd}\left(r_{p,q}^{(1)}, \dots, r_{p,q}^{(n)}\right)$

Aggregate model/population statistics are then computed over all $(p, q) \in \mathcal{P} \times \mathcal{Q}$, using derived metrics for robustness ($R$) and susceptibility ($S$), as detailed below.
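To make the bootstrapping step concrete, here is a minimal percentile-bootstrap sketch for the average within-pair standard deviation. It is not the cited authors' procedure: the resampling unit (repeats within each persona-item pair), the use of the population standard deviation, and the persona and item labels are all assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

def mean_within_sd(ratings):
    """Average, over (persona, item) pairs, of the SD across repeated ratings."""
    return mean([pstdev(reps) for reps in ratings.values()])

def bootstrap_interval(ratings, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling repeats within each pair."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        resampled = {
            pair: rng.choices(reps, k=len(reps)) for pair, reps in ratings.items()
        }
        draws.append(mean_within_sd(resampled))
    draws.sort()
    lower = draws[int((1 - level) / 2 * n_boot)]
    upper = draws[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper

# Hypothetical data: two personas, two items, four repeats each.
ratings = {
    ("teacher", "q1"): [3, 3, 4, 3],
    ("teacher", "q2"): [5, 5, 5, 5],
    ("judge", "q1"): [2, 3, 2, 2],
    ("judge", "q2"): [4, 4, 4, 4],
}
point_estimate = mean_within_sd(ratings)
lower, upper = bootstrap_interval(ratings)
```

Fixing the seed makes the interval reproducible across runs, which helps when comparing models under the same sampling budget.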

3. Applications: LLM Benchmarking, Cross-Cultural Measurement, and Reliability

The MFQ is extensively leveraged in both human and machine studies to characterize patterns of foundational endorsement. In LLM research, persona prompting and zero-shot or in-context adaptation protocols are used to mimic or elicit “synthetic” MFQ responses under specified cultural, ideological, or demographic personas (Costa et al., 11 Nov 2025, Münker, 14 Jul 2025, Münker, 2024, Aksoy, 2024).

Empirical studies reveal:

Clear model-family effects on robustness (within-persona stability) versus susceptibility (cross-persona variability), with Claude and Gemini models being most robust, and larger models tending toward greater susceptibility (Costa et al., 11 Nov 2025).
Marked compression of response diversity compared to the breadth of observed human moral intuitions, especially in open-weight models, with smaller models (e.g., Qwen 2.5-7B) sometimes yielding closer human alignment than their larger counterparts (Münker, 14 Jul 2025).
Significant language and cultural variation, with English and WEIRD (Western, Educated, Industrialized, Rich, Democratic) language prompts dominating model output distributions, even with cross-lingual MFQ-2 benchmarks (Aksoy, 2024).
Reliability as measured by Cronbach’s $\alpha$ is within or above typical human ranges for some models (e.g., GPT-4 on the overall MFQ); however, “moral hypocrisy,” a lack of correspondence between abstract MFQ scores and grounded scenario judgments, remains a persistent issue (Nunes et al., 2024).

4. Metrics: Moral Robustness, Susceptibility, and Alignment Assessment

MFQ-based benchmarks for LLMs motivated formal metrics that quantify behavioral stability and persona influence:

Moral Robustness ($R$): Quantifies the within-persona stability of MFQ responses. Computed as the inverse of the average within-persona standard deviation, then normalized to [0,1]:

$R \propto \frac{1}{\bar{\sigma}_{\text{within}}}$

where the constant of proportionality is the normalization to [0,1], and $\bar{\sigma}_{\text{within}}$ is the mean standard deviation of responses for all (persona, item) pairs $(p, q)$ (Costa et al., 11 Nov 2025).

Moral Susceptibility ($S$): Measures the across-persona variability of responses, quantifying the influence of persona conditioning:

$S \propto \bar{\sigma}_{\text{across}}$

where $\bar{\sigma}_{\text{across}}$ is the mean standard deviation across persona groups for each item (Costa et al., 11 Nov 2025).

Alignment to human data: Mean absolute difference (MAD), Jensen–Shannon divergence, and multi-foundation calibration indices quantify the distance between model and human population moral profiles (Münker, 14 Jul 2025, Münker, 2024).
Reliability/Consistency: Cronbach’s $\alpha$ is used to assess internal consistency; values comparable to or surpassing human norms have been reported for some LLMs (Nunes et al., 2024).
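The two variability quantities behind robustness and susceptibility can be sketched as follows. This is an illustration under stated assumptions rather than the cited benchmark's implementation: it uses population standard deviations, treats each persona as its own group, and returns only the unnormalized inverse for robustness, since the exact normalization to [0,1] is not reproduced here. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def within_persona_sd(ratings):
    """Mean SD of repeated ratings, averaged over all (persona, item) pairs."""
    return mean([pstdev(reps) for reps in ratings.values()])

def across_persona_sd(ratings):
    """Mean over items of the SD, across personas, of each persona's mean rating."""
    personas = sorted({p for p, _ in ratings})
    items = sorted({q for _, q in ratings})
    per_item = []
    for q in items:
        persona_means = [mean(ratings[(p, q)]) for p in personas]
        per_item.append(pstdev(persona_means))
    return mean(per_item)

def unnormalized_robustness(ratings):
    """Inverse of the within-persona SD; infinite when responses never vary."""
    sd = within_persona_sd(ratings)
    return math.inf if sd == 0 else 1.0 / sd

# Perfectly stable within each persona, but the personas disagree.
stable = {("a", "q1"): [1, 1], ("b", "q1"): [5, 5]}
# Noisy within each persona, but the personas agree on average.
noisy = {("a", "q1"): [2, 4], ("b", "q1"): [2, 4]}
```

The two toy inputs separate the concepts: `stable` has zero within-persona spread and high across-persona spread, while `noisy` shows the reverse.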

5. Cross-Cultural Evaluations and the Limits of Model Alignment

Recent comparative studies using MFQ and MFQ-2 highlight persistent challenges in cross-cultural alignment:

LLM-derived MFQ profiles often “flatten” cultural variability, compressing the natural spread of human scores on Loyalty, Authority, and Purity (Münker, 14 Jul 2025).
Model outputs are dominated by the statistical and linguistic properties of their training data—models trained primarily on English or other WEIRD language corpora demonstrate norm imposition even when prompted with culturally specific personas (Aksoy, 2024).
Model family and scale do not guarantee improved cultural fidelity; smaller models can outperform larger ones and vice versa, depending on training data and architecture (Münker, 14 Jul 2025, Münker, 2024).
Simple persona prompting for ideological simulation is insufficient to recapitulate real-world MFQ profiles; fine-grained, value-driven alignment frameworks, explicit context embeddings, and cross-lingual calibration are required for closer alignment (Münker, 2024).
Human-model coherence (the ability for MFQ scores to predict scenario-based moral judgments) is extremely low for current LLMs; declared foundations do not translate into consistent scenario evaluation, suggesting “moral hypocrisy” and a failure of deep alignment (Nunes et al., 2024).

6. Methodological and Practical Considerations

MFQ deployment across research contexts requires rigorous attention to survey design, scaling, translation, and statistical analysis:

Item translations for MFQ-2 must preserve not only linguistic but also cultural meaning, with reference to validated human datasets (Aksoy, 2024).
Benchmarking protocols for LLMs often require parsing and filtering to ensure valid integer or categorical response formats (Costa et al., 11 Nov 2025, Nunes et al., 2024).
Evaluations must account for potential biases arising from context prompts, output temperature, and “guardrails” imposed by RLHF or other alignment training (Münker, 2024).
Reliable alignment remains an ongoing challenge. Proposed strategies for improvement include: expanding culturally representative moral psychology datasets, designing objective functions for cross-population divergence minimization, prompt tuning with demographic context, and dynamic, scenario-based assessment frameworks (Münker, 14 Jul 2025, Aksoy, 2024, Münker, 2024).
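The parsing and filtering step mentioned above might look like the following sketch. It is one conservative policy among many and is not taken from the cited protocols: a reply is accepted only when it contains exactly one distinct standalone integer and that integer lies on the scale; anything else is treated as invalid so it can be discarded or resampled. The function name and default bounds are assumptions.

```python
import re

# A standalone integer: not part of a decimal, a longer number, or a negative value.
_INT = re.compile(r"(?<![\d.\-])\d+(?!\.?\d)")

def parse_rating(text, lo=0, hi=5):
    """Return the rating in a model reply, or None if the reply is ambiguous or invalid."""
    found = {int(m) for m in _INT.findall(text)}
    if len(found) != 1:
        return None  # no integer at all, or several competing integers
    value = found.pop()
    return value if lo <= value <= hi else None

replies = ["Rating: 3", "My answer is 4.", "4 out of 5", "3.5", "7", "I cannot answer."]
parsed = [parse_rating(r) for r in replies]
```

Rejecting replies such as "4 out of 5" is deliberately strict; a more lenient rule (for example, taking the first integer) recovers more responses at the cost of occasional misreads.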

In sum, the Moral Foundations Questionnaire and its revisions underpin a rapidly evolving methodology for moral psychometrics in both human and artificial agents. Its principled structure and diverse metric toolkit have catalyzed a wave of empirical studies probing reliability, robustness, susceptibility, alignment, and cultural sensitivity in next-generation LLMs, while highlighting the remaining frontier of deep, context-dependent moral understanding (Costa et al., 11 Nov 2025, Aksoy, 2024, Münker, 14 Jul 2025, Münker, 2024, Nunes et al., 2024).