Supervised Moral Rationale Attention (SMRA)

Updated 9 January 2026

The paper introduces SMRA, a framework that supervises model attention using expert-annotated moral rationales drawn from Moral Foundations Theory.
It extends Transformer-based classifiers by aligning token-level attention with culturally contextualized moral evidence, reducing reliance on spurious lexical cues.
Empirical evaluations on HateBRMoralXplain show modest gains in classification metrics (for example, +0.86 pp Macro-F1 on hate detection) and larger gains in explanation plausibility (+7.4 pp IOU-F1), with no notable increase in subgroup bias.

Supervised Moral Rationale Attention (SMRA) is a framework for self-explaining hate speech detection that directly supervises model attention using expert-annotated moral rationales, shifting the focus of neural classifiers from spurious surface-level lexical cues to normatively salient linguistic evidence. By integrating annotations grounded in Moral Foundations Theory (MFT), SMRA generates model-intrinsic, culturally contextualized explanations and achieves improved robustness, interpretability, and cross-cultural generalization in both hate speech and moral sentiment classification tasks (Vargas et al., 7 Jan 2026).

1. Motivation and Theoretical Basis

Neural hate speech classifiers, while performant, historically rely on surface-level lexical features, leading to two principal limitations: (1) vulnerability to spurious correlations and poor generalization across cultural contexts, and (2) lack of robust, transparent model explanations. Post-hoc explanation techniques (e.g., gradient-based or surrogate rationales) are frequently unfaithful and computationally expensive at inference, while previous self-explaining models (such as Supervised Rational Attention, SRA) typically leverage lexical rationales that do not capture the normative, moral substrate of hate speech (Vargas et al., 7 Jan 2026).

Hate speech is fundamentally moral in character, often justified by appeals to moral sentiments such as purity, authority, or loyalty. Moral Foundations Theory provides a cross-culturally validated taxonomy of five core dimensions—Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Purity/Degradation—each expressed as virtue or violation. SMRA leverages this theory by aligning model attention with human-annotated rationales grounded in these moral categories, aiming to produce more generalizable, culturally aware, and auditable detection systems (Vargas et al., 7 Jan 2026).

2. Model Architecture and Training Procedure

SMRA extends a Transformer-based text classifier (e.g., BERTimbau) by supervising the alignment between token-level attention and annotated moral rationales. Given a tokenized input $x = (w_1, \ldots, w_L)$ and contextual embeddings $\mathbf{h}_1, \ldots, \mathbf{h}_L \in \mathbb{R}^d$, the [CLS] embedding $\mathbf{h}_{\text{[CLS]}}$ is passed through a linear layer followed by a softmax for hate or moral category classification.

The model extracts a normalized attention vector $\mathbf{a} = (a_1, \ldots, a_L)$ from the final encoder layer, typically the attention weights from the [CLS] token to each input token. Supervised moral rationale alignment is achieved via a token-wise binary cross-entropy loss: $\mathcal{L}_{\mathrm{align}} = -\sum_{i=1}^L \bigl[ r_i \log(a_i) + (1 - r_i) \log(1 - a_i) \bigr],$ where the binary rationale mask $r = (r_1, \ldots, r_L)$ is derived from expert-annotated, minimal text spans corresponding to moral justifications.

The final training objective combines the cross-entropy classification loss $\mathcal{L}_{\mathrm{CE}}$ with the attention alignment loss: $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda\mathcal{L}_{\mathrm{align}},$ where the hyperparameter $\lambda$ is set to approximately $10^{-3}$ and $\mathcal{L}_{\mathrm{align}}$ is applied only to samples labeled with at least one moral category (Vargas et al., 7 Jan 2026).
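The two loss formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `eps` clamp, and the `has_moral_label` flag are assumptions introduced here, and a real training loop would compute the same quantities on tensors.

```python
import math

def alignment_loss(attention, rationale_mask, eps=1e-8):
    """Token-wise binary cross-entropy between attention weights a_i
    and a 0/1 rationale mask r_i, summed over tokens."""
    if len(attention) != len(rationale_mask):
        raise ValueError("attention and rationale_mask must have the same length")
    total = 0.0
    for a, r in zip(attention, rationale_mask):
        # Clamp (an assumption, not from the source) so log() stays finite.
        a = min(max(a, eps), 1.0 - eps)
        total -= r * math.log(a) + (1 - r) * math.log(1.0 - a)
    return total

def total_loss(ce_loss, attention, rationale_mask, has_moral_label, lam=1e-3):
    """L_total = L_CE + lambda * L_align.
    The alignment term is skipped for samples without a moral label."""
    if not has_moral_label:
        return ce_loss
    return ce_loss + lam * alignment_loss(attention, rationale_mask)

# Hypothetical three-token example: attention mass concentrated on token 2,
# which is also the annotated rationale token.
attention = [0.1, 0.6, 0.3]
mask = [0, 1, 0]
print(total_loss(0.7, attention, mask, has_moral_label=True))
```

With $\lambda \approx 10^{-3}$ the alignment term acts as a light regularizer: it nudges attention toward the annotated spans while the classification loss still dominates the gradient.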

3. Moral Annotation Schema and Process

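The SMRA approach uses six labels: one for each of the five MFT dimensions, plus a “Non-Morality” (NN) category. A single comment can carry more than one moral label (1–3 salience-ranked categories per comment in the dataset described below). The labels are: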

Abbreviation	MFT Dimension (Virtue/Violation Collapsed)	Example Justification
HN	Care vs. Harm	“inflicting cruelty”
FN	Fairness vs. Cheating	“stealing rights”
LN	Loyalty vs. Betrayal	“abandoning one’s group”
AN	Authority vs. Subversion	“undermining tradition”
PN	Purity vs. Degradation	“contamination,” “disgust”
NN	Non-Morality	no moral content

Expert annotators, diverse in ideology, geography, race, and gender, selected minimal contiguous text spans that justified assigned moral labels. Spans were automatically tokenized to obtain binary masks for supervised alignment. Multi-hop reasoning was encouraged by precluding span re-use across labels. Annotators followed rigorous guidelines and completed pre-annotation questionnaires to capture potential sources of bias (Vargas et al., 7 Jan 2026).
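The step in which annotated spans are turned into binary masks can be sketched as follows. The sketch assumes character-offset spans and plain whitespace tokenization; the actual pipeline presumably uses the backbone's subword tokenizer, and the helper names and the English example sentence are hypothetical.

```python
def whitespace_tokens_with_offsets(text):
    """Split on whitespace and return (token, start, end) character offsets."""
    tokens = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens

def spans_to_mask(text, spans):
    """Return a 0/1 mask with 1 for every token that overlaps
    any annotated (start, end) character span (end exclusive)."""
    mask = []
    for _, tok_start, tok_end in whitespace_tokens_with_offsets(text):
        overlaps = any(tok_start < s_end and s_start < tok_end
                       for s_start, s_end in spans)
        mask.append(1 if overlaps else 0)
    return mask

# Hypothetical example: one contiguous rationale span.
text = "they are abandoning our group today"
start = text.index("abandoning")
end = start + len("abandoning our group")
print(spans_to_mask(text, [(start, end)]))  # [0, 0, 1, 1, 1, 0]
```

With a subword tokenizer the same overlap test applies to each subword's character offsets, so a word split into several pieces marks all of its pieces as rationale tokens.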

4. Dataset: HateBRMoralXplain

SMRA is benchmarked on HateBRMoralXplain, which extends the HateBR corpus with comprehensive moral and socio-political annotations:

Size and Source: 7,000 Brazilian-Portuguese Instagram comments from six major public political accounts (balanced for political orientation and gender).
Labels: Each comment is annotated for binary hate (3,500 hate, 3,500 non-hate), 1–3 salience-ranked moral categories, their corresponding token-level rationales, and socio-political metadata (author’s party/gender, post themes, URL).
Annotation Quality: Cohen’s quadratic-weighted κ ranges from 0.611 to 0.811 across classes, denoting moderate to substantial reliability.
Data Characteristics: Each moral category is assigned in 20–35% of comments. Span lengths average 3–5 tokens (Vargas et al., 7 Jan 2026).
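For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of quadratic-weighted Cohen's κ for two annotators. It assumes labels are encoded as integers on an ordinal scale from 0 to `num_categories - 1`; how the paper encodes its ordinal scale is not stated here, so the encoding is an assumption.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Cohen's kappa with quadratic disagreement weights
    w_ij = (i - j)^2 / (k - 1)^2 for two raters."""
    n = len(rater_a)
    k = num_categories
    if n == 0 or n != len(rater_b) or k < 2:
        raise ValueError("need two equal-length, non-empty ratings and k >= 2")
    # Observed confusion matrix between the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)
            weighted_observed += w * observed[i][j]
            # Expected count under independent raters with the same marginals.
            weighted_expected += w * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected

# Hypothetical ratings on a three-point scale.
print(quadratic_weighted_kappa([0, 1, 2, 1, 0], [0, 1, 2, 2, 0], 3))
```

The quadratic weights penalize large ordinal disagreements more than adjacent ones, which is why this variant is preferred over unweighted κ when labels are ranked.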

5. Experimental Setup and Metrics

SMRA is evaluated on:

Binary hate speech detection ({Hate, Non-Hate})
Multi-label moral sentiment classification ({NN, HN, FN, PN, AN, LN})

Baselines and Comparators:

Fine-tuned mBERT and BERTimbau
SMRA-augmented versions with attention supervision
Supervised Rational Attention (lexical rationales), CNN, BiRNN+Attention
LLMs: GPT-4o-mini, Llama-70B under several prompting schemes

Data Partitioning: 80% train, 10% validation, 10% test
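An 80/10/10 partition like the one described can be produced with a seeded shuffle. This is a sketch only: the seed and function name are hypothetical, and the source does not say whether the split was stratified by label, so no stratification is applied here.

```python
import random

def split_80_10_10(items, seed=13):
    """Shuffle deterministically and cut into train/validation/test parts."""
    items = list(items)
    rng = random.Random(seed)  # seed is an illustrative choice
    rng.shuffle(items)
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# With 7,000 comment IDs this yields 5,600 / 700 / 700 examples.
train, val, test = split_80_10_10(range(7000))
print(len(train), len(val), len(test))  # 5600 700 700
```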

Optimization: Batch size 16, AdamW optimizer, up to 20 epochs, max input 128 tokens (the learning-rate value is not recoverable from the source text)

Metrics:

Classification: Accuracy, Macro-F1, AUROC
Plausibility: IOU-F1 and Token-F1 (average agreement between predicted and ground-truth rationales)
Faithfulness: Comprehensiveness (Comp) and Sufficiency (Suff) scores
Fairness/Bias: GMB-Sub, GMB-BPSN, GMB-BNSP (threshold-agnostic AUCs)
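The two plausibility metrics can be illustrated on binary token masks. The sketch below assumes the ERASER-style convention, in which a predicted span counts as a match when its intersection-over-union with some gold span is at least 0.5; whether the paper uses exactly this matching rule and threshold is an assumption, and all function names are hypothetical.

```python
def _f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def token_f1(pred_mask, gold_mask):
    """F1 over individual tokens marked as rationale."""
    tp = sum(1 for p, g in zip(pred_mask, gold_mask) if p and g)
    pred_pos, gold_pos = sum(pred_mask), sum(gold_mask)
    if pred_pos == 0 or gold_pos == 0:
        return 0.0
    return _f1(tp / pred_pos, tp / gold_pos)

def mask_to_spans(mask):
    """Convert a 0/1 mask into contiguous (start, end) token spans, end exclusive."""
    spans, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(mask)))
    return spans

def span_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def iou_f1(pred_mask, gold_mask, threshold=0.5):
    """F1 over spans, where a span matches if IOU >= threshold with any span on the other side."""
    pred_spans, gold_spans = mask_to_spans(pred_mask), mask_to_spans(gold_mask)
    if not pred_spans or not gold_spans:
        return 0.0
    hit_pred = sum(1 for p in pred_spans
                   if any(span_iou(p, g) >= threshold for g in gold_spans))
    hit_gold = sum(1 for g in gold_spans
                   if any(span_iou(p, g) >= threshold for p in pred_spans))
    return _f1(hit_pred / len(pred_spans), hit_gold / len(gold_spans))

# Hypothetical masks for a six-token comment.
pred = [0, 1, 1, 1, 0, 0]
gold = [0, 0, 1, 1, 0, 1]
print(token_f1(pred, gold), iou_f1(pred, gold))
```

Token-F1 rewards partial overlap token by token, while IOU-F1 only credits spans that overlap substantially, so the two can diverge when predictions are fragmented.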

These metrics comprehensively assess predictive, explanatory, and fairness performance (Vargas et al., 7 Jan 2026).
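The faithfulness scores can be sketched in the same spirit. The definitions below follow the common ERASER-style formulation (comprehensiveness is the drop in predicted-class probability when rationale tokens are removed; sufficiency is the drop when only rationale tokens are kept). That the paper uses these exact definitions and sign conventions is an assumption, and `toy_predict_proba` is a hypothetical stand-in for a real classifier.

```python
def comprehensiveness(predict_proba, tokens, rationale_mask):
    """p(y | full input) - p(y | input with rationale tokens removed)."""
    full = predict_proba(tokens)
    without = predict_proba([t for t, m in zip(tokens, rationale_mask) if not m])
    return full - without

def sufficiency(predict_proba, tokens, rationale_mask):
    """p(y | full input) - p(y | rationale tokens only)."""
    full = predict_proba(tokens)
    only = predict_proba([t for t, m in zip(tokens, rationale_mask) if m])
    return full - only

def toy_predict_proba(tokens):
    """Hypothetical classifier: probability of the predicted class
    grows with the number of occurrences of one trigger word."""
    hits = sum(1 for t in tokens if t == "bad")
    return min(1.0, 0.2 + 0.4 * hits)

tokens = ["this", "is", "bad", "stuff"]
mask = [0, 0, 1, 0]
print(comprehensiveness(toy_predict_proba, tokens, mask))
print(sufficiency(toy_predict_proba, tokens, mask))
```

In this toy case the rationale carries all of the evidence, so removing it lowers the probability (positive comprehensiveness) while keeping only the rationale leaves the prediction unchanged (sufficiency gap of zero).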

6. Empirical Results and Comparative Analysis

SMRA consistently improves task performance and explanation quality without introducing additional subgroup bias:

Task	Backbone	Macro-F1 (Δ pp)	IOU-F1 (Δ pp)	Token-F1 (Δ pp)	Accuracy	AUROC	Suff. (Δ pp)	GMB Fairness Impact
Hate/Non-hate	BERTimbau	0.9028→0.9114 (+0.86)	0.7612→0.8355 (+7.4)	0.8455→0.8958 (+5.0)	0.9029→0.9114	~0.965	+2.3	Minor fluctuation
Multi-label moral	BERTimbau	0.757→0.772 (+1.5)	—	—	—	0.925→0.927	—	—

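Plausibility: With SMRA, IOU-F1 rises from 0.7612 to 0.8355 and Token-F1 from 0.8455 to 0.8958 on hate detection. The BERTimbau baseline without attention supervision already scores above the figures quoted for SRA (0.7612 vs. 0.7160 IOU-F1 and 0.8455 vs. 0.7450 Token-F1), so SMRA exceeds both.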
Faithfulness: Rationale sufficiency improves (+2.3 pp), with explanations more concise yet still accounting for model predictions; comprehensiveness remains stable.
Fairness: Bias metrics (GMB-Sub, GMB-BPSN, GMB-BNSP) show only minor deviations, indicating enhanced interpretability does not exacerbate performance disparities across identity groups.
LLM Prompting: Including explicit moral value definitions and multi-task prompting in LLMs consistently outperforms hate-only prompts, but fine-tuned SMRA models achieve highest overall performance (Vargas et al., 7 Jan 2026).

7. Limitations and Prospects for Future Research

SMRA’s core advancement is the redirection of model inductive bias away from lexical patterns toward normatively and culturally meaningful evidence. This produces more interpretable, robust hate speech classifiers, particularly valuable for low-resource languages and contexts.

Notable limitations include the considerable cost of expert rationale annotation and the present restriction of evaluation to a single language and discourse domain. Directions for future research include scaling moral rationale acquisition via self-training or active learning, investigating cross-lingual transfer of supervision, integrating SMRA with LLM fine-tuning to reduce annotation burdens, and conducting extended sociological analyses to study annotator effects on moral categorization.

The SMRA framework represents a step forward in building transparent, fair classifiers whose outputs are justified by clearly articulated and normatively situated moral logic, providing auditable explanations consistent with human reasoning (Vargas et al., 7 Jan 2026).

References

Vargas et al. (7 Jan 2026). Self-Explaining Hate Speech Detection with Moral Rationales.
