Supervised Moral Rationale Attention (SMRA)
- The paper introduces SMRA, a framework that supervises model attention using expert-annotated moral rationales drawn from Moral Foundations Theory.
- It extends Transformer-based classifiers by aligning token-level attention with culturally contextualized moral evidence, reducing reliance on spurious lexical cues.
- Empirical evaluations on HateBRMoralXplain show significant improvements in classification metrics and explanation quality, with enhanced robustness and fairness.
Supervised Moral Rationale Attention (SMRA) is a framework for self-explaining hate speech detection that directly supervises model attention using expert-annotated moral rationales, shifting the focus of neural classifiers from spurious surface-level lexical cues to normatively salient linguistic evidence. By integrating annotations grounded in Moral Foundations Theory (MFT), SMRA generates model-intrinsic, culturally contextualized explanations and achieves improved robustness, interpretability, and cross-cultural generalization in both hate speech and moral sentiment classification tasks (Vargas et al., 7 Jan 2026).
1. Motivation and Theoretical Basis
Neural hate speech classifiers, while performant, have historically relied on surface-level lexical features. This leads to two principal limitations: (1) vulnerability to spurious correlations and poor generalization across cultural contexts, and (2) a lack of robust, transparent model explanations. Post-hoc explanation techniques (e.g., gradient-based or surrogate rationales) are frequently unfaithful and computationally expensive at inference. Previous self-explaining models, such as Supervised Rational Attention (SRA), typically rely on lexical rationales that do not capture the normative, moral substrate of hate speech (Vargas et al., 7 Jan 2026).
Hate speech is fundamentally moral in character, often justified by appeals to moral sentiments such as purity, authority, or loyalty. Moral Foundations Theory provides a cross-culturally validated taxonomy of five core dimensions—Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Purity/Degradation—each expressed as virtue or violation. SMRA leverages this theory by aligning model attention with human-annotated rationales grounded in these moral categories, aiming to produce more generalizable, culturally aware, and auditable detection systems (Vargas et al., 7 Jan 2026).
2. Model Architecture and Training Procedure
SMRA extends a Transformer-based text classifier (e.g., BERTimbau) by supervising the alignment of token-level attention with annotated moral rationales. Given a tokenized input $x = (x_1, \dots, x_n)$ and contextual embeddings $H = (h_1, \dots, h_n)$, the [CLS] embedding is passed through a linear-softmax head for hate or moral category classification.
The model extracts a normalized attention vector $\alpha = (\alpha_1, \dots, \alpha_n)$ from the final encoder layer, typically the attention weights from the [CLS] token to each input token. Supervised moral rationale alignment is achieved via a token-wise cross-entropy loss $\mathcal{L}_{\text{att}}$ between $\alpha$ and a binary rationale mask $r \in \{0, 1\}^n$, where $r$ is derived from expert-annotated, minimal text spans corresponding to moral justifications.
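As a rough illustration of the extraction step, the sketch below builds one normalized attention vector from the [CLS] row of a final-layer attention tensor. Averaging over heads, the `cls_attention_vector` name, and the toy numbers are assumptions made for illustration; the source does not state how heads are combined.

```python
from typing import List

Matrix = List[List[float]]


def cls_attention_vector(last_layer_attn: List[Matrix], cls_index: int = 0) -> List[float]:
    """Return one normalized attention weight per token, read from the [CLS] row.

    last_layer_attn[h][i][j] is the weight head h assigns to token j when
    encoding token i. Averaging over heads is an assumption of this sketch.
    """
    num_heads = len(last_layer_attn)
    seq_len = len(last_layer_attn[0][cls_index])
    # Average the [CLS] row across heads, token by token.
    averaged = [
        sum(head[cls_index][j] for head in last_layer_attn) / num_heads
        for j in range(seq_len)
    ]
    # Renormalize so the weights sum to 1 even if the rows were truncated or masked.
    total = sum(averaged)
    return [a / total for a in averaged]


# Toy final-layer attention: 2 heads, 3 tokens, token 0 plays the role of [CLS].
toy_attn = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]],
    [[0.1, 0.6, 0.3], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
]
alpha = cls_attention_vector(toy_attn)
```

In a real pipeline the tensor would come from the encoder itself (for example, by requesting attention outputs from the model) rather than from hand-written lists.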
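The final training objective combines the cross-entropy classification loss $\mathcal{L}_{\text{cls}}$ with the attention alignment loss $\mathcal{L}_{\text{att}}$: $\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda \, \mathcal{L}_{\text{att}}$, where $\lambda$ is a weighting hyperparameter and the alignment term is applied only to samples labeled with at least one moral category (Vargas et al., 7 Jan 2026).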
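The combined objective can be sketched in a few lines of plain Python. The exact form of the alignment term is not given here, so this sketch assumes a cross-entropy between the attention vector and the rationale mask normalized to sum to 1. The function names and the value `lam=1.0` are placeholders, not values from the paper.

```python
import math
from typing import Sequence


def attention_alignment_loss(alpha: Sequence[float], mask: Sequence[int], eps: float = 1e-12) -> float:
    """Cross-entropy between attention and the normalized rationale mask (assumed form)."""
    n_rationale = sum(mask)
    if n_rationale == 0:
        return 0.0  # no annotated rationale tokens, nothing to align to
    # Target distribution: uniform mass over rationale tokens, zero elsewhere.
    return -sum((m / n_rationale) * math.log(a + eps) for a, m in zip(alpha, mask))


def classification_loss(probs: Sequence[float], gold: int) -> float:
    """Standard cross-entropy on the softmax output for the gold class."""
    return -math.log(probs[gold])


def smra_loss(probs, gold, alpha, mask, has_moral_label: bool, lam: float) -> float:
    """Classification loss plus a weighted alignment term, applied selectively."""
    loss = classification_loss(probs, gold)
    if has_moral_label:  # alignment only for samples with at least one moral category
        loss += lam * attention_alignment_loss(alpha, mask)
    return loss


# Toy sample: 3 tokens, the last two form the annotated rationale.
with_rationale = smra_loss([0.2, 0.8], 1, [0.1, 0.6, 0.3], [0, 1, 1], True, lam=1.0)
without_rationale = smra_loss([0.2, 0.8], 1, [0.1, 0.6, 0.3], [0, 0, 0], False, lam=1.0)
```

The alignment term shrinks as attention mass moves onto the rationale tokens, which is the behavior the supervision is meant to encourage.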
3. Moral Annotation Schema and Process
The SMRA approach uses six distinct labels: one for each of the five MFT dimensions, plus a “Non-Morality” (NN) category. The labels are:
| Abbreviation | MFT Dimension (Virtue/Violation Collapsed) | Example Justification |
|---|---|---|
| HN | Care vs. Harm | “inflicting cruelty” |
| FN | Fairness vs. Cheating | “stealing rights” |
| LN | Loyalty vs. Betrayal | “abandoning one’s group” |
| AN | Authority vs. Subversion | “undermining tradition” |
| PN | Purity vs. Degradation | “contamination,” “disgust” |
| NN | Non-Morality | no moral content |
Expert annotators, diverse in ideology, geography, race, and gender, selected minimal contiguous text spans that justified assigned moral labels. Spans were automatically tokenized to obtain binary masks for supervised alignment. Multi-hop reasoning was encouraged by precluding span re-use across labels. Annotators followed rigorous guidelines and completed pre-annotation questionnaires to capture potential sources of bias (Vargas et al., 7 Jan 2026).
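The conversion from annotated spans to binary token masks could look roughly like the following. This sketch assumes character-offset spans and a whitespace tokenizer; a subword tokenizer with offset mappings would serve the same role. The helper names and the example sentence are hypothetical, not taken from the dataset.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace tokenizer returning (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def rationale_mask(text: str, spans: List[Tuple[int, int]]) -> Tuple[List[str], List[int]]:
    """Mark a token with 1 if it overlaps any annotated (start, end) character span."""
    tokens, mask = [], []
    for token, start, end in tokenize_with_offsets(text):
        overlaps = any(start < s_end and end > s_start for s_start, s_end in spans)
        tokens.append(token)
        mask.append(1 if overlaps else 0)
    return tokens, mask


# Made-up example: the annotated span is the phrase "undermining tradition".
text = "he keeps undermining tradition every day"
span_start = text.index("undermining tradition")
span = (span_start, span_start + len("undermining tradition"))
tokens, mask = rationale_mask(text, [span])
```

Using overlap rather than exact containment means a token is marked even when a span boundary falls inside it, which matters once subword tokenization is involved.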
4. Dataset: HateBRMoralXplain
SMRA is benchmarked on HateBRMoralXplain, which extends the HateBR corpus with comprehensive moral and socio-political annotations:
- Size and Source: 7,000 Brazilian-Portuguese Instagram comments from six major public political accounts (balanced for political orientation and gender).
- Labels: Each comment is annotated for binary hate (3,500 hate, 3,500 non-hate), 1–3 salience-ranked moral categories, their corresponding token-level rationales, and socio-political metadata (author’s party/gender, post themes, URL).
- Annotation Quality: Cohen’s quadratic-weighted κ ranges from 0.611 to 0.811 across classes, denoting moderate to substantial reliability.
- Data Characteristics: Each moral category is assigned in 20–35% of comments. Span lengths average 3–5 tokens (Vargas et al., 7 Jan 2026).
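For readers unfamiliar with the agreement statistic, a minimal sketch of Cohen's quadratic-weighted κ for two annotators follows. It assumes ratings are integer codes on an ordinal scale with at least two categories. Which scale the authors weight (for example, salience ranks) is not specified here, so the toy ratings are purely illustrative.

```python
from typing import Sequence


def quadratic_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int], num_categories: int) -> float:
    """Cohen's kappa with quadratic disagreement weights w_ij = (i - j)^2 / (k - 1)^2."""
    n, k = len(rater_a), num_categories
    # Observed joint counts of (rater_a category, rater_b category).
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(1 for a in rater_a if a == c) for c in range(k)]
    hist_b = [sum(1 for b in rater_b if b == c) for c in range(k)]
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0.0:
        return 1.0  # both raters used a single identical category
    return 1.0 - weighted_observed / weighted_expected


# Toy ratings on a 3-point ordinal scale; the raters disagree on one item by one step.
kappa = quadratic_weighted_kappa([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 2], num_categories=3)
```

Quadratic weights penalize a two-step disagreement four times as heavily as a one-step disagreement, so near-misses on an ordinal scale cost relatively little.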
5. Experimental Setup and Metrics
SMRA is evaluated on:
- Binary hate speech detection ({Hate, Non-Hate})
- Multi-label moral sentiment classification ({NN, HN, FN, PN, AN, LN})
Baselines and Comparators:
- Fine-tuned mBERT and BERTimbau
- SMRA-augmented versions with attention supervision
- Supervised Rational Attention (lexical rationales), CNN, BiRNN+Attention
- LLMs: GPT-4o-mini, Llama-70B under several prompting schemes
Data Partitioning: 80% train, 10% validation, 10% test
Optimization: Batch size 16, learning rate , up to 20 epochs, AdamW optimizer, max input 128 tokens
Metrics:
- Classification: Accuracy, Macro-F1, AUROC
- Plausibility: IOU-F1 and Token-F1 (average agreement between predicted and ground-truth rationales)
- Faithfulness: Comprehensiveness (Comp) and Sufficiency (Suff) scores
- Fairness/Bias: GMB-Sub, GMB-BPSN, GMB-BNSP (threshold-agnostic AUCs)
These metrics comprehensively assess predictive, explanatory, and fairness performance (Vargas et al., 7 Jan 2026).
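The plausibility metrics can be illustrated with a simplified, ERASER-style sketch. It assumes one predicted and one gold rationale per instance, a top-k rule for turning attention into a predicted rationale, and an IOU match threshold of 0.5. None of these choices is confirmed by the source, and the function names are illustrative.

```python
from typing import List, Sequence


def top_k_mask(alpha: Sequence[float], k: int) -> List[int]:
    """Predicted rationale: the k tokens with the highest attention (assumed rule)."""
    chosen = set(sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)[:k])
    return [1 if i in chosen else 0 for i in range(len(alpha))]


def token_f1(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Token-level F1 between predicted and gold rationale masks."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / sum(pred), tp / sum(gold)
    return 2 * precision * recall / (precision + recall)


def iou(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Intersection over union of the two token sets."""
    inter = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    union = sum(1 for p, g in zip(pred, gold) if p == 1 or g == 1)
    return inter / union if union else 0.0


def iou_f1(preds: List[Sequence[int]], golds: List[Sequence[int]], threshold: float = 0.5) -> float:
    """With one rationale per instance, precision equals recall, so F1 reduces
    to the fraction of instances whose IOU reaches the threshold."""
    matches = sum(1 for p, g in zip(preds, golds) if iou(p, g) >= threshold)
    return matches / len(preds)


pred_a = top_k_mask([0.05, 0.10, 0.50, 0.35], k=2)  # selects the last two tokens
gold_a = [0, 0, 1, 1]
pred_b, gold_b = [0, 1, 1, 0], [0, 1, 1, 1]
corpus_token_f1 = (token_f1(pred_a, gold_a) + token_f1(pred_b, gold_b)) / 2
corpus_iou_f1 = iou_f1([pred_a, pred_b], [gold_a, gold_b])
```

Token-F1 rewards partial overlap in proportion to its size, while IOU-F1 counts an instance as matched or not, so the two can diverge on long rationales.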
6. Empirical Results and Comparative Analysis
SMRA consistently improves task performance and explanation quality without introducing additional subgroup bias:
| Task | Backbone | Macro-F1 (Δ pp) | IOU-F1 (Δ pp) | Token-F1 (Δ pp) | Accuracy | AUROC | Suff. (Δ pp) | GMB Fairness Impact |
|---|---|---|---|---|---|---|---|---|
| Hate/Non-hate | BERTimbau | 0.9028→0.9114 (+0.86) | 0.7612→0.8355 (+7.4) | 0.8455→0.8958 (+5.0) | 0.9029→0.9114 | ~0.965 | +2.3 | Minor fluctuation |
| Multi-label moral | BERTimbau | 0.757→0.772 (+1.5) | — | — | — | 0.925→0.927 | — | — |
- Plausibility: Token-level IOU-F1 and Token-F1 improve substantially compared to both baselines and SRA (0.7612 vs. 0.7160 and 0.8455 vs. 0.7450, respectively).
- Faithfulness: Rationale sufficiency improves (+2.3 pp), with explanations more concise yet still accounting for model predictions; comprehensiveness remains stable.
- Fairness: Bias metrics (GMB-Sub, GMB-BPSN, GMB-BNSP) show only minor deviations, indicating enhanced interpretability does not exacerbate performance disparities across identity groups.
- LLM Prompting: Including explicit moral value definitions and multi-task prompting in LLMs consistently outperforms hate-only prompts, but fine-tuned SMRA models achieve highest overall performance (Vargas et al., 7 Jan 2026).
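The faithfulness scores mentioned above are commonly computed as in the ERASER benchmark. The sketch below follows that definition: comprehensiveness is the drop in predicted probability when the rationale is removed, and sufficiency is the drop when only the rationale is kept. `toy_predict` is a hypothetical stand-in for a trained classifier, and the sign convention used in the paper's reported deltas is not restated in this text.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[List[str]], List[float]]


def comprehensiveness(predict: Predictor, tokens: Sequence[str], mask: Sequence[int], label: int) -> float:
    """p(label | full input) - p(label | input with rationale tokens removed)."""
    without_rationale = [t for t, m in zip(tokens, mask) if m == 0]
    return predict(list(tokens))[label] - predict(without_rationale)[label]


def sufficiency(predict: Predictor, tokens: Sequence[str], mask: Sequence[int], label: int) -> float:
    """p(label | full input) - p(label | rationale tokens only)."""
    only_rationale = [t for t, m in zip(tokens, mask) if m == 1]
    return predict(list(tokens))[label] - predict(only_rationale)[label]


def toy_predict(tokens: List[str]) -> List[float]:
    """Hypothetical classifier: class-1 probability grows with the number of cue words."""
    cues = {"undermining", "tradition"}
    p = min(1.0, 0.1 + 0.4 * sum(1 for t in tokens if t in cues))
    return [1.0 - p, p]


tokens = ["he", "keeps", "undermining", "tradition"]
mask = [0, 0, 1, 1]
comp = comprehensiveness(toy_predict, tokens, mask, label=1)
suff = sufficiency(toy_predict, tokens, mask, label=1)
```

Under this definition, a rationale that fully drives the prediction yields high comprehensiveness and a sufficiency gap near zero, as the toy example does.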
7. Limitations and Prospects for Future Research
SMRA’s core advancement is the redirection of model inductive bias away from lexical patterns toward normatively and culturally meaningful evidence. This produces more interpretable, robust hate speech classifiers, particularly valuable for low-resource languages and contexts.
Notable limitations include the considerable cost of expert rationale annotation and the present restriction of evaluation to a single language and discourse domain. Directions for future research include scaling moral rationale acquisition via self-training or active learning, investigating cross-lingual transfer of supervision, integrating SMRA with LLM fine-tuning to reduce annotation burdens, and conducting extended sociological analyses to study annotator effects on moral categorization.
The SMRA framework represents a step forward in building transparent, fair classifiers whose outputs are justified by clearly articulated and normatively situated moral logic, providing auditable explanations consistent with human reasoning (Vargas et al., 7 Jan 2026).