HatePrototypes: Interpretable Hate Speech Detection
- HatePrototypes are class-level vector representations built by averaging LM hidden states to detect hate speech across domains.
- They enable efficient cross-domain transfer and parameter-free early exiting via a prototype-gap margin criterion, while maintaining robust macro-F1 scores.
- This approach provides an interpretable, data-efficient solution for both explicit and implicit hate speech detection with practical deployment benefits.
HatePrototypes are a family of class-level vector representations designed to provide interpretable, efficient, and transferable mechanisms for detecting both explicit and implicit hate speech in natural language content. This approach leverages prototype centroids computed directly from hidden states of fine-tuned LMs, enabling cross-domain transferability and parameter-free early exiting, thereby simplifying deployment and adaptation across varying moderation tasks (Proskurina et al., 9 Nov 2025).
1. Formal Definition and Prototype Construction
Given a labeled hate-speech corpus $\mathcal{D} = \{(x_i, y_i)\}$ with $y_i \in \{0, 1\}$ for the non-hate and hate classes, HatePrototypes are constructed as follows. Let $h_\ell(x_i) \in \mathbb{R}^d$ denote the hidden representation of input $x_i$ at layer $\ell$, where $d$ is the LM hidden size.
For each class $c \in \{0, 1\}$ and layer $\ell$, the prototype centroid is

$$p_c^{(\ell)} = \frac{1}{|\mathcal{D}_c|} \sum_{x_i \in \mathcal{D}_c} h_\ell(x_i),$$

where $\mathcal{D}_c$ is the subset of inputs in class $c$.
At inference, both the sample representation and the prototypes are $\ell_2$-normalized, $\hat{h}_\ell(x) = h_\ell(x) / \lVert h_\ell(x) \rVert_2$ and $\hat{p}_c^{(\ell)} = p_c^{(\ell)} / \lVert p_c^{(\ell)} \rVert_2$, and classification is performed by dot-product similarity:

$$\hat{y}(x) = \arg\max_{c \in \{0,1\}} \; \hat{h}_\ell(x) \cdot \hat{p}_c^{(\ell)}.$$
Unlike contrastive learning or clustering-based approaches, HatePrototypes do not introduce learned parameters or a prototype-alignment loss; prototypes are computed by averaging fine-tuned LM activations.
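A minimal sketch of this construction, assuming a fine-tuned Hugging Face encoder with hidden states exposed; the model name, mean pooling over tokens, and function names are illustrative rather than the authors' exact implementation.

```python
# Sketch: prototype construction and prototype-based classification.
# Assumptions: a BERT-style encoder, mean-pooled token states as h_l(x),
# and binary labels (0 = non-hate, 1 = hate).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

@torch.no_grad()
def layer_representation(texts, layer):
    """Mean-pooled hidden state h_l(x) for each input text at a given layer."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).hidden_states[layer]           # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)            # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)             # (B, d)

@torch.no_grad()
def build_prototypes(texts, labels, layer, num_classes=2):
    """Class centroids p_c: mean of h_l(x) over examples of class c, l2-normalized."""
    reps = layer_representation(texts, layer)
    labels = torch.tensor(labels)
    protos = torch.stack([reps[labels == c].mean(0) for c in range(num_classes)])
    return torch.nn.functional.normalize(protos, dim=-1)

@torch.no_grad()
def classify(texts, prototypes, layer):
    """Predict the class whose normalized prototype has the highest dot product."""
    reps = torch.nn.functional.normalize(layer_representation(texts, layer), dim=-1)
    return (reps @ prototypes.T).argmax(dim=-1)
```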
2. Data Efficiency: Minimal Prototype Sets
Empirical results demonstrate that prototypes remain robust even when constructed from very few examples. With only a handful of labeled samples per class, macro-F1 saturates and deviates by less than 2 pp from prototypes built on $500$ samples per class. Prototype selection entails random sampling, centroid calculation, and variance estimation via repeated random draws.
No additional regularization is necessary beyond normalization; the averaging suppresses noise, and degradation only becomes notable at very small per-class sample counts.
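A sketch of how such a data-efficiency check could be run, reusing `build_prototypes` and `classify` from the snippet above; the number of draws and the scikit-learn macro-F1 call are assumptions.

```python
# Sketch: build prototypes from small random per-class subsets and estimate
# variance over repeated draws (sample sizes and draw count are illustrative).
import random
import numpy as np
from sklearn.metrics import f1_score

def fewshot_f1(train_texts, train_labels, test_texts, test_labels,
               n_per_class, layer, n_draws=5, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        picked = []
        for c in set(train_labels):
            idx = [i for i, y in enumerate(train_labels) if y == c]
            picked += rng.sample(idx, n_per_class)
        protos = build_prototypes([train_texts[i] for i in picked],
                                  [train_labels[i] for i in picked], layer)
        preds = classify(test_texts, protos, layer)
        scores.append(f1_score(test_labels, preds.tolist(), average="macro"))
    return float(np.mean(scores)), float(np.std(scores))
```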
3. Transferability Across Explicit and Implicit Hate Benchmarks
HatePrototypes support transfer across benchmarks targeting explicit (surface-level abusive terms, targeted slurs) and implicit hate (demeaning comparisons, exclusionary suggestions, disguised violence). Two key transfer scenarios:
- Cross-domain: Using prototype centroids from one dataset with a model fine-tuned on another, e.g., classifying SBIC test data using SBIC prototypes and an OPT model fine-tuned on HateXplain.
- Prototype-based transfer: Using prototypes computed on a source dataset to classify samples from a target dataset, with the underlying LM fine-tuned on the source.
Transfer efficiency is measured by the relative macro-F1 ratio

$$\rho = \frac{\mathrm{F1}_{\text{cross-domain}}}{\mathrm{F1}_{\text{in-domain}}}.$$
Transfer is largely successful: BERT-based HatePrototypes retain >90% of in-domain F1 across implicit/explicit benchmarks such as IHC, SBIC, OLID, and HateXplain. OPT-based transfer is less robust, with larger drops on challenging pairs (notably IHC→SBIC).
A plausible implication is that prototype averaging over LM feature space captures generic semantic features relevant to hate speech, permitting interchangeable use across datasets with differing hate type distributions.
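A small helper illustrating how the relative macro-F1 ratio could be computed with scikit-learn; the function and argument names are illustrative.

```python
# Sketch: transfer efficiency as the ratio of cross-domain macro-F1
# to in-domain macro-F1 on the same test labels.
from sklearn.metrics import f1_score

def relative_macro_f1(y_true, preds_cross_domain, preds_in_domain):
    f1_cross = f1_score(y_true, preds_cross_domain, average="macro")
    f1_in = f1_score(y_true, preds_in_domain, average="macro")
    return f1_cross / f1_in
```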
4. Parameter-Free Early Exiting
Efficiency is enhanced via a prototype-gap margin criterion enabling an early exit before the full LM forward pass. For each layer $\ell$, define the margin

$$m_\ell = s^{(\ell)}_{(1)} - s^{(\ell)}_{(2)},$$

where $s^{(\ell)}_{(1)}$ and $s^{(\ell)}_{(2)}$ are the highest and next-highest prototype similarities at layer $\ell$. The exit rule is to stop at the lowest layer $\ell$ satisfying

$$m_\ell \geq \tau$$

for a fixed threshold $\tau$; otherwise, the full model is used. This approach requires no additional learnable parameters.
Experiments reveal a substantial average reduction in the number of forward-pass layers with negligible F1 impact. Compared to entropy-based (DeeBERT) and patience-based (PABEE) gating, prototype-gap exiting matches or surpasses their performance, particularly on explicit hate tasks.
The threshold $\tau$ is task-dependent, with different optimal values for explicit benchmarks such as HateXplain and implicit benchmarks such as SBIC.
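A sketch of the exit rule under the same assumptions as the earlier snippet; for clarity it reuses a full forward pass and only reports the layer at which the margin criterion fires, whereas a deployed implementation would run encoder layers incrementally to realize the compute savings. The threshold value and per-layer prototype container are illustrative.

```python
# Sketch: prototype-gap early exiting. prototypes_per_layer maps a layer index
# to the l2-normalized prototype matrix built with build_prototypes() above.
import torch

@torch.no_grad()
def early_exit_predict(text, prototypes_per_layer, tau=0.1):
    batch = tokenizer([text], padding=True, truncation=True, return_tensors="pt")
    hidden_states = model(**batch).hidden_states        # tuple: embeddings + each layer
    mask = batch["attention_mask"].unsqueeze(-1)
    for layer in range(1, len(hidden_states)):
        rep = (hidden_states[layer] * mask).sum(1) / mask.sum(1)
        rep = torch.nn.functional.normalize(rep, dim=-1)
        sims = (rep @ prototypes_per_layer[layer].T).squeeze(0)
        top2 = torch.topk(sims, k=2).values
        margin = (top2[0] - top2[1]).item()
        if margin >= tau:                                # exit rule: m_l >= tau
            return int(sims.argmax()), layer
    return int(sims.argmax()), len(hidden_states) - 1    # fall back to the full model
```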
5. Quantitative Performance and Experimental Setup
Models evaluated include BERT-base, OPT-125M, LLaMA-Guard-1B, and BLOOMZ-Guard-3B on IHC (implicit), SBIC (implicit), OLID (explicit), and HateXplain (explicit). Standard LM fine-tuning is performed for three epochs at a fixed learning rate with batch size 64. Prototype construction uses the training splits, with up to 500 samples per class.
Key results:
- Cross-domain F1 increases up to +28 pp (e.g., BERT: HateXplain→SBIC: +28.02 F1).
- In-domain and cross-domain pairs: prototype-based classification matches or slightly exceeds fine-tuned head performance.
- Guard models improve on explicit hate detection when swapping to HatePrototypes, e.g., LLaMA-Guard accuracy on OLID increases from 46.9% to 71.3%.
- Early exiting achieves the same F1 as the full model with up to 1.5× speed-ups.
The approach runs on a single NVIDIA A100 (80 GB) and imposes negligible deployment constraints, given its parameter-free inference.
6. Qualitative and Error Analysis
Error analysis on IHC categories identifies incitement (disguised calls for violence or solidarity) and irony (question-answer riddles with discriminatory encoding) as the most challenging for cross-domain transfer. For example, accuracy drops to 40–58% for incitement if prototypes do not encode implicit hate concept geometry.
Layer-wise analysis demonstrates that implicit hate samples require deeper semantic processing, exiting at layers 9–12, while explicit hate can often be detected before layer 8 with stable margins.
Low prototype-similarity cases often indicate out-of-distribution, adversarial, or under-represented examples. This suggests potential use of prototype similarity as an uncertainty signal for active learning or ambiguous annotation surfacing.
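A sketch of how prototype similarity could serve as such an uncertainty signal, reusing `layer_representation` from the first snippet; the ranking-by-lowest-similarity heuristic and the `k` cutoff are assumptions.

```python
# Sketch: surface the examples whose best prototype similarity is lowest,
# treating them as candidates for annotation or active learning.
import torch

@torch.no_grad()
def flag_uncertain(texts, prototypes, layer, k=20):
    reps = torch.nn.functional.normalize(layer_representation(texts, layer), dim=-1)
    max_sim = (reps @ prototypes.T).max(dim=-1).values   # best prototype similarity
    order = torch.argsort(max_sim)                        # most uncertain first
    return [(texts[int(i)], float(max_sim[i])) for i in order[:k]]
```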
7. Limitations and Directions for Future Research
Prototype-gap early exiting may degrade out-of-domain performance without careful tuning. A per-layer threshold schedule could mitigate this issue. Prototype-based transfer lags the performance of fully fine-tuned heads on harder domain pairs; learning a small alignment head atop static prototypes is a plausible avenue.
Implicit hate datasets suffer from low inter-annotator agreement, impacting prototype quality. More granular annotation would directly benefit the approach. HatePrototypes may be suited for active-learning pipelines, highlighting ambiguous or atypical examples for further annotation.
Overall, HatePrototypes offer a parameter-free, data-efficient, and interpretable solution for detecting and transferring both explicit and implicit hate speech, enabling practical deployment and insight into LM decision boundaries without repeated re-training or contrastive learning (Proskurina et al., 9 Nov 2025).