HateBERT: Domain-Adaptive Hate Speech Model
- HateBERT is a domain-adaptive neural language model retrained on banned Reddit corpora to enhance the detection of hate speech and offensive content.
- It employs a BERT-base architecture with focused masked language model pretraining, achieving 89.16% accuracy and a 0.86 macro-F1 score on cyberbullying benchmarks.
- The model’s retraining strategy highlights that leveraging abuse-centric data substantially improves NLP performance for identifying hostile online communication.
HateBERT is a domain-adaptive neural language model designed to enhance the detection of abusive language, including hate speech, insults, and offensive content, across digital social platforms. It leverages a BERT-base architecture retrained on large-scale corpora from banned Reddit communities to bias learned representations toward toxic linguistic patterns. HateBERT has demonstrated state-of-the-art performance, outperforming vanilla transformer and recurrent baselines for cyberbullying classification, particularly on tasks requiring nuanced identification of insults and abusive behaviors in social media posts (Biswas et al., 1 Apr 2024, Caselli et al., 2020). Its development formalizes the principle that continued pretraining on ecologically relevant, abuse-focused data substantially improves both the accuracy and robustness of NLP models targeting hostile online communication.
1. Model Architecture and Mathematical Formulation
HateBERT retains the original BERT-base-uncased architecture (12 transformer encoder layers, hidden size 768, 12 bidirectional self-attention heads, and a per-head dimension of 64) and does not introduce additional stacking (e.g., BiLSTM) or new tokens in the vocabulary (Biswas et al., 1 Apr 2024, Caselli et al., 2020).
Input posts are tokenized using WordPiece, with maximum sequence lengths typically in the 128–512 token range. The special tokens [CLS] and [SEP] demarcate sentence boundaries. The transformer stack computes contextualized token embeddings through multi-head scaled dot-product self-attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $X$ denotes the input representations and $Q = XW^{Q}$, $K = XW^{K}$, $V = XW^{V}$.
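To make the formula concrete, the following minimal PyTorch sketch computes single-head scaled dot-product attention over toy inputs with BERT-base-sized dimensions; the projection matrices are random placeholders, not HateBERT's trained parameters.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    # Project inputs to queries, keys, and values (Q = XW^Q, K = XW^K, V = XW^V).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.size(-1)
    # softmax(QK^T / sqrt(d_k)) V
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ V

# Toy dimensions matching BERT-base per-head sizes (d_model = 768, d_k = 64).
X = torch.randn(128, 768)                                # 128 tokens, 768-dim inputs
W_q, W_k, W_v = (torch.randn(768, 64) for _ in range(3))
out = scaled_dot_product_attention(X, W_q, W_k, W_v)     # shape (128, 64)
```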
For downstream classification, the final hidden state of the [CLS] token, $h_{\mathrm{[CLS]}} \in \mathbb{R}^{768}$, is processed by a linear layer and softmax for binary classification:

$$\hat{y} = \mathrm{softmax}(W h_{\mathrm{[CLS]}} + b),$$
with cross-entropy loss:

$$\mathcal{L} = -\sum_{c} y_c \log \hat{y}_c.$$
No additional regularization (e.g., $L_1$ or $L_2$ weight penalties) is employed (Biswas et al., 1 Apr 2024).
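A minimal sketch of this classification head, assuming the publicly released GroNLP/hateBERT checkpoint on the Hugging Face Hub; the example sentences are illustrative, and the linear head here is freshly initialized rather than the fine-tuned classifier from the cited work.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/hateBERT")
encoder = AutoModel.from_pretrained("GroNLP/hateBERT")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)   # binary head: W, b

batch = tokenizer(["you are pathetic", "have a nice day"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
labels = torch.tensor([1, 0])                    # 1 = insult, 0 = neutral

hidden = encoder(**batch).last_hidden_state      # (batch, seq_len, 768)
h_cls = hidden[:, 0, :]                          # final [CLS] representation
logits = classifier(h_cls)                       # W h_[CLS] + b
loss = F.cross_entropy(logits, labels)           # softmax + cross-entropy combined
```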
2. Data Sources and Pretraining Procedures
HateBERT extends the unsupervised pretraining regimen of BERT by conducting masked LLM (MLM) training on specialized corpora. The original BERT was pretrained on English Wikipedia and BookCorpus (aggregate 3.3 billion words). HateBERT continues pretraining on 1.5 million Reddit messages from communities that were administratively banned for promoting hate or harassment (the RAL-E dataset).
RAL-E contains 43.8 million tokens and was constructed by crawling Pushshift Reddit dumps for communities banned in 2015 (31 subreddits, e.g., r/fatpeoplehate, r/blackpeoplehate) over the period Jan 2012–June 2015 (Caselli et al., 2020). Preprocessing steps include normalization of mentions, URLs, and emojis, and collapsing whitespace (Caselli et al., 2020).
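An approximate sketch of the preprocessing steps described above; the placeholder tokens used for mentions, URLs, and emojis are assumptions, not the exact replacements from Caselli et al. (2020).

```python
import re

def normalize_reddit_message(text):
    """Approximate RAL-E-style normalization of mentions, URLs, emojis, and whitespace."""
    text = re.sub(r"https?://\S+", "URL", text)            # normalize URLs
    text = re.sub(r"/?u/[A-Za-z0-9_-]+", "USER", text)     # normalize Reddit mentions
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "EMOJI", text)  # emojis
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

print(normalize_reddit_message("lol https://x.co  /u/someone \U0001F602   test"))
```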
No stratified sampling or data filtering is performed beyond subreddit inclusion, resulting in a high-precision but potentially sociolinguistically skewed abusive register. HateBERT's retraining runs for 100 epochs (Adam optimizer, batch size 64, maximum sequence length 512), strictly minimizing the MLM objective (Caselli et al., 2020).
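A minimal sketch of this continued-pretraining recipe using the Hugging Face Trainer; `ral_e.txt` is a placeholder path standing in for the RAL-E messages (one message per line), and the 15% masking probability is the standard BERT default rather than a value reported in the cited papers.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "ral_e.txt" stands in for the RAL-E Reddit messages.
corpus = load_dataset("text", data_files={"train": "ral_e.txt"})["train"]
corpus = corpus.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens (BERT default), minimized as the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="hatebert-mlm",
                         num_train_epochs=100,
                         per_device_train_batch_size=64)
Trainer(model=model, args=args, train_dataset=corpus,
        data_collator=collator).train()
```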
3. Fine-Tuning, Hyperparameters, and Evaluation Setup
For downstream abusive language detection, HateBERT is fine-tuned on task-specific datasets, most notably the Kaggle “Detecting Insults in Social Commentary” (3,947 training tweets, 2,647 test tweets) (Biswas et al., 1 Apr 2024), as well as Twitter-based OLID, HatEval, and AbusEval benchmarks (Caselli et al., 2020). Standard preprocessing is retained (tokenization, lowercasing, removal of escape characters and stop words, sequence padding/truncation).
HateBERT adheres to standard transformer optimization configurations. While hyperparameters are not always precisely reported, typical settings are: batch size 16–32, learning rate $2\times10^{-5}$ to $5\times10^{-5}$, 2–4 epochs, AdamW optimizer (weight decay 0.01), and dropout 0.1 on the classification head (Biswas et al., 1 Apr 2024). Train/validation/test splits commonly follow a 60/20/20 ratio.
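A hedged configuration sketch using these typical ranges; the exact values from Biswas et al. are not fully reported, so the numbers below are illustrative defaults, and the sequence-classification head placed on top of GroNLP/hateBERT is newly initialized.

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

# Newly initialized binary classification head on top of the HateBERT encoder.
model = AutoModelForSequenceClassification.from_pretrained("GroNLP/hateBERT",
                                                           num_labels=2)
args = TrainingArguments(
    output_dir="hatebert-insults",
    per_device_train_batch_size=16,   # 16-32 is typical
    learning_rate=2e-5,               # 2e-5 to 5e-5 is typical
    num_train_epochs=3,               # 2-4 is typical
    weight_decay=0.01,                # AdamW weight decay
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```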
Performance metrics include per-class precision, recall, F1-score, macro-averaged F1, and overall accuracy:

$$\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad \mathrm{Recall} = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \qquad \mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}.$$
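These metrics can be computed directly with scikit-learn; the labels below are toy values for illustration only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true, y_pred):
    # Per-class precision/recall/F1 for the Neutral (0) and Insult (1) classes.
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
    # Macro-F1 averages the per-class F1 scores with equal weight.
    _, _, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"precision": prec, "recall": rec, "f1": f1,
            "macro_f1": macro_f1, "accuracy": accuracy_score(y_true, y_pred)}

print(report([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
```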
4. Empirical Results and Comparative Analysis
On the cyberbullying insult detection benchmark, HateBERT achieves a test accuracy of $89.16\%$ and macro-F1 of $0.86$, outperforming standard BERT (accuracy $83.78\%$, macro-F1 $0.78$), RoBERTa, BiLSTM variants, and prior FastText-based baselines (Biswas et al., 1 Apr 2024). The F1-score for the “Insult” class is $0.79$, compared to $0.68$ (BERT) and $0.63$–$0.68$ (BiLSTM).
Consistent performance gains are also observed on OLID (macro-F1: HateBERT $0.801$, BERT $0.794$), AbusEval, and HatEval (Caselli et al., 2020). In portability experiments, HateBERT demonstrates superior robustness when fine-tuned on one dataset and evaluated on a related but not identical phenomenon, with precision improvements for “abusive" and "hateful" positive classes (Caselli et al., 2020).
Tabulated results for insult detection (from (Biswas et al., 1 Apr 2024)):
| Model | Macro-F1 | Accuracy |
|---|---|---|
| BERT-base | 0.78 | 83.78% |
| HateBERT | 0.86 | 89.16% |
| RoBERTa | 0.79 | 85.59% |
| BiLSTM (no FE) | 0.76 | 82.18% |
| BiLSTM (FastText FE) | 0.80 | 83.32% |
| Baseline [9] | — | 82.49% |
5. Analysis of Limitations and Future Directions
The margin between “Neutral” and “Insult” detection ($0.93$ vs. $0.79$ F1) reveals persistent challenges with subtle, code-mixed, or linguistically creative abuse, likely exacerbated by class imbalance and out-of-vocabulary variance in hostile registers (Biswas et al., 1 Apr 2024). No ablation studies have probed the effect of abusive corpus size or the blend between pretraining and fine-tuning data, though these remain crucial for understanding domain-adaptive benefit.
Potential biases in HateBERT’s training substrate, reflecting extremes of abusive social discourse, may limit generalizability to less overt forms of toxicity and micro-aggression (Caselli et al., 2020). The model lacks explicit treatment for implicit, coded, or evolving slang, and does not incorporate explicit mitigation against class imbalance.
Recommended extensions include data augmentation for minority classes (e.g., paraphrasing), hybrid contextual-sequential architectures (e.g., BERT+BiLSTM), and the deployment of model output in real-time crowdsensing frameworks for rapid social risk alerts (Biswas et al., 1 Apr 2024). Cross-lingual transfer, embedding-space probing, and expansion to multimodal (e.g., meme and image) toxic content are proposed as avenues for further investigation (Caselli et al., 2020).
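As a sketch of the suggested hybrid contextual-sequential (BERT+BiLSTM) architecture, the following module feeds HateBERT token representations through a BiLSTM before classification; the layer sizes and pooling choice are assumptions for illustration, not taken from the cited papers.

```python
import torch
from transformers import AutoModel

class HateBertBiLSTM(torch.nn.Module):
    """HateBERT token representations re-encoded by a BiLSTM before classification."""
    def __init__(self, num_labels=2, lstm_hidden=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("GroNLP/hateBERT")
        self.bilstm = torch.nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                                    batch_first=True, bidirectional=True)
        self.classifier = torch.nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        seq, _ = self.bilstm(tokens)              # (batch, seq_len, 2*lstm_hidden)
        return self.classifier(seq[:, 0, :])      # logits from the [CLS] position
```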
6. Comparative Context with fBERT and Related Models
fBERT (“fBERT: A Neural Transformer for Identifying Offensive Content” (Sarkar et al., 2021)) was similarly retrained for abusive language detection but uses the SOLID corpus (∼1.45M Twitter instances with weak supervision) and demonstrates empirically higher macro-F1 than HateBERT across the HatEval, OLID, and Davidson hate speech tasks ($0.596$ vs. $0.525$ on HatEval English, $0.813$ vs. $0.801$ on OLID, and $0.878$ vs. $0.846$ on Davidson) (Sarkar et al., 2021).
This suggests that larger, more diverse weakly-supervised corpora (e.g., SOLID) confer greater generalization and recall for broad offensive content than domain-specific (Reddit-only) abuse registers. fBERT’s thresholding on the SOLID offensive score for instance inclusion supports a strategic tradeoff between data quantity and annotation purity, with performance degrading at higher (stricter) thresholds (Sarkar et al., 2021).
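The thresholding strategy amounts to simple score-based filtering of the weakly labeled corpus; the sketch below uses an assumed `avg_offensive_score` column name and toy data rather than SOLID's actual schema.

```python
import pandas as pd

# Toy stand-in for SOLID-style weakly labeled instances; column names are assumed.
solid = pd.DataFrame({
    "text": ["example offensive post", "example benign post"],
    "avg_offensive_score": [0.72, 0.31],
})

def select_for_retraining(df, threshold):
    # Higher thresholds keep purer but fewer instances for continued pretraining.
    return df[df["avg_offensive_score"] >= threshold]

print(len(select_for_retraining(solid, 0.5)), len(select_for_retraining(solid, 0.9)))
```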
In summary, HateBERT exemplifies a highly efficient repurposing of BERT’s general linguistic representation through targeted retraining on abuse-intensive data, yielding robust classifiers for cyberbullying and hate speech. Its performance is competitive but not dominant relative to models like fBERT that exploit larger, broader corpora for domain-adaptive pretraining.