- The paper introduces HateBERT, a re-trained version of BERT adapted with roughly 1.5 million messages from banned Reddit communities for detecting abusive language in English.
- The methodology further pre-trains BERT on the RAL-E dataset with the Masked Language Model objective; the resulting model outperforms the generic BERT baseline on OffensEval, AbusEval, and HatEval.
- The study demonstrates that targeted model adaptation enhances detection accuracy, generalizability, and portability, setting a new trajectory for automated abusive language detection.
HateBERT: Retraining BERT for Abusive Language Detection in English
The paper "HateBERT: Retraining BERT for Abusive Language Detection in English" presents a focused paper on developing a BERT-based model specifically for the task of detecting abusive language phenomena in the English language. Recognizing the challenges posed by general-purpose LLMs when applied to domain-specific tasks such as abusive language detection, the authors address this by introducing HateBERT—an adapted version of BERT retrained on a dataset comprising abusive language content from banned Reddit communities.
Methodology and Datasets
HateBERT was created by further pre-training the base BERT model on the Reddit Abusive Language English (RAL-E) dataset. RAL-E contains approximately 1.5 million messages collected from subreddits banned for hosting offensive and abusive content, which makes it well suited to shifting the model toward this register of language. The retraining uses the Masked Language Model (MLM) objective, so the adapted model absorbs the vocabulary and phrasing typical of abusive online interactions rather than being trained directly on any labeled detection task.
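Concretely, this domain-adaptive step corresponds to standard MLM training over a plain-text corpus. The sketch below illustrates the idea with the Hugging Face transformers and datasets libraries; the corpus file name, batch size, and epoch count are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: domain-adaptive MLM pre-training of bert-base-uncased on a text corpus.
# File path and hyperparameters are illustrative, not the paper's exact settings.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One Reddit message per line (hypothetical local copy of a RAL-E-style corpus).
raw = load_dataset("text", data_files={"train": "ral_e_messages.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard MLM setup.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="hatebert-like",
    per_device_train_batch_size=32,
    num_train_epochs=1,          # illustrative; the paper trained far longer
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```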
HateBERT's effectiveness was evaluated across three established benchmark datasets (a hedged fine-tuning sketch follows the list):
- OffensEval 2019: A dataset consisting of Twitter posts labeled for offensive content.
- AbusEval: Derived from the OffensEval data by adding an annotation layer for abusive language.
- HatEval: A collection annotated for hate speech, focusing on hateful language directed at specific groups such as women and immigrants.
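For each benchmark, the adapted model is fine-tuned as an ordinary sequence classifier and scored on that dataset's test split. Below is a minimal fine-tuning sketch, assuming the released checkpoint is available on the Hugging Face hub as GroNLP/hateBERT and that the data has been exported to CSV files with text and label columns; the checkpoint name, file names, and hyperparameters are assumptions, not the paper's exact setup.

```python
# Sketch: fine-tuning a HateBERT-style checkpoint for binary offensive-language
# classification, in the spirit of the OffensEval setup. Checkpoint name, file
# names, and hyperparameters are assumptions, not the paper's configuration.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

checkpoint = "GroNLP/hateBERT"  # assumed public checkpoint; any retrained model works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Hypothetical CSVs with columns "text" and "label" (0 = not offensive, 1 = offensive).
data = load_dataset(
    "csv",
    data_files={"train": "offenseval_train.csv", "test": "offenseval_test.csv"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="hatebert-offenseval",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model, args=args, train_dataset=data["train"], eval_dataset=data["test"]
)
trainer.train()
trainer.save_model()       # writes the fine-tuned model to output_dir
print(trainer.evaluate())  # reports eval loss; add compute_metrics for macro F1
```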
Results and Implications
The experimental results show that HateBERT consistently outperforms the generic BERT baseline across all three datasets. This highlights HateBERT's robustness in recognizing different forms of abusive language, improving performance not only on the general detection of offensive content but also on more specific tasks such as identifying abusive or hateful speech.
The paper also examines the portability of HateBERT across abusive language phenomena, testing how well a model fine-tuned on one dataset generalizes to another. HateBERT shows better portability than generic BERT, particularly when fine-tuned on datasets annotated for the broader, more general phenomenon. The results suggest that further pre-training lets the model internalize the linguistic nuances tied to abusive content.
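A portability experiment of this kind amounts to scoring a model fine-tuned on one dataset against another dataset's test split, with no further training. The sketch below illustrates this, reusing the hypothetical model directory from the fine-tuning sketch above and a hypothetical AbusEval-style test CSV; macro F1 is the measure the paper reports.

```python
# Sketch: out-of-dataset (portability) evaluation. A model fine-tuned on one
# phenomenon (e.g. OffensEval) is scored on another dataset's test split
# (e.g. AbusEval) without further training. Directory and file names are hypothetical.
import numpy as np
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
from datasets import load_dataset

base = "GroNLP/hateBERT"           # assumed public checkpoint (source of the tokenizer)
model_dir = "hatebert-offenseval"  # hypothetical output of the fine-tuning sketch above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

test = load_dataset("csv", data_files={"test": "abuseval_test.csv"})["test"]
test = test.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

# Trainer.predict runs inference only; no gradient updates are made.
output = Trainer(model=model).predict(test)
predictions = np.argmax(output.predictions, axis=-1)
print("macro F1:", f1_score(output.label_ids, predictions, average="macro"))
```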
Future Directions
The development of HateBERT underscores the importance and viability of adapting pre-trained models to specific sub-domains within NLP, such as abusive language detection, and invites future work extending the approach to other specialized areas of natural language understanding. Future studies could analyze how the representations learned by HateBERT differ from those of the generic BERT model, and evaluate the model in varied real-world abusive language scenarios.
In conclusion, HateBERT sets a precedent for enhancing pre-trained language models through targeted retraining, advancing the field's ability to handle specific yet pervasive problems in online discourse. The research delineates a clear trajectory for further improvements in automatic abusive language detection, fostering refined methods for more effective monitoring and mitigation of hostile language in digital spaces.