An Analysis of "HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection"
The paper "HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection" presents a nuanced approach to addressing the challenge of detecting hate speech on social media platforms. The authors introduce "HateXplain," a novel dataset constructed to facilitate more interpretable machine learning models in hate speech detection. The dataset is designed to enhance both the performance and explainability of automated systems, addressing current gaps in bias mitigation and interpretative transparency.
Dataset Characteristics
HateXplain distinguishes itself by annotating each post along three dimensions:
- Classification: Posts are labeled as hate, offensive, or normal.
- Target Community Identification: The dataset identifies the targeted community, offering insights into which groups are frequently attacked.
- Rationales: Annotators highlight the specific words or phrases in the post that support their label, providing word-level rationales for the classification decision.
The dataset comprises approximately 20,000 posts sourced from Twitter and Gab. This multi-layered annotation guides models not only on the "what" but also on the "why," opening the door to explainable AI in this domain.
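To make the annotation scheme concrete, the snippet below sketches what a single record might look like and a typical preprocessing step. The field names and values are hypothetical, chosen only to mirror the three annotation layers described above; the released dataset's exact schema may differ.

```python
from collections import Counter

# Illustrative example of a single HateXplain-style record (hypothetical fields).
example_post = {
    "post_id": "example_0001",
    "post_tokens": ["i", "hate", "those", "people"],
    # One 3-way label per annotator.
    "labels": ["hatespeech", "hatespeech", "offensive"],
    # Target communities marked by each annotator (hypothetical value).
    "targets": [["Other"], ["Other"], ["Other"]],
    # One binary mask per annotator: 1 marks a token supporting the label.
    "rationales": [
        [0, 1, 1, 1],
        [0, 1, 0, 1],
        [0, 1, 1, 1],
    ],
}

# Typical preprocessing: majority-vote the label and average the rationale masks
# into a soft token-level ground truth that can later supervise model attention.
majority_label = Counter(example_post["labels"]).most_common(1)[0][0]
n_annotators = len(example_post["rationales"])
soft_rationale = [sum(col) / n_annotators for col in zip(*example_post["rationales"])]
print(majority_label, soft_rationale)  # -> 'hatespeech', [0.0, 1.0, 0.67, 1.0] (approx.)
```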
Model Insights
The authors benchmark several existing models, including CNN-GRU, BiRNN, and BERT, trained both with and without the human rationales as additional supervision. Models trained with rationales show improvements in plausibility and faithfulness, but no significant advantage on overall classification metrics such as accuracy or F1-score. Notably, incorporating the rationales also helps mitigate unintended bias against specific target communities.
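One common way to inject rationales into training is to supervise the model's attention weights with the averaged human rationale mask. The sketch below shows this idea at a high level; the combined loss form and the weighting hyperparameter `lam` are assumptions for illustration, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def rationale_supervised_loss(logits, labels, attention, rationale, lam=1.0):
    """Classification loss plus an attention-supervision term.

    logits:    (batch, num_classes) class scores from the model
    labels:    (batch,) gold class indices
    attention: (batch, seq_len) attention weights the model places on tokens
    rationale: (batch, seq_len) averaged human rationale mask (0/1 or soft values)
    lam:       weight of the attention term (hyperparameter, assumed value)
    """
    # Standard classification loss.
    cls_loss = F.cross_entropy(logits, labels)

    # Normalize the human rationale into a distribution over tokens; posts with
    # no highlighted tokens (e.g. "normal") contribute zero to this term.
    target = rationale / rationale.sum(dim=1, keepdim=True).clamp(min=1e-8)

    # Cross-entropy between the target distribution and the model's attention.
    attn_loss = -(target * attention.clamp(min=1e-8).log()).sum(dim=1).mean()

    return cls_loss + lam * attn_loss
```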
Bias and Explainability Metrics
The paper evaluates the models on two further aspects:
- Bias Reduction: Unintended bias is quantified with AUC-based metrics computed over posts that mention each target community, measuring how well models avoid penalizing content simply because it references a particular community. Models trained with rationales reduce this unintended bias noticeably; a sketch of the subgroup AUC computation appears after this list.
- Explainability: Assessed via plausibility (how convincing the rationale is to humans, measured with IOU F1, token-level F1, and AUPRC against the human annotations) and faithfulness (how well the rationale reflects the model's actual decision process, measured via comprehensiveness and sufficiency). A notable observation is that the explainability scores drop for the BERT-based models despite their strong classification performance, indicating room for improvement in this area; a simplified plausibility computation is also sketched below.
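The AUC-based bias metrics follow the style of the unintended-bias framework of Borkan et al.; the sketch below computes the subgroup AUC with scikit-learn on made-up data, assuming a binary toxic/non-toxic view of the labels. Companion scores such as BPSN and BNSP are obtained the same way by changing which background examples are paired with the subgroup's negatives or positives.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(y_true, y_score, in_subgroup):
    """ROC-AUC restricted to posts that mention a given target community.

    y_true:      1 if the post is labeled toxic (hate/offensive), else 0
    y_score:     model's toxicity score for the post
    in_subgroup: True if the post references the community of interest
    """
    mask = np.asarray(in_subgroup, dtype=bool)
    return roc_auc_score(np.asarray(y_true)[mask], np.asarray(y_score)[mask])

# Toy usage with made-up labels and scores.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]
in_subgroup = [True, True, False, True, False, True]
print(subgroup_auc(y_true, y_score, in_subgroup))
```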
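For plausibility, the token F1 and IOU scores mentioned above compare the tokens a model highlights with the human rationales. The sketch below shows a simplified per-example version; the paper follows the ERASER benchmark's exact thresholding and aggregation, and the faithfulness metrics (comprehensiveness and sufficiency) additionally require re-running the model on perturbed inputs, which is omitted here.

```python
def token_f1(pred_tokens, gold_tokens):
    """Token-level F1 between predicted and human rationale token positions."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def iou(pred_tokens, gold_tokens):
    """Intersection over union of the two rationale token sets; an ERASER-style
    IOU F1 treats an example as matched when this overlap exceeds 0.5."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0

# Toy usage: the model highlights tokens 1-3, human annotators highlighted 1 and 3.
print(token_f1([1, 2, 3], [1, 3]), iou([1, 2, 3], [1, 3]))  # 0.8, 0.67 (approx.)
```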
Implications and Future Research
HateXplain provides a foundational resource for the community: a benchmark that couples classification labels with rationale annotations. The implications extend to improving transparency in machine learning models and to supporting compliance with regulations such as the GDPR, which push for explanations of automated decision-making.
The dataset's inclusion of target-community information makes unintended bias measurable and easier to reduce, a factor critically important when algorithms are applied in sensitive social contexts. Moreover, the interpretability results underline the need for algorithms that can justify their decisions in a human-comprehensible manner.
Future work could extend the benchmark to other languages and explore more deeply the integration of contextual information around a given post, such as user history or network interactions. Building on this work could enable approaches to hate speech detection that are both fairer to users and more insightful for researchers.
In summary, this paper provides a valuable asset for hate speech detection research, emphasizing the need for both performance and interpretability. By targeting specific gaps in current methodologies, HateXplain sets a precedent for responsible AI development in sensitive computational linguistics applications.