CONAN -- COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (1910.03270v1)

Published 8 Oct 2019 in cs.CL and cs.CY

Abstract: Although there is an unprecedented effort to provide adequate responses in terms of laws and policies to hate content on social media platforms, dealing with hatred online is still a tough problem. Tackling hate speech in the standard way of content deletion or user suspension may be charged with censorship and overblocking. One alternate strategy, that has received little attention so far by the research community, is to actually oppose hate content with counter-narratives (i.e. informed textual responses). In this paper, we describe the creation of the first large-scale, multilingual, expert-based dataset of hate speech/counter-narrative pairs. This dataset has been built with the effort of more than 100 operators from three different NGOs that applied their training and expertise to the task. Together with the collected data we also provide additional annotations about expert demographics, hate and response type, and data augmentation through translation and paraphrasing. Finally, we provide initial experiments to assess the quality of our data.

An Analysis of the CONAN Dataset: Advancing Counter-Narratives in Combating Online Hate Speech

The paper under review, titled "CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech", presents a novel dataset of counter-narratives aimed at mitigating online hate speech. The authors curated the dataset in three languages (English, French, and Italian) through a process called nichesourcing, which draws on the expertise of trained operators within Non-Governmental Organizations (NGOs) so that the counter-responses are both factually grounded and conducive to constructive dialogue.

Dataset Development

The authors identified a crucial gap in available resources for combating online hate speech: the lack of diverse, multilingual datasets that couple hate speech instances with effective counter-narratives. To address this, they engaged over 100 operators from three NGOs, dedicating over 500 person-hours to generate a dataset containing 4078 hate speech/counter-narrative pairs. Each expert was instructed to generate responses rooted in credible facts and non-escalatory language, crucial elements for maintaining constructive discourse.

The dataset is noteworthy for several attributes underpinning future research:

  • Copy-free Accessibility: Unlike datasets that rely on ephemeral social media identifiers (e.g., tweet IDs), which can lead to significant data loss over time, CONAN stores the text itself, ensuring persistent access.
  • Multilingual Format: The data spans three languages natively, with translations provided, which facilitates cross-lingual research and comparison of approaches to countering hate speech.
  • Expert-Verified Responses: The counter-narratives are grounded in the operational expertise of individuals educated and trained in the nuances of hate dismantling, providing a depth of quality not often found in datasets generated by the general online populace.

The authors also augment the dataset through translation and paraphrasing to increase the volume of training data, a necessary step for data-hungry deep learning models, and annotate it with metadata including expert demographics, hate speech sub-categories, and response type.
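As a rough illustration of how one hate speech/counter-narrative pair and its annotations might be represented in code, here is a minimal Python sketch. The class name, field names, label values, and placeholder texts are all hypothetical and chosen for illustration only; they are not the schema of the released CONAN files.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class CounterNarrativePair:
    """One hate speech/counter-narrative pair (hypothetical schema)."""
    pair_id: str
    language: str                       # e.g. "en", "fr", "it"
    hate_speech: str
    counter_narrative: str
    hate_subtopic: Optional[str] = None     # annotated hate sub-category
    response_type: Optional[str] = None     # annotated counter-narrative type
    operator_gender: Optional[str] = None   # expert demographic metadata
    source: str = "original"            # "original", "translated", "paraphrased"


def count_by(pairs: Iterable[CounterNarrativePair], field: str) -> Counter:
    """Count pairs by the value of one metadata field."""
    return Counter(getattr(p, field) for p in pairs)


def originals_only(pairs: Iterable[CounterNarrativePair]) -> List[CounterNarrativePair]:
    """Drop augmented (translated or paraphrased) pairs."""
    return [p for p in pairs if p.source == "original"]


# Placeholder records; the texts and labels are invented, not taken from CONAN.
pairs = [
    CounterNarrativePair("p1", "en", "<hate speech text>", "<expert response>",
                         hate_subtopic="culture", response_type="facts"),
    CounterNarrativePair("p1-para", "en", "<paraphrased hate speech>", "<expert response>",
                         hate_subtopic="culture", response_type="facts",
                         source="paraphrased"),
    CounterNarrativePair("p2", "it", "<hate speech text>", "<expert response>",
                         hate_subtopic="economics", response_type="question"),
]

print(count_by(pairs, "language"))
print(len(originals_only(pairs)))
```

Keeping the augmentation source as an explicit field makes it straightforward to train on the enlarged set while evaluating only on expert-written originals.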

Evaluation

To assess the quality of the dataset, the authors conducted a series of experiments. An evaluation of counter-narrative relevance, based on comparison with natural hate tweets from Twitter, indicates a substantial benefit from data augmentation (a 9% improvement in relevance). Further analysis revealed demographic influences on counter-narrative preferences, suggesting that personalization could enhance efficacy. Notably, in the same-gender configuration, where the annotator and the author of the counter-narrative share the same gender, counter-narratives were preferred more frequently. This points to potential biases in perceived message effectiveness, a significant insight for tailoring automated tools to audience demographics.

Implications and Future Work

CONAN contributes significantly to counter-speech research by meeting a clear need for structured, high-quality datasets. Methods for automatic counter-narrative generation could advance considerably by leveraging this dataset, possibly enhancing the capacity of social media platforms (SMPs) to respond to online hate speech quickly, effectively, and ethically.

The paper suggests that ongoing work will extend the dataset to other targets of hate speech, such as migrants and the LGBTQ+ community, providing broader applicability. Future development includes integrating the dataset into automated counter-narrative generation systems, potentially transforming how hate speech is tackled at scale. Given the rising trend in online hate, such tools would be valuable for NGOs, allowing them to focus resources on strategic dialogue and education.

Overall, the CONAN dataset represents a critical step towards systematic, empirically grounded techniques for countering online hate, and it opens research prospects in NLP, hate speech moderation, and human-computer interaction.

Authors (4)
  1. Y. L. Chung (1 paper)
  2. E. Kuzmenko (1 paper)
  3. S. S. Tekiroglu (1 paper)
  4. M. Guerini (1 paper)
Citations (188)