
SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification (2004.14454v2)

Published 29 Apr 2020 in cs.CL

Abstract: The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited in size and it might be biased towards offensive language as it was collected using keywords. In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner. SOLID contains over nine million English tweets labeled in a semi-supervised fashion. We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models, especially for the lower levels of the taxonomy.

Authors (5)
  1. Sara Rosenthal (21 papers)
  2. Pepa Atanasova (27 papers)
  3. Georgi Karadzhov (20 papers)
  4. Marcos Zampieri (94 papers)
  5. Preslav Nakov (253 papers)
Citations (155)