TEET! Tunisian Dataset for Toxic Speech Detection (2110.05287v1)

Published 11 Oct 2021 in cs.CL and cs.AI

Abstract: The complete freedom of expression on social media has its costs, especially the spread of harmful and abusive content that may induce people to act accordingly. Automatically detecting such content has therefore become an urgent task, one that would make efforts to limit this toxic spread more efficient. Compared to other Arabic dialects, which are mostly based on Modern Standard Arabic (MSA), the Tunisian dialect is a combination of several languages, including MSA, Tamazight, Italian and French. Because of this linguistic richness, dealing with NLP problems can be challenging, particularly given the lack of large annotated datasets. In this paper we introduce a new annotated dataset composed of approximately 10k comments. We provide an in-depth exploration of its vocabulary through feature engineering approaches, as well as the classification performance of machine learning classifiers such as NB and SVM and deep learning models such as ARBERT, MARBERT and XLM-R.

Authors (4)
  1. Slim Gharbi (1 paper)
  2. Heger Arfaoui (1 paper)
  3. Hatem Haddad (8 papers)
  4. Mayssa Kchaou (1 paper)
Citations (4)
