
Offensive Language Identification in Transliterated and Code-Mixed Bangla (2311.15023v1)

Published 25 Nov 2023 in cs.CL

Abstract: Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT, achieve the best performance on this dataset.

Authors (7)
  1. Md Nishat Raihan
  2. Umma Hani Tanmoy
  3. Anika Binte Islam
  4. Kai North
  5. Tharindu Ranasinghe
  6. Antonios Anastasopoulos
  7. Marcos Zampieri