Hate Speech and Offensive Content Detection in Indo-Aryan Languages: A Battle of LSTM and Transformers (2312.05671v1)
Abstract: Social media platforms give people of all ages an accessible outlet for expressing their thoughts and experiences, which produces a large volume of user-generated data. While these platforms enable free expression, they also present significant challenges, including the proliferation of hate speech and offensive content. Such language disrupts objective discourse and can radicalize debates, ultimately threatening democratic values. Consequently, organizations have taken steps to monitor and curb abusive behavior, which calls for automated methods of identifying suspicious posts. This paper contributes to the Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) 2023 shared task track. We, team Z-AGI Labs, conduct a comparative analysis of hate speech classification across five languages: Bengali, Assamese, Bodo, Sinhala, and Gujarati. Our study covers a range of models, including pre-trained BERT variants and XLM-R as well as LSTM models, and assesses how well each identifies hate speech in these languages. Model performance varies by language. BERT Base Multilingual Cased is a strong performer, achieving an F1 score of 0.67027 for Bengali and 0.70525 for Assamese, and it outperforms the other models on Bodo with an F1 score of 0.83009. For Sinhala, XLM-R performs best with an F1 score of 0.83493, whereas for Gujarati a custom LSTM-based model performs best with an F1 score of 0.76601. This study offers insight into the suitability of various pre-trained models for hate speech detection in multilingual settings. By considering the nuances of each language, our research supports informed model selection for building robust hate speech detection systems.
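The F1 scores quoted in the abstract can be made concrete with a small worked computation. The sketch below is illustrative only. The abstract does not say which averaging is used, so macro-averaging over classes is an assumption here, as are the label names `HOF` and `NOT` and the helper names `f1_for_label` and `macro_f1`.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, see lead-in)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


if __name__ == "__main__":
    # Hypothetical gold labels and predictions, not data from the paper.
    gold = ["HOF", "HOF", "NOT", "NOT"]
    pred = ["HOF", "NOT", "NOT", "NOT"]
    print(round(macro_f1(gold, pred), 5))
```

Macro-averaging weights each class equally regardless of how often it occurs, so a model that ignores a rare hateful class is penalized more than it would be under plain accuracy.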
Authors:
- Nikhil Narayan
- Mrutyunjay Biswal
- Pramod Goyal
- Abhranta Panigrahi