
Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages (2310.02249v1)

Published 3 Oct 2023 in cs.CL and cs.LG

Abstract: In our increasingly interconnected digital world, social media platforms have become powerful channels for disseminating hate speech and offensive content. This work addresses hate speech detection in three low-resource Indian languages: Bengali, Assamese, and Gujarati. The problem is framed as a text classification task: deciding whether a tweet is offensive or non-offensive. Using the HASOC 2023 datasets, we fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech. Our findings underscore the superiority of monolingual sentence-BERT models, particularly in Bengali, where we achieved the highest ranking. Performance in Assamese and Gujarati, however, leaves room for improvement. Our goal is to foster inclusive online spaces by countering the proliferation of hate speech.

Authors (2)
  1. Ananya Joshi
  2. Raviraj Joshi