OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification (2310.18387v2)

Published 27 Oct 2023 in cs.CL and cs.AI

Abstract: Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several works have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages. We experiment with several models on this dataset and observe that BanglishBERT outperforms other transformer-based models and GPT-3.5.

Authors (5)
  1. Dhiman Goswami (16 papers)
  2. Md Nishat Raihan (14 papers)
  3. Antara Mahmud (4 papers)
  4. Antonios Anastasopoulos (111 papers)
  5. Marcos Zampieri (94 papers)