OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification (2310.18387v2)
Published 27 Oct 2023 in cs.CL and cs.AI
Abstract: Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several studies have built datasets and performed downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages. We experiment with several models on this dataset and observe that BanglishBERT outperforms other transformer-based models and GPT-3.5.
- Dhiman Goswami
- Md Nishat Raihan
- Antara Mahmud
- Antonios Anastasopoulos
- Marcos Zampieri