SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis (2310.18023v2)
Published 27 Oct 2023 in cs.CL
Abstract: Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing with multiple languages, most available datasets contain code-mixing between only two languages. In this paper, we introduce SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data between three languages: Bangla, English, and Hindi. We carry out a comprehensive evaluation using SentMix-3L. We show that zero-shot prompting with GPT-3.5 outperforms all transformer-based models on SentMix-3L.
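To make "zero-shot prompting" concrete, here is a minimal sketch of how a sentiment prompt without in-context examples might be built and how a model reply might be mapped back to a label. The prompt wording, the label set (`positive`, `negative`, `neutral`), and the helper names `build_zero_shot_prompt` and `parse_label` are illustrative assumptions, not the authors' setup; the abstract does not specify any of them. No model is called here; the reply strings stand in for whatever a chat model would return.

```python
# Hypothetical label set: the abstract does not list the dataset's labels.
LABELS = ("positive", "negative", "neutral")


def build_zero_shot_prompt(text: str, labels=LABELS) -> str:
    """Build an instruction prompt that contains no labeled examples."""
    options = ", ".join(labels)
    return (
        "The following text mixes Bangla, English, and Hindi.\n"
        f"Classify its sentiment as one of: {options}.\n"
        "Answer with a single word.\n\n"
        f"Text: {text}\n"
        "Sentiment:"
    )


def parse_label(reply: str, labels=LABELS, default: str = "neutral") -> str:
    """Map a free-form model reply onto the first known label it mentions."""
    cleaned = reply.strip().lower()
    for label in labels:
        if label in cleaned:
            return label
    # Fall back to a default when the reply names no known label.
    return default


if __name__ == "__main__":
    # Placeholder input, not a sample from SentMix-3L.
    prompt = build_zero_shot_prompt("<code-mixed sentence goes here>")
    print(prompt)
    print(parse_label(" Negative."))
```

The prompt string would be sent to a chat model as a single user message; falling back to a default label keeps evaluation from failing on replies that do not name any label, at the cost of possibly biasing results toward that default.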
Authors:
- Md Nishat Raihan
- Dhiman Goswami
- Antara Mahmud
- Antonios Anastasopoulos
- Marcos Zampieri