SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis (2310.18023v2)

Published 27 Oct 2023 in cs.CL

Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although code-mixing among more than two languages is commonly observed, most available datasets contain code-mixed data between only two languages. In this paper, we introduce SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data among three languages: Bangla, English, and Hindi. We carry out a comprehensive evaluation using SentMix-3L. We show that zero-shot prompting with GPT-3.5 outperforms all transformer-based models on SentMix-3L.
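The zero-shot setup the abstract refers to can be illustrated with a minimal sketch: build a single classification prompt for a code-mixed input and map the model's free-text completion back onto the three sentiment labels. The prompt wording, label set, and helper names below are assumptions for illustration, not the paper's exact prompt.

```python
# Minimal sketch of zero-shot sentiment prompting (assumed prompt wording,
# not the paper's exact setup). The completion would come from an LLM API;
# here we only show prompt construction and label parsing.

LABELS = ("positive", "negative", "neutral")

def build_prompt(text: str) -> str:
    """Build a zero-shot prompt asking for a single sentiment label."""
    return (
        "Classify the sentiment of the following Bangla-English-Hindi "
        "code-mixed text as positive, negative, or neutral. "
        "Answer with one word.\n\n"
        f"Text: {text}\nSentiment:"
    )

def parse_label(completion: str) -> str:
    """Map a model completion onto one of the three labels.

    Falls back to 'neutral' when no label is recognized.
    """
    answer = completion.strip().lower()
    for label in LABELS:
        if answer.startswith(label):
            return label
    return "neutral"
```

In practice, `build_prompt(...)` would be sent to the model (e.g., GPT-3.5 via an API) with a low temperature, and `parse_label(...)` applied to the returned text before computing evaluation metrics.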

Authors (5)
  1. Md Nishat Raihan (14 papers)
  2. Dhiman Goswami (16 papers)
  3. Antara Mahmud (4 papers)
  4. Antonios Anastasopoulos (111 papers)
  5. Marcos Zampieri (94 papers)
Citations (10)
