
Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi (2309.10272v2)

Published 19 Sep 2023 in cs.CL

Abstract: Text classification is one of the most popular downstream tasks in Natural Language Processing, and it becomes considerably harder when the texts are code-mixed. Although BERT models are not exposed to such text during pre-training, they have demonstrated success on code-mixed NLP challenges. To enhance performance further, code-mixed NLP models have typically relied on combining synthetic data with real-world data. It is therefore important to understand how BERT models' performance is affected when they are pre-trained on the corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model further fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models like mBERT and XLM-R. Our two-tiered pre-training approach offers an efficient alternative for multilingual and code-mixed language understanding, contributing to advancements in the field.
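
The two-tiered recipe the abstract describes (continued masked-language-model pre-training on monolingual Bangla, English, and Hindi text, followed by the same MLM step on code-mixed text) can be sketched with Hugging Face Transformers. The sketch below is illustrative only: distilbert-base-multilingual-cased is a real public checkpoint, but the corpus file name and the hyperparameters are assumptions, not the authors' actual settings.

```python
# Minimal sketch of the two-tiered pre-training approach, assuming a
# plain-text corpus file; hyperparameters are illustrative, not the
# paper's exact settings.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-multilingual-cased"  # real public checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# "corpus.txt" is a hypothetical file name: the Bangla/English/Hindi
# corpus in tier 1, then the code-mixed corpus in tier 2.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masked-language-model objective (15% masking).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tri-distil-bert",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # tier 1; rerun with code-mixed data to obtain the tier-2 model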

Authors (3)
  1. Md Nishat Raihan (14 papers)
  2. Dhiman Goswami (16 papers)
  3. Antara Mahmud (4 papers)
Citations (1)