Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi (2309.10272v2)
Abstract: Text classification is one of the most popular downstream tasks in Natural Language Processing, and it becomes considerably harder when the texts are code-mixed. Although they are not exposed to such text during pre-training, various BERT models have shown success on code-mixed NLP tasks, often relying on a combination of synthetic and real-world data to boost performance. It is therefore important to understand how a BERT model's performance changes when it is pre-trained on the corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model further fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and achieve competitive performance against larger models such as mBERT and XLM-R. Our two-tiered pre-training approach offers an efficient alternative for multilingual and code-mixed language understanding, contributing to advancements in the field.
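The two-tiered approach described in the abstract amounts to two successive rounds of masked-language-model training: first on monolingual Bangla, English, and Hindi text (Tri-Distil-BERT), then on code-mixed text (Mixed-Distil-BERT). The sketch below illustrates what the first stage could look like; it assumes the Hugging Face transformers and datasets libraries, the distilbert-base-multilingual-cased checkpoint as the starting point, and a placeholder corpus file, and is not the authors' released training script.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Stage 1: continued masked-language-model pre-training on monolingual
# Bangla, English, and Hindi text (the Tri-Distil-BERT stage).
# The base checkpoint and corpus file below are assumptions for illustration.
base_checkpoint = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Hypothetical plain-text corpus mixing the three languages, one sentence per line.
raw = load_dataset("text", data_files={"train": "bn_en_hi_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tri-distil-bert",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()

# Stage 2 (the Mixed-Distil-BERT stage) would repeat the same MLM training,
# starting from the "tri-distil-bert" checkpoint and swapping in a code-mixed corpus.
```

The second stage reuses the same objective and tokenizer, so the only change is the starting checkpoint and the training corpus.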
Authors: Md Nishat Raihan, Dhiman Goswami, Antara Mahmud