Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data (2402.04542v1)
Abstract: Code-switching entails mixing multiple languages and is increasingly common in social media texts. Code-mixed texts are usually written in a single script, even though the languages involved have different scripts. Pre-trained multilingual models, however, are trained primarily on data in the native script of each language. Existing studies use code-switched texts as they are, but representing each language in its native script can yield better text representations because it draws on the pre-trained knowledge. This study therefore proposes a cross-language-script knowledge-sharing architecture that applies cross-attention and alignment to the representations of the text in the individual language scripts. Experimental results on two datasets, containing Nepali-English and Hindi-English code-switched texts, demonstrate the effectiveness of the proposed method. Interpreting the model with a model explainability technique illustrates the sharing of language-specific knowledge between the language-specific representations.
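The abstract names two ingredients, cross-attention between the script-specific representations and an alignment of those representations, without giving details here. The following is a minimal sketch of how such a pairing could look, not the authors' implementation. Everything in it is an assumption made for illustration: the random arrays stand in for encoder outputs, the attention has no learned projections, and the alignment term is a simple cosine distance between mean-pooled vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query token attends over
    the tokens of the other script's representation.
    (No learned projection matrices; an illustrative simplification.)"""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (n_q, n_kv)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

def alignment_loss(a, b):
    """1 - cosine similarity of mean-pooled representations.
    (An assumed choice of alignment objective, not taken from the paper.)"""
    a_vec, b_vec = a.mean(axis=0), b.mean(axis=0)
    cos = a_vec @ b_vec / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec))
    return 1.0 - cos

rng = np.random.default_rng(0)
# Hypothetical stand-ins for encoder outputs (tokens x hidden size).
roman_repr = rng.normal(size=(5, 8))    # text as written, in Roman script
native_repr = rng.normal(size=(6, 8))   # same text in the native script

# Each view queries the other, so knowledge flows in both directions.
roman_enriched, attn = cross_attention(roman_repr, native_repr)
native_enriched, _ = cross_attention(native_repr, roman_repr)

loss = alignment_loss(roman_enriched, native_enriched)
print(attn.shape, float(loss))
```

In a trained model the two inputs would come from pre-trained encoders, and the alignment term would be added to the sentiment classification loss; both of those choices are likewise assumptions of this sketch.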
Authors: Niraj Pahari, Kazutaka Shimada