A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Published 30 May 2020 in cs.CL | (2006.00210v1)

Abstract: There is an increasing demand for sentiment analysis of text from social media which are mostly code-mixed. Systems trained on monolingual data fail for code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific for this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular languages such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff's alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.

Citations (207)

Summary

The paper introduces a Malayalam-English code-mixed corpus for sentiment analysis, annotated with an inter-annotator agreement (Krippendorff's alpha) above 0.8.
Its methodology covers filtering, tokenization, and model benchmarking, with BERT achieving the best classification performance among the benchmarked models.
The dataset fills a critical gap in multilingual NLP, enabling further research into sociolinguistic phenomena and sentiment analysis in underrepresented languages.

Insights on "A Sentiment Analysis Dataset for Code-Mixed Malayalam-English"

The paper "A Sentiment Analysis Dataset for Code-Mixed Malayalam-English" addresses the growing need for sentiment analysis resources that cater to code-mixing, which is increasingly prevalent in multilingual social media communication. It contributes to the field of NLP by presenting a new gold standard corpus designed specifically for Malayalam-English code-mixed text, a language pair for which no such dataset previously existed.

Overview of Contributions

The authors introduce a corpus tailored for Malayalam-English code-mixed sentiment analysis and elaborate on its collection and annotation process. The focus on Malayalam is particularly relevant due to its status as a major language in the Dravidian family, with a substantial speaker base across India and other countries. Notably, due to the intricate and agglutinative nature of the Malayalam language, the creation of code-mixed datasets presents unique challenges compared to more widely studied language pairs.

Corpus Creation and Annotation Process

Corpus Compilation: The dataset was compiled from user comments on Malayalam movie trailers from YouTube. The choice of social media as a data source is strategic, given its rich repository of informal, multilingual exchanges.
Filtering and Preprocessing: A preliminary filtering step excluded monolingual comments so that the corpus contains only code-mixed content. Preprocessing included tokenization and the removal of comments written purely in Malayalam script, which keeps the code-mixed framing consistent.
Annotation Protocol: The sentiment labels assigned to the data were Positive, Negative, Mixed Feelings, Neutral, and Not in intended language. The annotation was conducted by proficient bilingual speakers and followed a structured protocol to ensure high inter-annotator agreement, evidenced by Krippendorff's alpha exceeding 0.8.
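
The agreement figure in the annotation step can be made concrete with a short sketch. The code below computes Krippendorff's alpha for nominal labels from a coincidence matrix. It is an illustration only: the function name, the toy labels, and the choice of the nominal distance metric are assumptions, and the summary does not say which tool or metric variant the authors used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: each inner list holds the labels that the
    annotators gave to one comment, with None marking a missing rating.
    (Hypothetical helper for illustration, not the authors' code.)
    """
    # Coincidence matrix: every ordered pair of ratings from different
    # annotators within a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for ratings in units:
        labels = [r for r in ratings if r is not None]
        m = len(labels)
        if m < 2:
            continue  # a unit rated only once says nothing about agreement
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the total number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    # Both disagreement terms are kept unnormalised; the shared factor 1/n cancels.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy ratings from two hypothetical annotators (made-up data).
    toy = [
        ["Positive", "Positive"],
        ["Positive", "Negative"],
        ["Negative", "Negative"],
        ["Negative", "Negative"],
    ]
    print(round(krippendorff_alpha_nominal(toy), 4))
```

Alpha equals 1 when annotators agree on every comment and falls toward 0 as agreement approaches chance level, so a value above 0.8 is conventionally read as reliable annotation.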

Experimental Evaluation

To benchmark the proposed dataset, the authors employed various machine learning and deep learning models. Traditional models such as Logistic Regression and Support Vector Machines were benchmarked alongside advanced approaches using Dynamic Meta-Embeddings (DME), Contextualized DME (CDME), 1-Dimensional Convolution (1DConv), and BERT.

Key Findings

Performance Metrics: BERT emerged as the most effective model, achieving the best classification metrics among the benchmarked models, which points to the value of transfer learning for handling the complexities of code-mixed text. Pre-trained embeddings also improved model performance by supplying contextualized and dynamic word representations.
Benchmark Results: These results establish a reference point for future studies of Malayalam-English code-mixed sentiment analysis and for further work in multilingual NLP.
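
To show how models on a multi-class, likely imbalanced label set such as this one are typically compared, the sketch below computes per-class F1 together with macro and support-weighted averages. This summary does not state which metrics or averaging scheme the paper reports, so the choice of F1, the function name, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def f1_report(gold, predicted):
    """Return (per-class F1, macro F1, support-weighted F1).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # predicted label p was wrong
            false_neg[g] += 1  # gold label g was missed

    labels = sorted(set(gold) | set(predicted))
    support = Counter(gold)
    per_class = {}
    for label in labels:
        predicted_count = true_pos[label] + false_pos[label]
        gold_count = true_pos[label] + false_neg[label]
        precision = true_pos[label] / predicted_count if predicted_count else 0.0
        recall = true_pos[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        per_class[label] = 2 * precision * recall / denom if denom else 0.0

    # Macro: every class counts equally. Weighted: classes count by frequency.
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(gold)
    return per_class, macro, weighted


if __name__ == "__main__":
    # Made-up labels, only to exercise the function.
    gold = ["Positive", "Positive", "Positive", "Negative"]
    pred = ["Positive", "Positive", "Negative", "Negative"]
    per_class, macro, weighted = f1_report(gold, pred)
    print(per_class, round(macro, 4), round(weighted, 4))
```

The macro average treats rare classes such as Mixed Feelings the same as frequent ones, while the weighted average follows the class distribution, so reporting both makes it easier to see whether a model only does well on the majority class.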

Implications and Future Directions

This dataset fills a critical gap by providing a robust testbed for the development and evaluation of sentiment analysis models tailored to code-mixed languages. The corpus can significantly enhance the scope of studies in sociolinguistic phenomena and NLP applications in underrepresented language pairs. Practically, this resource has potential applications in real-time sentiment analysis for businesses, media influencers, and policy makers seeking insights from multilingual communities.

Future work could build upon this study by extending the dataset to incorporate more sophisticated syntactic and semantic features of code-mixed languages, including discourse-level annotation, and applying it to a broader set of languages and dialects. Additionally, exploring unsupervised and semi-supervised learning paradigms could further improve sentiment classification performance in resource-scarce contexts. This line of research ultimately aims to foster more effective cross-cultural communication and understanding through advanced computational techniques.
