Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text (2006.00206v1)

Published 30 May 2020 in cs.CL

Abstract: Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

Summary

The paper introduces TamilMixSentiment, a sentiment-annotated corpus of 15,744 code-mixed YouTube comments that addresses the shortage of annotated data for a low-resourced language pair.
The annotation was crowdsourced and reached a Krippendorff's alpha of 0.6, which indicates moderate inter-annotator agreement.
In the reported experiments, logistic regression and random forest models achieve higher F-scores than the other techniques tested, and class imbalance in the sentiment labels remains a central challenge.

The paper "Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text" describes the construction of a sentiment-annotated corpus of code-mixed Tamil-English text, commonly referred to as Tanglish. The resource addresses the shortage of annotated data for low-resourced language pairs such as Tamil-English in multilingual settings like social media.

Overview of Corpus Construction

The researchers introduce TamilMixSentiment, a dataset of 15,744 YouTube comment posts curated for sentiment analysis. It is intended as a gold standard for the sentiment polarity of code-mixed Tanglish text and reflects how bilingual communities actually write. Drawn from YouTube comments, where language mixing is common, the dataset consists predominantly of sentences in Roman script that mix Tamil and English expressions. Corpus creation included a filtering step that excluded content that was neither Tamil nor English, leaving a cleaner dataset focused on genuinely code-mixed text.
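To make the filtering step concrete, the following is a minimal sketch of one way to discard comments written in neither Tamil nor Roman script. It is not the authors' pipeline: the helper names script_of and keep_comment and the sample comments are hypothetical, and the rule (reject any comment containing a letter outside the Unicode Tamil block or the Latin alphabet) is an assumption made for illustration.

```python
import unicodedata

# Unicode "Tamil" block. The range is standard; using it as a filter is an assumption.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def script_of(ch):
    """Return 'tamil', 'latin', 'other', or None for non-letter characters."""
    if not ch.isalpha():
        return None  # digits, punctuation, emoji, and combining signs are ignored
    if TAMIL_START <= ord(ch) <= TAMIL_END:
        return "tamil"
    if "LATIN" in unicodedata.name(ch, ""):
        return "latin"
    return "other"


def keep_comment(text):
    """Keep a comment only if it has letters and all of them are Tamil or Latin."""
    scripts = {script_of(ch) for ch in text} - {None}
    return bool(scripts) and "other" not in scripts


# Hypothetical sample comments, not taken from the corpus.
comments = [
    "Trailer vera level bro",   # Roman script only
    "படம் super",               # Tamil script mixed with Roman script
    "यह फिल्म अच्छी है",          # Devanagari script, rejected
    "12345 !!!",                # no letters at all, rejected
]
kept = [c for c in comments if keep_comment(c)]
```

A script check of this kind cannot tell a fully English comment from a Romanized Tamil-English one, so identifying genuinely code-mixed text would need an additional language-level step.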

Annotation Framework

A key element of the paper is its annotation framework, which uses crowdsourcing to label the comment data. Inter-annotator agreement is measured with Krippendorff's alpha and comes to 0.6, which indicates moderate reliability rather than strong agreement. The annotation guidelines supported meaningful sentiment tagging while accommodating the complexities of code-switching in bilingual text.
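For readers unfamiliar with the statistic, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed as 1 - D_o / D_e from a coincidence matrix. The function name krippendorff_alpha_nominal and the toy labels are assumptions for illustration; the paper's own computation and data are not reproduced here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of lists, each holding the labels assigned to one item.

    Items may have different numbers of ratings; items with fewer than two
    ratings are skipped because they carry no agreement information.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of ratings from different annotators on the same
        # item contributes 1 / (m - 1) to the coincidence matrix.
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Both disagreement terms below are scaled by n, which cancels in the ratio.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Toy data: three hypothetical annotators labelling four comments.
toy_units = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "positive"],
    ["neutral", "neutral", "neutral"],
    ["positive", "negative", "positive"],
]
alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.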

Experimental Setup and Results

The paper reports experiments with a range of machine learning techniques, including logistic regression, random forest, decision trees, and neural networks. Logistic regression and random forest consistently achieved higher F-scores than the other models, which may reflect greater robustness to the imbalanced dataset, whose labels skew heavily towards positive sentiment.

The researchers highlight the challenges of an imbalanced dataset in which positive comments far outnumber negative or neutral ones. This class imbalance helps explain the skew in model performance, and it particularly lowers precision and recall for the minority classes.
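The effect of imbalance on averaged scores can be seen in a small sketch. The code below computes per-class precision, recall, and F1, then compares the macro average (every class counts equally) with the support-weighted average. The label counts and the always-positive classifier are invented for illustration and are not results from the paper; the function names are hypothetical.

```python
from collections import Counter


def per_class_scores(gold, pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_and_weighted_f1(gold, pred):
    """Macro F1 averages classes equally; weighted F1 weights by class support."""
    scores = per_class_scores(gold, pred)
    support = Counter(gold)
    macro = sum(f1 for _p, _r, f1 in scores.values()) / len(scores)
    weighted = sum(scores[lab][2] * support[lab] for lab in scores) / len(gold)
    return macro, weighted


# Toy, heavily skewed gold labels and a classifier that always says "positive".
gold = ["positive"] * 8 + ["negative", "neutral"]
pred = ["positive"] * 10
macro_f1, weighted_f1 = macro_and_weighted_f1(gold, pred)
# macro_f1 is about 0.30 while weighted_f1 is about 0.71: the weighted average
# hides the fact that both minority classes are never predicted.
```

This is why per-class or macro-averaged scores are worth checking alongside weighted ones when the label distribution is skewed.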

Implications and Future Directions

The implications of this research extend beyond this single dataset. The corpus fills a gap in computational linguistics and supports further NLP work on code-mixed languages. It can serve as a basis for developing algorithms that process multilingual content, which is common in social networks and digital communication.

From a theoretical standpoint, the research underscores the importance of the linguistic nuances of code-switching and points to future work on model architectures that account for bilingual context. Advanced embedding techniques or meta-learning strategies may help improve sentiment classification.

The work also raises questions about how adequate and practical model deployment is in low-resource language settings, and it encourages future research to move beyond dataset construction towards the linguistic challenges intrinsic to code-mixed text.

In conclusion, TamilMixSentiment is a significant contribution to multilingual sentiment analysis, and it opens the way for further theoretical and practical developments in this area.