- The paper introduces Subword-LSTM to address challenges in sentiment analysis on noisy Hindi-English code-mixed text.
- It employs sub-word level compositions with 1-D convolutions to extract morpheme-like features, achieving a 4-5% improvement over conventional methods.
- The study presents a curated annotated Hi-En dataset, demonstrating the model's effectiveness with nearly 70% accuracy and an F1-score of 0.658.
Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code-Mixed Text
The paper "Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text" addresses the challenges posed by code-mixed social media data prevalent in multilingual societies like India. The authors introduce a novel dataset consisting of Hindi-English (Hi-En) code-mixed text and propose a new deep learning model, Subword-LSTM, designed to perform sentiment analysis with enhanced accuracy on such data.
In multilingual settings where code-mixing is common, the fusion of multiple languages within a single utterance complicates traditional NLP tasks. The authors identify the absence of a suitable annotated dataset as a major barrier to advancing sentiment analysis in such contexts. To address this, they create a comprehensive Hi-En code-mixed dataset, which is systematically annotated for sentiment polarity.
Key Contributions and Methodology
The paper makes several significant contributions:
- Dataset Creation and Annotation: The authors curate a Hi-En dataset primarily sourced from public Facebook pages, annotated for sentiment polarity. This dataset is critical for evaluating sentiment analysis methods tailored for code-mixed data. It is characterized by short, noisy sentences with non-standard spellings and grammar.
- Subword-LSTM Architecture: The primary methodological innovation is the Subword-LSTM, which utilizes sub-word level representations instead of traditional character or word-level approaches. By adopting 1-D convolutions on character inputs to generate morpheme-like feature maps, the model learns to extract sentiment information effectively, even in the presence of noisy and misspelled text.
- Performance Evaluation: The model exhibits significant performance improvements, achieving 4-5% higher accuracy than conventional approaches and outperforming the existing state-of-the-art system for Hi-En code-mixed sentiment analysis by roughly 18%.
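To make the sub-word idea concrete, the sketch below shows how 1-D convolutions over character embeddings can yield morpheme-like feature maps of the kind the Subword-LSTM consumes. This is a minimal illustration, not the authors' implementation: the embedding size, filter count, and kernel width are hypothetical, the weights are random, and in the full model the resulting feature sequence would be fed into an LSTM rather than used directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters for illustration only.
CHAR_EMB = 8    # character embedding dimension
N_FILTERS = 16  # number of 1-D convolution filters
KERNEL = 3      # filter width: each filter spans a character trigram

chars = "abcdefghijklmnopqrstuvwxyz "
char_to_id = {c: i for i, c in enumerate(chars)}
emb = rng.normal(size=(len(chars), CHAR_EMB))             # character embedding table
filters = rng.normal(size=(N_FILTERS, KERNEL, CHAR_EMB))  # convolution filters

def subword_features(text):
    """1-D convolution over character embeddings -> morpheme-like feature maps.

    Because filters slide over characters, misspellings that preserve most
    character n-grams (e.g. "acha" vs "achha") still activate similar filters.
    """
    ids = [char_to_id[c] for c in text.lower() if c in char_to_id]
    x = emb[ids]                                   # (seq_len, CHAR_EMB)
    seq_len = x.shape[0]
    out = np.empty((seq_len - KERNEL + 1, N_FILTERS))
    for t in range(seq_len - KERNEL + 1):
        window = x[t:t + KERNEL]                   # (KERNEL, CHAR_EMB)
        # ReLU over each filter's response to this character window
        out[t] = np.maximum(
            0.0, np.tensordot(filters, window, axes=([1, 2], [0, 1]))
        )
    return out  # in the full model, this sequence feeds an LSTM

feats = subword_features("bahut acha movie")
print(feats.shape)
```

Sliding filters over raw characters rather than whole tokens is what gives the model robustness to the non-standard spellings common in romanized Hindi, since no fixed word vocabulary is required.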
Experimental Insights
A comparative analysis reveals that standard sentiment analysis techniques such as word embeddings or parse-tree based methods are ill-suited to the high sparsity and linguistic heterogeneity of code-mixed data. The Subword-LSTM's ability to leverage sub-word features leads to superior sentiment analysis performance: accuracy on the Hi-En dataset stands at nearly 70%, with an F1-score of 0.658, demonstrating the architecture's effectiveness.
Implications and Future Directions
The introduction of the Subword-LSTM shows promise for sentiment analysis in multilingual, code-mixed environments by harnessing linguistic priors encoded in sub-word structures. This approach not only addresses the vocabulary diversity and sparseness issues inherent to such data but also sets the stage for similar applications in other code-mixed languages beyond Hindi-English.
Future research could extend the approach to larger datasets and explore scaling the Subword-LSTM architecture for better generalization and performance. Further experimentation with deeper neural architectures could also advance sentiment analysis techniques for the broad array of noisy, code-mixed language data prevalent across social media platforms.