- The paper introduces Subword-LSTM to address challenges in sentiment analysis on noisy Hindi-English code-mixed text.
- It employs sub-word level compositions with 1-D convolutions to extract morpheme-like features, achieving a 4-5% improvement over conventional methods.
- The study presents a curated annotated Hi-En dataset, demonstrating the model's effectiveness with nearly 70% accuracy and an F1-score of 0.658.
Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code-Mixed Text
The paper "Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text" addresses the challenges posed by code-mixed social media data prevalent in multilingual societies like India. The authors introduce a novel dataset consisting of Hindi-English (Hi-En) code-mixed text and propose a new deep learning model, Subword-LSTM, designed to perform sentiment analysis with enhanced accuracy on such data.
In multilingual settings where code-mixing is common, the fusion of multiple languages within a single utterance complicates traditional NLP tasks. The authors identify the absence of a suitable annotated dataset as a major barrier to advancing sentiment analysis in such contexts. To address this, they create a comprehensive Hi-En code-mixed dataset, which is systematically annotated for sentiment polarity.
Key Contributions and Methodology
The paper makes several significant contributions:
- Dataset Creation and Annotation: The authors curate a Hi-En dataset primarily sourced from public Facebook pages, annotated for sentiment polarity. This dataset is critical for evaluating sentiment analysis methods tailored for code-mixed data. It is characterized by short, noisy sentences with non-standard spellings and grammar.
- Subword-LSTM Architecture: The primary methodological innovation is the Subword-LSTM, which utilizes sub-word level representations instead of traditional character or word-level approaches. By adopting 1-D convolutions on character inputs to generate morpheme-like feature maps, the model learns to extract sentiment information effectively, even in the presence of noisy and misspelled text.
- Performance Evaluation: The model exhibits significant performance improvements, achieving 4-5% higher accuracy than conventional approaches and outperforming the existing state-of-the-art system for Hi-En code-mixed sentiment analysis by roughly 18%.
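To make the sub-word idea concrete, the sketch below shows how 1-D convolutions over character embeddings can yield morpheme-like feature maps of the kind the Subword-LSTM consumes. This is a minimal illustration, not the authors' implementation: the embedding size, filter count, and kernel width are hypothetical, the weights are random, and in the full model the resulting feature sequence would be fed into an LSTM rather than used directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters for illustration only.
CHAR_EMB = 8    # character embedding dimension
N_FILTERS = 16  # number of 1-D convolution filters
KERNEL = 3      # filter width: each filter spans a character trigram

chars = "abcdefghijklmnopqrstuvwxyz "
char_to_id = {c: i for i, c in enumerate(chars)}
emb = rng.normal(size=(len(chars), CHAR_EMB))             # character embedding table
filters = rng.normal(size=(N_FILTERS, KERNEL, CHAR_EMB))  # convolution filters

def subword_features(text):
    """1-D convolution over character embeddings -> morpheme-like feature maps.

    Because filters slide over characters, misspellings that preserve most
    character n-grams (e.g. "acha" vs "achha") still activate similar filters.
    """
    ids = [char_to_id[c] for c in text.lower() if c in char_to_id]
    x = emb[ids]                                   # (seq_len, CHAR_EMB)
    seq_len = x.shape[0]
    out = np.empty((seq_len - KERNEL + 1, N_FILTERS))
    for t in range(seq_len - KERNEL + 1):
        window = x[t:t + KERNEL]                   # (KERNEL, CHAR_EMB)
        # ReLU over each filter's response to this character window
        out[t] = np.maximum(
            0.0, np.tensordot(filters, window, axes=([1, 2], [0, 1]))
        )
    return out  # in the full model, this sequence feeds an LSTM

feats = subword_features("bahut acha movie")
print(feats.shape)
```

Sliding filters over raw characters rather than whole tokens is what gives the model robustness to the non-standard spellings common in romanized Hindi, since no fixed word vocabulary is required.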
Experimental Insights
A comparative analysis reveals that standard sentiment analysis techniques such as word embeddings or parse-tree based methods are ill-suited to the high sparsity and linguistic heterogeneity of code-mixed data. The Subword-LSTM's ability to leverage sub-word features leads to superior sentiment analysis performance: accuracy on the Hi-En dataset stands at nearly 70%, with an F1-score of 0.658, demonstrating the architecture's effectiveness.
Implications and Future Directions
The introduction of the Subword-LSTM shows promise for sentiment analysis in multilingual, code-mixed environments by harnessing linguistic priors encoded in sub-word structures. This approach not only addresses the vocabulary diversity and sparseness issues inherent to such data but also sets the stage for similar applications in other code-mixed languages beyond Hindi-English.
Future research could extend the approach to larger datasets and explore scaling the Subword-LSTM architecture for better generalization and performance. Further experimentation with deeper neural architectures could also advance sentiment analysis techniques for the broad array of noisy, code-mixed language data prevalent across social media platforms.