BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs (1704.06125v1)

Published 20 Apr 2017 in cs.CL and stat.ML

Abstract: In this paper we describe our attempt at producing a state-of-the-art Twitter sentiment classifier using Convolutional Neural Networks (CNNs) and Long Short Term Memory (LSTM) networks. Our system leverages a large amount of unlabeled data to pre-train word embeddings. We then use a subset of the unlabeled data to fine-tune the embeddings using distant supervision. The final CNNs and LSTMs are trained on the SemEval-2017 Twitter dataset, where the embeddings are fine-tuned again. To boost performance we ensemble several CNNs and LSTMs together. Our approach achieved first rank on all of the five English subtasks amongst 40 teams.



Summary

The paper presents an ensemble of CNNs and LSTMs that ranked first among 40 teams on all five English subtasks of SemEval-2017 Task 4.
It employs a three-stage training strategy (unsupervised, distant, and supervised) to optimize sentiment classification of tweets.
The ensemble combines models whose word embeddings were pre-trained on unlabeled tweets with different algorithms (Word2vec and FastText), which improved prediction performance.

Twitter Sentiment Analysis with CNNs and LSTMs: Insights from SemEval-2017 Task 4

In this paper, Mathieu Cliche presents a system for Twitter sentiment analysis that leverages Convolutional Neural Networks (CNNs) and Long Short Term Memory (LSTM) networks. The system ranked first among 40 teams on all five English subtasks of the SemEval-2017 Task 4 competition. It serves as a case study in combining deep learning methods with large-scale pre-training on unlabeled data to improve sentiment classification of tweets.

System Architecture

The architecture consists of CNNs and LSTMs configured to handle raw tweets and predict sentiment with high accuracy. The CNN component draws inspiration from existing work, resembling the architecture proposed by Kim (2014) with minor modifications. It processes input tweets encoded as word embeddings, followed by convolution and max-pooling operations to capture relevant n-gram features across the tweets. The LSTM component employs a bi-directional approach to incorporate context from both directions in a tweet, effectively handling the sequential nature of language data.
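The convolution and max-pooling step described above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function name `conv_max_pool`, the ReLU activation, the toy two-dimensional embeddings, and the single hand-written filter are all assumptions chosen for illustration.

```python
def conv_max_pool(embeddings, filt, bias=0.0):
    """Apply one convolutional filter over word windows, then max-pool over time.

    embeddings: list of word vectors, one per token in the tweet.
    filt: list of weight rows, one row per word the filter spans (n-gram size).
    Returns a single feature value for this filter.
    """
    width = len(filt)
    if len(embeddings) < width:
        raise ValueError("tweet is shorter than the filter width; pad it first")

    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias
        # Dot product between the filter and the concatenated window.
        for word_vec, filt_row in zip(window, filt):
            total += sum(w * f for w, f in zip(word_vec, filt_row))
        activations.append(max(0.0, total))  # ReLU (assumed activation)

    # Max-over-time pooling keeps the strongest n-gram response.
    return max(activations)


# Toy example: three tokens with 2-dimensional embeddings, one bigram filter.
tweet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_pool(tweet, bigram_filter)
```

A full model in this style would apply many filters of several widths and feed the pooled features into a classification layer; the sketch shows only one filter so the pooling step stays visible.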

The ensemble method utilized in the system combines results from multiple CNN and LSTM models, each trained with varying hyperparameters and distinct embedding pre-training algorithms, namely Word2vec and FastText. This ensemble approach mitigates variance in predictions and substantially enhances performance.
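One common way to combine model outputs is to average their class probabilities (soft voting). The sketch below assumes that scheme, which this summary does not specify; the function name `soft_vote` and the example probabilities are hypothetical.

```python
def soft_vote(model_probs):
    """Average class probabilities across models.

    model_probs: list with one probability vector per model, all the same length.
    Returns (averaged probabilities, index of the winning class).
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    averaged = [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, winner


# Hypothetical outputs of three models over (negative, neutral, positive).
predictions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
avg_probs, label_index = soft_vote(predictions)
```

Averaging reduces variance most when the individual models make different mistakes, which is one reason to vary hyperparameters and embedding algorithms across ensemble members.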

Data and Training Strategy

The training process proceeds in three stages: unsupervised, distant, and supervised training. First, word embeddings are trained with unsupervised algorithms on a large dataset of 100 million unlabeled tweets. Next, these pre-trained embeddings are fine-tuned through distant supervision on a subset of the unlabeled tweets that are labeled automatically by the emoticons they contain, which infuses the embeddings with sentiment information. Finally, the CNN and LSTM models are trained on the human-labeled SemEval-2017 Twitter dataset, which draws on data from previous SemEval competitions, and the embeddings are fine-tuned once more.
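The distant-supervision stage relies on labeling tweets automatically from emoticons. A minimal sketch of such a labeling rule follows; the emoticon lists, the decision to discard tweets with mixed or no emoticons, and the removal of the emoticon from the text are assumptions for illustration, not details confirmed by this summary.

```python
# Hypothetical emoticon lists; a real system would likely use a larger set.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def distant_label(tweet):
    """Return (cleaned_text, label) for a tweet, or None if it cannot be labeled.

    A tweet is labeled only when it contains emoticons of exactly one polarity.
    The emoticons are stripped so a model cannot simply memorize them.
    """
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:  # neither polarity, or both: ambiguous
        return None

    label = "positive" if has_pos else "negative"
    text = tweet
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        text = text.replace(emoticon, "")
    return " ".join(text.split()), label


labeled = [distant_label(t) for t in [
    "great game today :)",
    "missed my bus :(",
    "just landed",
]]
```

Labels produced this way are noisy, which is why this stage is used to adjust the embeddings before the final training on human-labeled data rather than to replace it.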

Several data handling and training enhancements are critical to the system's success. Preprocessing steps and subtask-specific training strategies, such as handling the target topic in tweets for topic-based subtasks, further boost performance. Dropout mitigates overfitting, and class weighting in the loss function counteracts class imbalance.
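Class weighting can be sketched as a weighted cross-entropy in which rarer classes receive larger weights. The inverse-frequency formula below is one common choice and is an assumption here; the summary does not state which weighting scheme the system uses, and the function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, true_label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class.

    probs: dict mapping class name to predicted probability.
    """
    return -weights[true_label] * math.log(probs[true_label])


# Imbalanced toy training labels: 6 positive, 2 negative, 4 neutral.
train_labels = ["pos"] * 6 + ["neg"] * 2 + ["neu"] * 4
class_weights = inverse_frequency_weights(train_labels)

# The same predicted probability costs more on the rare class.
probs = {"pos": 0.25, "neg": 0.25, "neu": 0.5}
loss_rare = weighted_cross_entropy(probs, "neg", class_weights)
loss_common = weighted_cross_entropy(probs, "pos", class_weights)
```

With these weights, an error on the rare negative class is penalized three times as heavily as the same error on the frequent positive class.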

Results and Analysis

The ensemble ranked first in all five English subtasks of the SemEval-2017 Task 4 competition. For instance, the system achieved a macro-averaged recall of 0.681 for subtask A, and it likewise outperformed the competing systems on the other subtasks.
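Macro-averaged recall, the metric quoted for subtask A, is the unweighted mean of the per-class recalls, so each class counts equally regardless of its size. A minimal sketch with made-up labels (the function name `macro_recall` and the example data are illustrative only):

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Made-up gold labels and predictions for six tweets.
gold = ["pos", "pos", "neg", "neu", "neu", "neu"]
pred = ["pos", "neg", "neg", "neu", "neu", "pos"]

# Per-class recall: pos 1/2, neg 1/1, neu 2/3.
score = macro_recall(gold, pred)
```

Because every class contributes equally, a classifier that always predicts the majority class scores poorly on this metric even when its plain accuracy looks acceptable.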

Evaluations on the historical test sets from 2013 to 2016 showed that ensembling improves on the accuracy of the individual models. Correlation matrices of the models' outputs supported combining models trained under different settings, since models whose predictions are less correlated contribute more to an ensemble.
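A correlation matrix between model outputs can be computed as in the sketch below. It assumes each model is summarized by one score per tweet and uses Pearson correlation; the model names and scores are invented for illustration, and the summary does not say which correlation measure the paper reports.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def correlation_matrix(model_outputs):
    """Return {(model_a, model_b): correlation} for every pair of models."""
    names = list(model_outputs)
    return {
        (a, b): pearson(model_outputs[a], model_outputs[b])
        for a in names
        for b in names
    }


# Invented positive-class scores of three models on four tweets.
outputs = {
    "cnn_word2vec": [0.9, 0.2, 0.6, 0.1],
    "cnn_fasttext": [0.8, 0.3, 0.5, 0.2],
    "lstm_word2vec": [0.4, 0.7, 0.9, 0.1],
}
matrix = correlation_matrix(outputs)
```

Pairs with lower correlation are the more useful ones to ensemble, because their errors are less likely to coincide.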

Implications for Future Research

Combining CNNs and LSTMs in an ensemble yields a robust sentiment analysis system: the two architectures capture different aspects of a tweet, and ensembling reduces the variance of their predictions. Future work might explore architectures that merge CNN and LSTM capabilities into a single unified model, along the lines of recent hybrid models that exploit the sequential structure of text more directly.

Moreover, understanding the optimal scales of unlabeled and distantly supervised data could refine training processes, potentially streamlining training resources while preserving accuracy. The exploration into topic-aware embeddings holds promise for improvements in context-sensitive sentiment tasks.

By advancing methodologies in sentiment analysis, systems like those described in this paper highlight the practicality and enduring need for precise social media text analysis as its applications expand within and beyond academic domains.
