AffectiveTweets Regression System
- AffectiveTweets Regression System is a framework that estimates continuous emotion intensity in tweets using innovative machine learning and regression approaches.
- It integrates diverse features including lexica, n-grams, word embeddings, deep neural representations, and stylometric markers to capture affect nuances.
- Evaluation via Pearson correlation on SemEval benchmarks demonstrates its effectiveness for applications in social media analysis and computational psychology.
The AffectiveTweets Regression System encompasses a suite of methodologies for estimating the real-valued intensity of emotions conveyed in tweets, with primary application to data produced for tasks such as the SemEval "Affect in Tweets" challenge. Central to these systems are machine learning pipelines that integrate extensive, heterogeneous feature engineering—including affect lexica, dense word embeddings, deep neural representations, and tweet-specific stylometric markers—and sophisticated regression/classification frameworks, such as L2-regularized SVR, ensemble methods, and mixture-of-experts architectures. The accuracy of such systems is assessed in terms of Pearson correlation (and related metrics) between predicted and gold-standard fine-grained emotion scores, which are typically derived via rigorous annotation schemes like best–worst scaling.
1. Problem Definition and Data Foundation
The core task addressed by AffectiveTweets Regression Systems is the mapping of a tweet to a continuous-valued emotion intensity $y \in [0, 1]$, either for basic emotions (anger, fear, joy, sadness) or valence, with models trained to minimize squared prediction error. Datasets are generated using emotion-focused sampling, yielding corpora on the order of 7,000 English tweets distributed among these emotions, with careful control over lexical and author overlap. Annotation employs best–worst scaling (BWS), where tweets are presented to annotators in maximally diverse 4-tuples. Annotator judgments are converted to a real-valued score $s(t) = \%\text{best}(t) - \%\text{worst}(t)$, where $\%\text{best}(t)$ and $\%\text{worst}(t)$ are the fractions of tuples in which tweet $t$ was judged most and least intense, respectively; scores are linearly rescaled to $[0, 1]$. Reliability estimates report split-half Pearson $r$ between 0.80 and 0.88 depending on emotion (Mohammad et al., 2017).
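The BWS-to-intensity conversion described above can be sketched as follows (a minimal illustration; the annotation format and function name are assumptions, not the authors' code):

```python
from collections import Counter

def bws_scores(annotations):
    """Convert best-worst scaling judgments to intensities in [0, 1].

    `annotations` is a list of (items, best, worst) triples: the 4-tuple
    of tweets shown to an annotator, plus the ones judged most and least
    intense.  Each tweet's raw score is the fraction of its appearances
    in which it was chosen best minus the fraction in which it was
    chosen worst, then rescaled from [-1, 1] to [0, 1].
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        shown.update(items)
        best[b] += 1
        worst[w] += 1
    return {t: ((best[t] - worst[t]) / shown[t] + 1.0) / 2.0 for t in shown}
```

Real BWS pipelines aggregate many annotators and tuples per tweet; the counting logic stays the same.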
2. Feature Engineering Paradigms
AffectiveTweets systems extract and fuse diverse feature sets, optimized for lexical, semantic, and syntactic coverage:
- N-gram Features: Tokenized tweet representations include binary indicators for word or character n-grams (unigram up to 4-gram for words; char 3–5-grams), with negations systematically marked.
- Word Embeddings: Dense representations, most notably 400-dim skip-gram word2vec (trained on 10M Twitter messages) or 300-dim GloVe vectors, provide aggregated semantic profiles for each tweet via element-wise averaging.
- Affect Lexicon Features: Scores from up to ten sentiment/affect lexica—e.g., AFINN, Bing Liu, MPQA, NRC-Affect-Intensity, NRC-EmoLex, NRC Hashtag Sentiment—are summed or counted per category for each tweet.
- Deep-Emoji and Neural Features: Pretrained semantically-rich embeddings (attention and softmax activations from Deep-Emoji, skip-thought 4,800-dim sentence vectors, and the 4,096-dim unsupervised sentiment neuron) capture compositional and emotive context, especially for short informal text.
- Stylometric Features: Counts of emoticons, part-of-speech classes, punctuation, average word length, and related measures further characterize tweet style and informality.
- Hashtag Intensity: Mean intensities of hashtagged emotion words (from Depeche Mood dictionary) quantify implicit self-annotation.
All features are concatenated into a high-dimensional tweet vector $\mathbf{x} \in \mathbb{R}^d$ (with $d$ determined by the union of all feature families), standardized, and input to the regression/classification components (Oota et al., 2019, Mohammad et al., 2017).
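As an illustration of this fusion step, the following sketch concatenates an averaged-embedding block, lexicon aggregates, and a few stylometric counts into one vector (the function name, argument shapes, and exact stylometric choices are assumptions for demonstration):

```python
import numpy as np

def tweet_vector(tokens, embeddings, lexicon, dim=400):
    """Fuse feature families from Section 2 into one dense vector.

    `embeddings` maps token -> length-`dim` np.ndarray (e.g. 400-dim
    Twitter word2vec); `lexicon` maps token -> affect score.
    """
    # 1) Dense semantics: element-wise average of known word vectors.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    emb = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    # 2) Affect lexicon: summed score plus count of matched entries.
    hits = [lexicon[t] for t in tokens if t in lexicon]
    lex = np.array([sum(hits), len(hits)], dtype=float)
    # 3) Simple stylometric markers: token count, mean word length,
    #    and exclamation/question punctuation count.
    sty = np.array([len(tokens),
                    np.mean([len(t) for t in tokens]) if tokens else 0.0,
                    sum(t in {"!", "?"} for t in tokens)], dtype=float)
    # Concatenate; standardization would happen downstream over the corpus.
    return np.concatenate([emb, lex, sty])
```

Sparse n-gram indicators would be appended analogously, typically via a hashed or vocabulary-indexed block.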
3. Modeling and Learning Architectures
Single Model Regression
Early reference implementations employ L2-regularized L2-loss SVR (LIBLINEAR), optimizing

$$\min_{\mathbf{w}} \; \frac{1}{2}\,\mathbf{w}^\top\mathbf{w} \;+\; C \sum_{i} \max\!\big(0,\; |y_i - \mathbf{w}^\top\mathbf{x}_i| - \epsilon\big)^2,$$

with $C$ tuned to maximize held-out Pearson $r$ (Mohammad et al., 2017). Alternative baselines include unigrams, n-grams, or embeddings as standalone feature sets.
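A held-out-Pearson model-selection loop in this spirit can be sketched with scikit-learn's `LinearSVR`, whose squared-epsilon-insensitive loss corresponds to the L2-loss SVR formulation (the grid of `C` values and the function name are assumptions):

```python
from scipy.stats import pearsonr
from sklearn.svm import LinearSVR

def fit_svr(X_tr, y_tr, X_dev, y_dev, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Fit an L2-regularized L2-loss SVR, picking C by held-out Pearson r.

    loss="squared_epsilon_insensitive" matches the LIBLINEAR L2-loss
    objective; the C grid here is illustrative.
    """
    best_r, best_model = -2.0, None
    for C in Cs:
        model = LinearSVR(C=C, loss="squared_epsilon_insensitive",
                          epsilon=0.0, max_iter=10000)
        model.fit(X_tr, y_tr)
        r = pearsonr(y_dev, model.predict(X_dev))[0]
        if r > best_r:
            best_r, best_model = r, model
    return best_model
```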
Mixture-of-Experts ("Experts Model")
Recent advances center on a Mixture-of-Experts (MoE) ensemble, where $K$ experts $f_1, \ldots, f_K$ (e.g., Gradient Boosting, XGBoost, LightGBM, Random Forest, a shallow neural network) are independently pre-trained. An input-conditioned gating network (parameterized by $\theta$) outputs softmax-normalized weights $\alpha_k(\mathbf{x}; \theta)$:

$$\alpha_k(\mathbf{x}; \theta) = \frac{\exp\big(g_k(\mathbf{x}; \theta)\big)}{\sum_{j=1}^{K} \exp\big(g_j(\mathbf{x}; \theta)\big)}.$$

Final predictions are a convex combination:

$$\hat{y}(\mathbf{x}) = \sum_{k=1}^{K} \alpha_k(\mathbf{x}; \theta)\, f_k(\mathbf{x}).$$

Gating parameters $\theta$ are optimized (with expert weights fixed) to minimize the expectation of a per-sample squared error:

$$\mathcal{L}(\theta) = \mathbb{E}_{(\mathbf{x}, y)}\Big[\big(y - \hat{y}(\mathbf{x})\big)^2\Big].$$
Training employs gradient descent; stratified cross-validation and grid search are used for expert hyperparameters (Oota et al., 2019).
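The gating step, with expert predictions held fixed, can be sketched in plain NumPy (the linear gate and all hyperparameters below are illustrative assumptions, not the published architecture):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_gate(X, y, expert_preds, lr=0.2, epochs=1000):
    """Fit a linear softmax gate over K frozen experts.

    `expert_preds`: (n, K) matrix of each pre-trained expert's prediction
    per sample.  Minimizes mean squared error of the convex combination
    sum_k alpha_k(x) f_k(x) by full-batch gradient descent.
    """
    n, d = X.shape
    K = expert_preds.shape[1]
    W = np.zeros((d, K))  # gate parameters theta
    for _ in range(epochs):
        alpha = softmax(X @ W)                      # (n, K) gate weights
        yhat = (alpha * expert_preds).sum(axis=1)   # mixture prediction
        err = yhat - y                              # (n,)
        # d(yhat)/d(logit_k) = alpha_k * (f_k - yhat), via the softmax
        # Jacobian; chain rule with the squared-error derivative.
        g = alpha * (expert_preds - yhat[:, None])  # (n, K)
        W -= lr * X.T @ (err[:, None] * g) / n
    return W
```

With a well-performing expert in the pool, the gate learns to concentrate weight on it; in practice stratified cross-validation supplies out-of-fold `expert_preds` to avoid leakage.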
4. Evaluation Procedures and Benchmarking
For regression and ordinal classification tasks, Pearson's correlation $r$ is the principal evaluation metric, monitored both over the full range of gold intensities and over the moderate-to-high subset (gold intensity $\geq 0.5$). The methodology is grounded in the SemEval-2018 "Affect in Tweets" challenge:
- Subtasks: EI-reg (regression for emotion intensity), EI-oc (ordinal), V-reg (valence regression), V-oc (valence ordinal), E-c (multi-label emotion classification).
- Dataset Splits: Thousands of tweets per emotion, with independent train, dev, and test splits. Each tweet is independently scored for each emotion.
- Baselines and Comparisons: Official SVM-unigram baselines serve as reference; the Experts Model consistently outperforms by 20–30 correlation points on regression, and 10–15 points in accuracy/F1 for classification (Oota et al., 2019).
| Subtask | Experts Model Score | Top Performer | Baseline |
|---|---|---|---|
| EI-reg (macro-avg. $r$) | 0.753 (5th/48) | 0.799 | approx. 0.52 |
| EI-oc (macro-avg. $r$) | 0.636 (5th) | 0.695 | approx. 0.47 |
| V-reg ($r$) | 0.830 (7th) | – | – |
| V-oc ($r$) | 0.738 (10th) | – | – |
| E-c (Jaccard) | 0.578 (3rd) | – | – |
Performance for individual features: Deep-Emoji features yield the highest correlations for anger/fear; skip-thoughts and affect lexica follow, with stylometric and "unsupervised sentiment neuron" features least predictive (Pearson $r$ of roughly 0.30 or below).
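The two-track Pearson evaluation can be sketched as follows (the 0.5 threshold follows the SemEval-2018 protocol; the function name is an assumption):

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(gold, pred, threshold=0.5):
    """SemEval-style scoring: Pearson r over all tweets, plus over the
    subset whose gold intensity is >= `threshold` (moderate-to-high)."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    r_all = pearsonr(gold, pred)[0]
    hi = gold >= threshold
    r_hi = pearsonr(gold[hi], pred[hi])[0]
    return r_all, r_hi
```

The restricted metric penalizes systems that rank low-intensity tweets well but blur distinctions among the strongly emotional ones.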
5. Ablation, Analysis, and Linguistic Insights
Feature ablation and analytical studies indicate that:
- Affect lexicon features confer the largest single boost in correlation over embeddings alone.
- Embeddings and lexicon features are synergistic: the combined WE+L configuration achieves the best average Pearson $r$, exceeding either feature family used individually (Mohammad et al., 2017).
- Removing n-grams has minimal effect once both lexicons and embeddings are present.
- Hashtag analysis: trailing emotion hashtags increase perceived intensity in 78.6% of instances (mean intensity 0.58 with hashtag vs. 0.47 without; significant under a Wilcoxon signed-rank test); the impact is generally positive but context-dependent.
- Cross-emotion transfer: models trained on one negative emotion generalize well to other negative emotions, whereas negative-to-positive transfer (e.g., anger $\to$ joy) yields negative or near-zero Pearson $r$ (Mohammad et al., 2017).
6. Systemic Impact and Applications
AffectiveTweets Regression Systems have been extensively evaluated in shared tasks (SemEval-2018 Task 1, etc.), setting a de facto benchmark for emotion intensity prediction in microtext. Incorporation of pretrained deep semantic features alongside domain-specific lexica enables these systems to capture nuanced, fine-grained affect expressions otherwise inaccessible to traditional sentiment analysis. They support applications in social media mining, computational psychology, e-retail, and market analytics, where precise quantification of emotion intensity is essential for downstream reasoning, tracking affect-laden trends, and human-centric decision support (Mohammad et al., 2017, Oota et al., 2019).
A plausible implication is that as these models further integrate contextual, temporally-aware, and user-specific features, performance on both cross-domain and longitudinal emotion prediction tasks may continue to improve. The consensus finding from ablation and transfer experiments is that affect-rich lexica remain foundational, even as neural representations mature.