- The paper introduces a unified benchmark framework for tweet classification across tasks including sentiment, emotion, and hate speech.
- It compares different pre-training strategies, showing that continuing to pre-train RoBERTa on Twitter data yields the best overall performance.
- The framework promotes consistency in model evaluation for social media and guides future improvements in domain-specific NLP research.
TweetEval: A Unified Benchmark for Tweet Classification
The paper "TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification" introduces a comprehensive evaluation framework aimed at enhancing the applicability and assessment of NLP models specifically for Twitter data. This benchmark addresses a significant gap in the domain of social media text processing due to the absence of a standardized evaluation mechanism akin to those available for more conventional text.
Overview of TweetEval
TweetEval curates seven classification tasks tailored to Twitter: sentiment analysis, emotion recognition, offensive language detection, hate speech detection, stance detection, emoji prediction, and irony detection. These tasks reflect the challenges peculiar to classifying tweets, such as brevity, idiosyncratic language, and platform-specific dynamics like hashtags and mentions.
In developing TweetEval, the authors integrate datasets from previous SemEval shared tasks, consolidating a broad range of tasks under a unified evaluation protocol. This fosters consistency and comparability when evaluating new models on Twitter-specific data.
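The benchmark data is also mirrored on the Hugging Face Hub under the `tweet_eval` name (an assumption about current availability; the original release is through the authors' repository). As a minimal sketch of how one task can be inspected, assuming the `datasets` library and that hub entry:

```python
# Load one TweetEval task for inspection; each task is exposed as a
# separate configuration (e.g. "emotion", "irony", "hate", "offensive",
# "sentiment", "emoji", "stance_abortion").
from datasets import load_dataset

dataset = load_dataset("tweet_eval", "emotion")

print(dataset)                                    # train/validation/test splits
print(dataset["train"][0])                        # {'text': ..., 'label': ...}
print(dataset["train"].features["label"].names)   # label inventory for the task
```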
Methodology and Results
The paper compares three pre-training strategies for transformer-based language models on tweet classification:
- Fine-tuning an existing pre-trained language model (e.g., RoBERTa) on each task directly, with no additional pre-training.
- Pre-training a language model from scratch on Twitter data alone.
- Continuing to pre-train an existing model on a supplementary Twitter corpus before task fine-tuning (see the sketch below).
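To make the third strategy concrete, the sketch below shows continued masked-language-model pre-training of RoBERTa on a tweet corpus using the Hugging Face `transformers` Trainer. The file `tweets.txt` (one tweet per line) and all hyperparameters are illustrative assumptions, not the paper's exact setup:

```python
# Sketch of strategy three: continue masked-LM pre-training of RoBERTa
# on a Twitter corpus, before any task-specific fine-tuning.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# "tweets.txt" is a hypothetical local corpus, one tweet per line.
corpus = load_dataset("text", data_files={"train": "tweets.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Dynamic masking at the standard 15% rate used for RoBERTa.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-retrained",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("roberta-retrained")
```

The resulting checkpoint would then be fine-tuned separately on each TweetEval task, in the same way as the unadapted baseline.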
The results show that RoBERTa further pre-trained on Twitter data (RoB-RT) achieves the best performance on most tasks, outperforming both the model trained from scratch on Twitter and the unadapted baseline. This finding points to the advantage of combining a large general-domain corpus with domain-specific data when targeting social media text.
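Twitter-adapted checkpoints in the spirit of RoB-RT are publicly available on the Hugging Face Hub (e.g., under the cardiffnlp organization). A minimal usage sketch, assuming the `cardiffnlp/twitter-roberta-base-sentiment` checkpoint name is still current:

```python
# Classify a tweet with a Twitter-adapted RoBERTa checkpoint.
# The model name below is an assumption; verify it on the Hub.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="cardiffnlp/twitter-roberta-base-sentiment")

print(classifier("I love the new design of your website!"))
# e.g. [{'label': 'LABEL_2', 'score': 0.98}]  (LABEL_2 maps to "positive")
```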
Implications and Future Directions
The introduction of TweetEval as a unified evaluation framework has both practical and theoretical implications. Practically, it simplifies benchmarking for new social media models and enables more robust comparisons. Theoretically, it offers insight into the adaptations models need to handle the distinctive characteristics of social media text, guiding the development of more sophisticated domain-specific NLP models.
Moving forward, the authors propose expanding TweetEval with additional tasks that capture the multi-faceted nature of social media NLP, such as multi-label classification and multimodal inputs. They also point to multitask learning, which could exploit the interrelations among tasks to improve generalization and performance.
Overall, TweetEval stands as a significant contribution to the field, providing a structured pathway toward advancing NLP methodologies specifically for the dynamic environment of social media.