SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval) (1903.08983v3)

Published 19 Mar 2019 in cs.CL

Abstract: We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets. It featured three sub-tasks. In sub-task A, the goal was to discriminate between offensive and non-offensive posts. In sub-task B, the focus was on the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, about 800 teams signed up to participate in the task, and 115 of them submitted results, which we present and analyze in this report.

An Analytical Review of "SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)"

The paper "SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)" by Zampieri et al. presents the outcomes and principal conclusions of a high-profile shared task focused on identifying and categorizing offensive language on social media, conducted within the framework of SemEval-2019. This task leveraged the newly constructed Offensive Language Identification Dataset (OLID), containing annotations for over 14,000 English tweets. The task was subdivided into three primary sub-tasks, each tackling distinct facets of offensive content categorization.

Overview of OffensEval Tasks

Sub-task A: Offensive Language Identification

Sub-task A involved binary classification to distinguish between offensive (OFF) and non-offensive (NOT) posts. The evaluation metric was the macro-averaged F1-score, chosen to account for the class imbalance in the data. Out of 104 participating teams, the top-performing system, which used BERT, achieved a macro F1-score of 82.9%.
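
As a point of reference, the macro-averaged F1 used to rank systems can be computed as in the minimal sketch below, which uses scikit-learn on hypothetical gold labels and predictions rather than actual OLID data.

```python
# Minimal sketch of the macro-averaged F1 metric used to rank sub-task A systems.
# The labels below are hypothetical examples, not drawn from OLID.
from sklearn.metrics import f1_score

gold = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]  # gold annotations
pred = ["OFF", "NOT", "OFF", "OFF", "NOT", "NOT"]  # system predictions

# Macro-averaging weights the OFF and NOT classes equally, which matters
# because the data are skewed toward the NOT class.
print(f1_score(gold, pred, average="macro"))
```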

Sub-task B: Categorization of Offense Types

Sub-task B aimed to categorize offensive content as either a targeted insult (TIN) or untargeted profanity (UNT). Notably, the best system was a rule-based model that scored 75.5% macro F1, showing that non-traditional modelling methods can still perform competitively against deep learning counterparts. Ensemble approaches were also prevalent among the top-performing teams, reflecting the complexity of this categorization task.
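
As a rough illustration of the hard-voting ensembles common among the stronger sub-task B teams, the sketch below combines label predictions from several hypothetical component classifiers by majority vote; it shows the general technique rather than any particular team's system.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by most component systems (ties broken arbitrarily)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-system predictions for a single tweet
# (TIN = targeted insult, UNT = untargeted profanity).
system_outputs = ["TIN", "UNT", "TIN"]
print(majority_vote(system_outputs))  # -> TIN
```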

Sub-task C: Target Identification

Sub-task C concentrated on identifying the target of the offense, categorizing it as an individual (IND), a group (GRP), or other (OTH). With 66 participating teams, the best-performing system, again leveraging BERT, achieved a 66% macro F1-score. The mixed performance across systems in this sub-task underscores the difficulty of pinpointing the target of an offensive post, which is often highly context-dependent.
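
For illustration, the sketch below shows the kind of BERT sequence classifier that topped sub-tasks A and C, using the Hugging Face transformers API. The checkpoint, label order, and example tweet are assumptions made for the example, and the classification head is untrained, so its predictions are arbitrary until the model is fine-tuned on OLID.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["IND", "GRP", "OTH"]  # sub-task C target classes (illustrative ordering)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# Encode one hypothetical tweet and predict a target class.
inputs = tokenizer("@USER you people are the worst", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])  # arbitrary until fine-tuned
```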

Analysis of Methodologies and Results

A notable trend among top-performing teams was the diversity of model choices and the widespread use of ensembles. In sub-task A, BERT was the dominant model. For sub-tasks B and C, however, ensembles gained prominence, suggesting that these more nuanced tasks benefited from combining the strengths of several modeling paradigms, including traditional machine learning and state-of-the-art deep learning techniques.

Traditional machine learning classifiers such as SVMs and logistic regression, although less often at the forefront, featured in several ensemble systems. Pre-processing techniques such as token normalization, hashtag segmentation, and emoji substitution were widely used to improve model performance.
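
A small sketch of what such a tweet pre-processing pipeline might look like is given below; the regular expressions and the third-party emoji and wordsegment packages are illustrative choices rather than the configuration of any specific team.

```python
import re

import emoji                            # pip install emoji
from wordsegment import load, segment   # pip install wordsegment

load()  # load word frequencies used for hashtag segmentation

def preprocess(tweet: str) -> str:
    tweet = re.sub(r"https?://\S+", "URL", tweet)    # normalize links to a URL token
    tweet = re.sub(r"@\w+", "@USER", tweet)          # normalize mentions to a @USER token
    tweet = re.sub(r"#(\w+)",                        # split hashtags into words
                   lambda m: " ".join(segment(m.group(1))), tweet)
    tweet = emoji.demojize(tweet, delimiters=(" ", " "))  # replace emoji with textual names
    return re.sub(r"\s+", " ", tweet).strip()

print(preprocess("@someuser this is #SoOffensive 😡 https://t.co/x"))
```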

One notable takeaway was that the effectiveness of deep learning models varied by sub-task: while BERT excelled in sub-tasks A and C, rule-based and ensemble approaches proved more effective in sub-task B.

Implications and Future Directions

Practically, the findings from OffensEval have direct applications in moderating social media content and developing automated systems for offensive language detection, thus alleviating the cognitive burden on human moderators. The diversity in successful approaches highlights that while deep learning models like BERT are powerful, there remains room for traditional and rule-based methods, particularly when integrated into ensemble systems.

Theoretically, this task sheds light on the underlying complexities in identifying and categorizing offensive language. It emphasizes the multifaceted nature of offensive content, suggesting that future research could benefit from exploring more context-aware and hybrid approaches.

Conclusion

OffensEval-2019 proved to be a comprehensive benchmark for evaluating offensive language identification systems, attracting substantial participation and showcasing a variety of successful methods. The results and insights gained present valuable directions for both practical applications and theoretical advancements in natural language processing. Further developments could include expanding the datasets, reducing class imbalance, and addressing misclassifications through more sophisticated contextual embeddings and multi-task learning models. This task reinforces the ongoing need for nuanced and adaptable tools in the ever-evolving landscape of social media communication.

Authors (6)
  1. Marcos Zampieri (94 papers)
  2. Shervin Malmasi (40 papers)
  3. Preslav Nakov (253 papers)
  4. Sara Rosenthal (21 papers)
  5. Noura Farra (6 papers)
  6. Ritesh Kumar (42 papers)
Citations (761)