KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media

Published 26 Jul 2020 in cs.CL | (2007.13184v1)

Abstract: In this paper, we describe our approach to utilize pre-trained BERT models with Convolutional Neural Networks for sub-task A of the Multilingual Offensive Language Identification shared task (OffensEval 2020), which is a part of SemEval 2020. We show that combining CNN with BERT is better than using BERT on its own, and we emphasize the importance of utilizing pre-trained language models for downstream tasks. Our system ranked 4th with a macro-averaged F1-score of 0.897 in Arabic, 4th with a score of 0.843 in Greek, and 3rd with a score of 0.814 in Turkish. Additionally, we present ArabicBERT, a set of pre-trained transformer language models for Arabic that we share with the community.

Citations (297)

Summary

  • The paper demonstrates that combining BERT with CNN significantly improves offensive speech detection through contextual embedding and local feature extraction.
  • The approach outperforms baselines across multiple languages, achieving macro F1-scores of 0.814 (Turkish), 0.843 (Greek), and 0.897 (Arabic).
  • It introduces ArabicBERT to fill resource gaps in Arabic NLP, exemplifying effective strategies for multilingual offensive language identification.

Overview of BERT-CNN for Offensive Speech Identification in Social Media

This essay examines the paper titled "KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media," authored by Ali Safaya, Moutasem Abdullatif, and Deniz Yuret. The paper reports on the integration of Bidirectional Encoder Representations from Transformers (BERT) and Convolutional Neural Networks (CNN) to address the problem of offensive language identification on social media, specifically within the context of the OffensEval 2020 competition.

The study's primary focus was to evaluate BERT's effectiveness in encoding contextual language information when combined with CNN's capability to capture local textual features. The authors participated in OffensEval 2020 Subtask-A, which aimed to identify offensive language in Arabic, Greek, and Turkish tweets. Their proposed BERT-CNN model performed strongly, ranking 3rd to 4th across the three languages with macro-averaged F1-scores of 0.814 (Turkish), 0.843 (Greek), and 0.897 (Arabic).

Data and Methodology

The research utilized tweet datasets annotated for offensive content in multiple languages. For Greek and Turkish, language-specific pre-trained BERT models were employed alongside multilingual BERT (mBERT). Notably, the absence of a pre-trained Arabic BERT model motivated the creation of ArabicBERT, a family of four model variants differentiated by size and trained on extensive corpora including Wikipedia and OSCAR data.
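For illustration, the following is a minimal sketch of loading one such pre-trained model with the Hugging Face transformers library; the hub identifier asafaya/bert-base-arabic and the tokenization settings are assumptions for this example, not details quoted from the paper.

```python
# Minimal sketch: loading a pre-trained Arabic BERT model and tokenizer.
# The hub ID "asafaya/bert-base-arabic" is an assumption for illustration.
from transformers import AutoTokenizer, AutoModel

model_name = "asafaya/bert-base-arabic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = AutoModel.from_pretrained(model_name, output_hidden_states=True)

# Tokenize an example Arabic tweet and obtain contextual embeddings
# from all hidden layers (embeddings + one tensor per transformer layer).
inputs = tokenizer("مثال على تغريدة", return_tensors="pt",
                   truncation=True, max_length=64)
outputs = bert(**inputs)
hidden_states = outputs.hidden_states
```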

Concretely, the BERT-CNN model processes input tweets by first extracting contextualized embeddings from BERT's last four hidden layers. These embeddings are then fed into a CNN with multiple convolutional filters of varying widths, capturing diverse n-gram features. The resulting feature maps are aggregated via a global max-pooling layer before passing through a dense layer with a sigmoid output for classification.
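A condensed PyTorch sketch of this pipeline follows. The filter widths, filter counts, and the choice to concatenate (rather than otherwise combine) the last four hidden layers are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class BertCNN(nn.Module):
    """Sketch of a BERT-CNN classifier: embeddings from the last four BERT
    layers feed parallel convolutions, global max-pooling, and a dense
    sigmoid head. Layer sizes and filter widths are assumptions."""

    def __init__(self, bert, hidden_size=768, n_filters=100,
                 kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.bert = bert  # pre-trained BERT loaded with output_hidden_states=True
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=4 * hidden_size, out_channels=n_filters,
                      kernel_size=k)
            for k in kernel_sizes
        ])
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Combine the last four hidden layers along the feature dimension.
        last_four = torch.cat(outputs.hidden_states[-4:], dim=-1)  # (B, T, 4H)
        x = last_four.permute(0, 2, 1)                             # (B, 4H, T)
        # Each convolution captures a different n-gram width; global max-pool.
        pooled = [torch.relu(conv(x)).amax(dim=-1) for conv in self.convs]
        features = torch.cat(pooled, dim=-1)                       # (B, F * len(ks))
        return torch.sigmoid(self.classifier(features)).squeeze(-1)  # offensive prob.
```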

Experimental Outcomes

The empirical evaluation compared the proposed model to several baselines, including Support Vector Machines (SVM) with Term Frequency-Inverse Document Frequency (TF-IDF) features, a CNN with randomly initialized embeddings, a BiLSTM, and standalone BERT models. The BERT-CNN approach demonstrated superior performance, especially when compared to models without pre-trained embeddings or those using multilingual BERT instead of language-specific variants.
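As a point of reference, here is a minimal sketch of a TF-IDF + linear SVM baseline of the kind used in the comparison, assuming scikit-learn; the n-gram settings and the placeholder variables train_texts, train_labels, and test_texts are illustrative assumptions.

```python
# Minimal sketch of a TF-IDF + linear SVM baseline (assumes scikit-learn).
# train_texts, train_labels, and test_texts are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 3), min_df=2),
    LinearSVC(C=1.0),
)
baseline.fit(train_texts, train_labels)   # binary offensive / not-offensive labels
predictions = baseline.predict(test_texts)
```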

Implications and Future Directions

The research elucidates the interaction between transformer-based embeddings and convolutional networks for nuanced text classification tasks such as offensive language detection. The findings reinforce the proposition that pre-trained language models, when combined with classifiers such as CNNs, can substantially improve performance in multilingual settings.

The introduction of ArabicBERT is a notable contribution, expanding the resources available for Arabic NLP and enabling processing better tailored to the informal, mixed-script text prevalent on social media.

Moving forward, this study prompts further exploration of combining contextual language models with traditional neural architectures. Future work may involve varied architectural designs, different transformer depths and capacities, and extensions of these methods to further languages and dialects. Additionally, deploying such models in real-time settings may require optimizations for computational efficiency and for less powerful hardware.
