KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media (2007.13184v1)

Published 26 Jul 2020 in cs.CL

Abstract: In this paper, we describe our approach to utilizing pre-trained BERT models with Convolutional Neural Networks for sub-task A of the Multilingual Offensive Language Identification shared task (OffensEval 2020), which is part of SemEval 2020. We show that combining CNN with BERT is better than using BERT on its own, and we emphasize the importance of utilizing pre-trained language models for downstream tasks. Our system ranked 4th in Arabic with a macro-averaged F1-score of 0.897, 4th in Greek with a score of 0.843, and 3rd in Turkish with a score of 0.814. Additionally, we present ArabicBERT, a set of pre-trained transformer language models for Arabic that we share with the community.

Overview of BERT-CNN for Offensive Speech Identification in Social Media

This essay examines the paper titled "KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media," authored by Ali Safaya, Moutasem Abdullatif, and Deniz Yuret. The paper reports on the integration of Bidirectional Encoder Representations from Transformers (BERT) and Convolutional Neural Networks (CNN) to address the problem of offensive language identification on social media, specifically within the context of the OffensEval 2020 competition.

The paper's primary focus was to evaluate BERT's effectiveness in encoding contextual language information when combined with a CNN's capability to capture local textual features. The authors participated in OffensEval 2020 Subtask-A, which aimed to identify offensive language in Arabic, Greek, and Turkish tweets. Their proposed BERT-CNN model performed strongly, ranking 3rd or 4th depending on the language, with macro-averaged F1-scores of 0.814 (Turkish), 0.843 (Greek), and 0.897 (Arabic).
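For reference, the macro-averaged F1 used for ranking averages per-class F1 scores, so the minority offensive class counts as much as the majority class. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical): 1 = offensive, 0 = not offensive.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Macro F1 averages the per-class F1 scores: here 0.75 for class 0
# and 0.50 for class 1, giving 0.625 overall.
print(f1_score(y_true, y_pred, average="macro"))  # 0.625
```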

Data and Methodology

The research utilized tweet datasets annotated for offensive content in multiple languages. For Greek and Turkish, language-specific pre-trained BERT models were employed alongside multilingual BERT (mBERT). Notably, the absence of a pre-trained Arabic BERT model motivated the creation of ArabicBERT, a set of four model variants differentiated by size and trained on extensive corpora, including Wikipedia and the OSCAR corpus.
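The ArabicBERT models were shared publicly; assuming the Hugging Face Hub identifiers under the authors' asafaya namespace (e.g., asafaya/bert-base-arabic), loading a variant looks roughly like this:

```python
from transformers import AutoModel, AutoTokenizer

# Base variant; the mini/medium/large variants are assumed to follow
# the same naming pattern under the asafaya namespace.
model_name = "asafaya/bert-base-arabic"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("مثال على تغريدة", return_tensors="pt")
outputs = model(**inputs)
# outputs.hidden_states contains every layer's embeddings, including
# the last four layers used by the BERT-CNN classifier described below.
```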

Concretely, the BERT-CNN model processes input tweets by first extracting contextualized embeddings from BERT's last four hidden layers. These embeddings are then fed into a CNN with convolutional filters of multiple sizes, capturing diverse n-gram features. The resulting feature maps are aggregated via a global max-pooling layer before passing through a dense layer with a sigmoid output for classification.
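A minimal PyTorch sketch of this pipeline follows; the filter sizes, filter count, and the choice of mBERT as the default encoder are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertCNN(nn.Module):
    # BERT encoder followed by parallel convolutions over the
    # concatenated last four hidden layers (sizes are assumptions).
    def __init__(self, model_name="bert-base-multilingual-cased",
                 num_filters=100, kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.bert = AutoModel.from_pretrained(
            model_name, output_hidden_states=True)
        hidden = self.bert.config.hidden_size
        # One Conv1d per n-gram width; inputs are the 4 stacked layers.
        self.convs = nn.ModuleList(
            nn.Conv1d(4 * hidden, num_filters, k) for k in kernel_sizes)
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Concatenate the last four hidden layers on the feature axis,
        # then move features to the channel dimension for Conv1d:
        # (batch, seq, 4*hidden) -> (batch, 4*hidden, seq)
        x = torch.cat(out.hidden_states[-4:], dim=-1).transpose(1, 2)
        # Global max-pooling keeps the strongest activation per filter.
        pooled = [conv(x).relu().max(dim=-1).values for conv in self.convs]
        logits = self.classifier(torch.cat(pooled, dim=-1))
        return torch.sigmoid(logits).squeeze(-1)  # P(offensive)
```

Running the last four layers through filters of several widths lets the classifier pick up both sub-word and phrase-level cues that a single pooled vector can miss.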

Experimental Outcomes

The empirical evaluation compared the proposed model against several baselines, including Support Vector Machines (SVM) with Term Frequency-Inverse Document Frequency (TF-IDF) features, a CNN with randomly initialized embeddings, a BiLSTM, and standalone BERT models. The BERT-CNN approach performed best, with the largest gains over models without pre-trained embeddings and over those using multilingual BERT instead of language-specific variants.
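For context, a TF-IDF + linear SVM baseline of the kind compared against can be assembled in a few lines with scikit-learn; the toy data and n-gram range below are assumptions, not the shared task's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the shared task used annotated tweets.
tweets = ["you are wonderful", "you are an idiot", "have a nice day"]
labels = [0, 1, 0]  # 1 = offensive

# Word-level TF-IDF features feeding a linear SVM.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
baseline.fit(tweets, labels)
print(baseline.predict(["what an idiot"]))
```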

Implications and Future Directions

The research elucidates the interaction between transformer-based embeddings and convolutional networks for nuanced text-classification tasks such as offensive language detection. The findings reinforce the proposition that pre-trained language models, when combined with classifiers such as CNNs, can substantially improve performance in multilingual contexts.

The introduction of ArabicBERT is a notable contribution, expanding the range of available resources for Arabic NLP and enabling processing better tailored to the informal, often mixed-script text common on social media.

Moving forward, this paper prompts further exploration of combining contextual language models with traditional neural architectures. Future work may involve varied architectural designs, experiments with transformer depth and complexity, and extensions of such methodologies to further languages and dialects. Additionally, deploying these models in real-time settings may necessitate optimizations for computational efficiency and for running on less powerful hardware.

Authors (3)
  1. Ali Safaya (8 papers)
  2. Moutasem Abdullatif (1 paper)
  3. Deniz Yuret (26 papers)
Citations (297)