- The paper introduces deep embedding and transformer models that achieve up to 98% accuracy in detecting cyberbullying in Arabic social media content.
- Bi-LSTM with FastText outperforms traditional machine learning by effectively capturing subword nuances in morphologically rich Arabic texts.
- Hybrid models combining sequential LSTM layers with BERT-based contextual embeddings deliver robust performance and faster convergence, making them well suited to real-time moderation.
Introduction
The detection of cyberbullying in Arabic-language social media content presents unique challenges due to the linguistic complexity of Arabic and the scarcity of high-quality annotated datasets. The paper "Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches" (2510.02232) addresses these challenges by constructing a novel dataset of Arabic posts from X (formerly Twitter) and systematically evaluating a range of machine learning, deep learning, and transformer-based models for cyberbullying detection. The paper emphasizes the importance of advanced word embeddings and hybrid architectures, particularly in the context of under-resourced languages.
Dataset Construction and Preprocessing
A dataset of 10,662 Arabic-language posts was collected from X via targeted keyword scraping. Manual annotation achieved high inter-annotator agreement (Cohen's kappa = 0.98), ensuring label reliability, and the class distribution was nearly balanced (53.8% bullying, 46.2% non-bullying). Preprocessing included normalization of Arabic script, removal of diacritics, stopword elimination, and stemming with the Snowball stemmer; non-Arabic characters, symbols, and duplicates were removed to reduce noise. Feature extraction used TF-IDF alongside pre-trained Arabic word2vec, GloVe, and FastText embeddings, with vocabulary-size and sequence-length limits to keep computation tractable.
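A minimal sketch of such a preprocessing pipeline, assuming NLTK's Arabic stopword list and Snowball stemmer; the specific normalization rules below are illustrative, not the authors' released code:

```python
import re

from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

# Arabic diacritics (harakat), superscript alef, and the tatweel character.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)
    text = re.sub("[إأآا]", "ا", text)               # unify alef variants
    text = re.sub("ى", "ي", text)                     # alef maqsura -> ya
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)  # drop non-Arabic symbols
    return re.sub(r"\s+", " ", text).strip()

STEMMER = SnowballStemmer("arabic")
STOPWORDS = set(stopwords.words("arabic"))

def preprocess(text: str) -> list[str]:
    tokens = normalize(text).split()
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]
```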
Model Architectures and Experimental Setup
The paper systematically compared:
- Baseline machine learning models: SVM, RF, KNN, LR, DT, using TF-IDF features.
- Deep learning models: LSTM and Bi-LSTM, each evaluated with TF-IDF, Araword2Vec, GloVe, and FastText embeddings.
- Transformer models: BERT variants pre-trained on Arabic (Arabertv02, CAMeL-da, CAMeL-mix).
- Hybrid models: LSTM-BERT and Bi-LSTM-BERT, integrating sequence modeling with transformer-based contextual embeddings (a sketch of this hybrid follows the list).
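The paper does not publish its architecture code; the following is a hedged sketch of what a Bi-LSTM-BERT hybrid of this kind could look like in Keras with Hugging Face Transformers. The LSTM width, dropout rate, and learning rate are assumptions, and the checkpoint name is the public CAMeLBERT-DA model assumed to correspond to the paper's CAMeL-da:

```python
import tensorflow as tf
from transformers import TFAutoModel

def build_bilstm_bert(model_name: str = "CAMeL-Lab/bert-base-arabic-camelbert-da",
                      max_len: int = 128) -> tf.keras.Model:
    """Hypothetical Bi-LSTM-BERT hybrid: BERT token states feed a Bi-LSTM head."""
    bert = TFAutoModel.from_pretrained(model_name)
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32,
                                    name="attention_mask")
    # Per-token contextual embeddings from BERT's last layer.
    hidden = bert(input_ids, attention_mask=attention_mask).last_hidden_state
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(hidden)
    x = tf.keras.layers.Dropout(0.3)(x)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # bullying prob.
    model = tf.keras.Model([input_ids, attention_mask], output)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```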
All models were implemented in Python with Keras and Hugging Face Transformers and trained on Google Colab Pro with GPU acceleration. The dataset was split 80/20 into training and test sets, and models were evaluated on accuracy, precision, recall, F1-score, and AUC.
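Under these settings, evaluation reduces to a standard split-and-score routine. A sketch with scikit-learn, where `texts`, `labels`, and the 0.5 decision threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# 80/20 split, stratified to preserve the bullying/non-bullying ratio.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

def evaluate(y_true, y_prob, threshold: float = 0.5) -> dict:
    """Compute the paper's five metrics from predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_prob),  # AUC needs raw scores
    }
```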
Results and Analysis
Baseline and Deep Learning Models
- SVM achieved the highest accuracy among classical models (88%), but was outperformed by deep learning approaches.
- LSTM/Bi-LSTM with TF-IDF underperformed (F1 < 0.73), indicating the limitations of sparse, context-agnostic features for Arabic.
- LSTM/Bi-LSTM with Araword2Vec and GloVe showed moderate improvements, but performance was inconsistent, likely due to coverage and quality of pre-trained embeddings.
- Bi-LSTM with FastText achieved the highest accuracy (98%) and F1-score (0.98), outperforming all other models. FastText's subword modeling is particularly effective for morphologically rich languages such as Arabic, capturing out-of-vocabulary words and dialectal variation (see the FastText sketch after this list).
- BERT (CAMeL-da) reached 97% accuracy, demonstrating the effectiveness of transformer-based contextual embeddings for Arabic cyberbullying detection.
- Hybrid Bi-LSTM-BERT (CAMeL-da) matched BERT's performance (97% accuracy), with marginal improvements in F1-score and training efficiency. The hybrid approach leverages both sequential modeling and deep contextual representations.
- BERT (Arabertv02) and its hybrid variants performed slightly lower (95–96% accuracy), suggesting that pre-training corpus and dialectal coverage are critical for optimal performance.
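To see why subword modeling matters here: FastText represents each word as a bag of character n-grams, so a vector can be composed even for spellings never seen in training. A sketch with gensim, where the hyperparameters are assumptions and `corpus_tokens` would come from the preprocessing sketch earlier:

```python
from gensim.models import FastText

# corpus_tokens: list of token lists produced by the preprocessing step above.
ft = FastText(corpus_tokens, vector_size=300, window=5, min_count=2,
              min_n=3, max_n=6, epochs=10)

# A word absent from the training vocabulary still gets a vector, composed
# from its character n-grams -- the property that helps with dialectal and
# out-of-vocabulary variation.
oov_vector = ft.wv["كلمة_غير_موجودة"]
```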
Comparative Insights
- Bi-LSTM-FastText and Bi-LSTM-BERT (CAMeL-da) are the top-performing models, both reaching at least 97% accuracy with F1-scores of 0.97 or higher.
- Hybrid models (LSTM/Bi-LSTM-BERT) offer a trade-off between model complexity and performance; early stopping yielded faster convergence during training (see the callback sketch after this list).
- Classical machine learning approaches are consistently outperformed by deep and transformer-based models, especially when leveraging high-quality embeddings.
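The faster convergence noted above is typically realized with an early-stopping callback. A sketch assuming the hypothetical Keras hybrid from earlier; `train_inputs` and `train_labels` are placeholders for the tokenized training set:

```python
import tensorflow as tf

# Stop when validation loss plateaus and keep the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

model = build_bilstm_bert()
model.fit(train_inputs, train_labels,
          validation_split=0.1, epochs=20, batch_size=32,
          callbacks=[early_stop])
```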
Implementation Considerations
- Resource Requirements: Training Bi-LSTM-FastText and BERT-based models requires GPU acceleration and substantial memory, especially for large embedding matrices and transformer fine-tuning.
- Data Quality: High annotation agreement and rigorous preprocessing are essential for robust model performance, particularly in low-resource language settings.
- Embedding Selection: FastText is highly effective for Arabic due to its subword modeling; however, transformer-based models pre-trained on in-domain and dialectal data (e.g., CAMeL-da) are competitive and more adaptable to context.
- Deployment: The proposed models can be integrated into real-time moderation pipelines for social media platforms. Bi-LSTM-FastText offers a balance of accuracy and computational efficiency, while BERT-based models provide stronger contextual understanding at higher computational cost; an inference sketch follows this list.
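As a concrete illustration of such a pipeline, a hedged inference sketch for the hypothetical hybrid model above; the tokenizer checkpoint, threshold, and batch handling are assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "CAMeL-Lab/bert-base-arabic-camelbert-da")

def moderate(posts: list[str], model, max_len: int = 128,
             threshold: float = 0.5) -> list[tuple[str, float, bool]]:
    """Score a batch of posts and flag those above the bullying threshold."""
    enc = tokenizer(posts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    probs = model.predict(
        {"input_ids": enc["input_ids"],
         "attention_mask": enc["attention_mask"]}).ravel()
    return [(post, float(p), p >= threshold) for post, p in zip(posts, probs)]
```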
Implications and Future Directions
The paper demonstrates that advanced deep learning and transformer-based models, when combined with appropriate embeddings, can achieve high accuracy in Arabic cyberbullying detection, surpassing traditional machine learning approaches. The results highlight the importance of subword-aware embeddings and domain-specific pre-training for morphologically complex languages.
Future research should focus on:
- Expanding datasets to cover more dialects and platforms (e.g., WhatsApp, Facebook).
- Exploring additional hybrid architectures (e.g., BiGRU-BERT) and ensemble methods.
- Investigating cross-lingual transfer and domain adaptation to further improve generalization.
- Addressing real-world deployment challenges, such as adversarial robustness, explainability, and privacy-preserving inference.
Conclusion
This work provides a comprehensive evaluation of deep embedding and transformer-based approaches for Arabic-language cyberbullying detection. The Bi-LSTM-FastText and Bi-LSTM-BERT (CAMeL-da) models achieve state-of-the-art performance, with 98% and 97% accuracy, respectively. The findings underscore the necessity of leveraging both subword-level and contextual embeddings for effective detection in morphologically rich, under-resourced languages. The methodologies and insights presented are directly applicable to the development of automated moderation tools for Arabic social media, with broader implications for multilingual and cross-dialectal NLP tasks.