- The paper introduces deep embedding and transformer models that achieve up to 98% accuracy in detecting cyberbullying in Arabic social media content.
- Bi-LSTM with FastText outperforms traditional machine learning by effectively capturing subword nuances in morphologically rich Arabic texts.
- Hybrid models combining sequential LSTM layers with BERT-based contextual embeddings deliver robust performance and faster convergence, making them well suited to real-time moderation.
Introduction
The detection of cyberbullying in Arabic-language social media content presents unique challenges due to the linguistic complexity of Arabic and the scarcity of high-quality annotated datasets. The paper "Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches" (2510.02232) addresses these challenges by constructing a novel dataset of Arabic posts from X (formerly Twitter) and systematically evaluating a range of machine learning, deep learning, and transformer-based models for cyberbullying detection. The paper emphasizes the importance of advanced word embeddings and hybrid architectures, particularly in the context of under-resourced languages.
Dataset Construction and Preprocessing
A dataset of 10,662 Arabic-language posts was collected from X via targeted keyword scraping. Manual annotation achieved high inter-annotator agreement (Cohen's kappa = 0.98), ensuring label reliability, and the class distribution was nearly balanced (53.8% bullying, 46.2% non-bullying). Preprocessing included normalization of Arabic script, removal of diacritics, stopword elimination, and stemming with the Snowball stemmer; non-Arabic characters, symbols, and duplicates were removed to reduce noise. Feature extraction used TF-IDF alongside pre-trained Arabic word2vec, GloVe, and FastText embeddings, with vocabulary-size and sequence-length limits to keep computation tractable.
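A minimal sketch of such a preprocessing pipeline, assuming NLTK's Arabic stopword list and Snowball stemmer; the specific normalization rules below are illustrative, not the authors' released code:

```python
import re

from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

# Arabic diacritics (harakat), superscript alef, and the tatweel character.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)
    text = re.sub("[إأآا]", "ا", text)               # unify alef variants
    text = re.sub("ى", "ي", text)                     # alef maqsura -> ya
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)  # drop non-Arabic symbols
    return re.sub(r"\s+", " ", text).strip()

STEMMER = SnowballStemmer("arabic")
STOPWORDS = set(stopwords.words("arabic"))

def preprocess(text: str) -> list[str]:
    tokens = normalize(text).split()
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]
```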
Model Architectures and Experimental Setup
The paper systematically compared:
- Baseline machine learning models: SVM, RF, KNN, LR, DT, using TF-IDF features.
- Deep learning models: LSTM and Bi-LSTM, each evaluated with TF-IDF, Araword2Vec, GloVe, and FastText embeddings.
- Transformer models: BERT variants pre-trained on Arabic (Arabertv02, CAMeL-da, CAMeL-mix).
- Hybrid models: LSTM-BERT and Bi-LSTM-BERT, integrating sequence modeling with transformer-based contextual embeddings (a sketch of this hybrid follows the list).
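The paper does not publish its architecture code; the following is a hedged sketch of what a Bi-LSTM-BERT hybrid of this kind could look like in Keras with Hugging Face Transformers. The LSTM width, dropout rate, and learning rate are assumptions, and the checkpoint name is the public CAMeLBERT-DA model assumed to correspond to the paper's CAMeL-da:

```python
import tensorflow as tf
from transformers import TFAutoModel

def build_bilstm_bert(model_name: str = "CAMeL-Lab/bert-base-arabic-camelbert-da",
                      max_len: int = 128) -> tf.keras.Model:
    """Hypothetical Bi-LSTM-BERT hybrid: BERT token states feed a Bi-LSTM head."""
    bert = TFAutoModel.from_pretrained(model_name)
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32,
                                    name="attention_mask")
    # Per-token contextual embeddings from BERT's last layer.
    hidden = bert(input_ids, attention_mask=attention_mask).last_hidden_state
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(hidden)
    x = tf.keras.layers.Dropout(0.3)(x)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # bullying prob.
    model = tf.keras.Model([input_ids, attention_mask], output)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```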
All models were implemented in Python with Keras and Hugging Face Transformers and trained on Google Colab Pro with GPU acceleration. The dataset was split 80/20 into training and test sets, and models were evaluated on accuracy, precision, recall, F1-score, and AUC.
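Under these settings, evaluation reduces to a standard split-and-score routine. A sketch with scikit-learn, where `texts`, `labels`, and the 0.5 decision threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# 80/20 split, stratified to preserve the bullying/non-bullying ratio.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

def evaluate(y_true, y_prob, threshold: float = 0.5) -> dict:
    """Compute the paper's five metrics from predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_prob),  # AUC needs raw scores
    }
```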
Results and Analysis
Baseline and Deep Learning Models
- SVM achieved the highest accuracy among classical models (88%), but was outperformed by deep learning approaches.
- LSTM/Bi-LSTM with TF-IDF underperformed (F1 < 0.73), indicating the limitations of sparse, context-agnostic features for Arabic.
- LSTM/Bi-LSTM with Araword2Vec and GloVe showed moderate improvements, but performance was inconsistent, likely due to coverage and quality of pre-trained embeddings.
- Bi-LSTM with FastText achieved the highest accuracy (98%) and F1-score (0.98), outperforming all other models. FastText's subword modeling is particularly effective for morphologically rich languages such as Arabic, capturing out-of-vocabulary words and dialectal variation (see the FastText sketch after this list).
- BERT (CAMeL-da) reached 97% accuracy, demonstrating the effectiveness of transformer-based contextual embeddings for Arabic cyberbullying detection.
- Hybrid Bi-LSTM-BERT (CAMeL-da) matched BERT's performance (97% accuracy), with marginal improvements in F1-score and training efficiency. The hybrid approach leverages both sequential modeling and deep contextual representations.
- BERT (Arabertv02) and its hybrid variants performed slightly lower (95–96% accuracy), suggesting that pre-training corpus and dialectal coverage are critical for optimal performance.
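To see why subword modeling matters here: FastText represents each word as a bag of character n-grams, so a vector can be composed even for spellings never seen in training. A sketch with gensim, where the hyperparameters are assumptions and `corpus_tokens` would come from the preprocessing sketch earlier:

```python
from gensim.models import FastText

# corpus_tokens: list of token lists produced by the preprocessing step above.
ft = FastText(corpus_tokens, vector_size=300, window=5, min_count=2,
              min_n=3, max_n=6, epochs=10)

# A word absent from the training vocabulary still gets a vector, composed
# from its character n-grams -- the property that helps with dialectal and
# out-of-vocabulary variation.
oov_vector = ft.wv["كلمة_غير_موجودة"]
```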
Comparative Insights
- Bi-LSTM-FastText and Bi-LSTM-BERT (CAMeL-da) are the top-performing models, both reaching at least 97% accuracy with F1-scores of 0.97 or higher.
- Hybrid models (LSTM/Bi-LSTM-BERT) offer a trade-off between model complexity and performance; early stopping yielded faster convergence during training (see the callback sketch after this list).
- Classical machine learning approaches are consistently outperformed by deep and transformer-based models, especially when leveraging high-quality embeddings.
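The faster convergence noted above is typically realized with an early-stopping callback. A sketch assuming the hypothetical Keras hybrid from earlier; `train_inputs` and `train_labels` are placeholders for the tokenized training set:

```python
import tensorflow as tf

# Stop when validation loss plateaus and keep the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

model = build_bilstm_bert()
model.fit(train_inputs, train_labels,
          validation_split=0.1, epochs=20, batch_size=32,
          callbacks=[early_stop])
```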
Implementation Considerations
- Resource Requirements: Training Bi-LSTM-FastText and BERT-based models requires GPU acceleration and substantial memory, especially for large embedding matrices and transformer fine-tuning.
- Data Quality: High annotation agreement and rigorous preprocessing are essential for robust model performance, particularly in low-resource language settings.
- Embedding Selection: FastText is highly effective for Arabic due to its subword modeling; however, transformer-based models pre-trained on in-domain and dialectal data (e.g., CAMeL-da) are competitive and more adaptable to context.
- Deployment: The proposed models can be integrated into real-time moderation pipelines for social media platforms. Bi-LSTM-FastText offers a balance of accuracy and computational efficiency, while BERT-based models provide stronger contextual understanding at higher computational cost; an inference sketch follows this list.
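As a concrete illustration of such a pipeline, a hedged inference sketch for the hypothetical hybrid model above; the tokenizer checkpoint, threshold, and batch handling are assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "CAMeL-Lab/bert-base-arabic-camelbert-da")

def moderate(posts: list[str], model, max_len: int = 128,
             threshold: float = 0.5) -> list[tuple[str, float, bool]]:
    """Score a batch of posts and flag those above the bullying threshold."""
    enc = tokenizer(posts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    probs = model.predict(
        {"input_ids": enc["input_ids"],
         "attention_mask": enc["attention_mask"]}).ravel()
    return [(post, float(p), p >= threshold) for post, p in zip(posts, probs)]
```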
Implications and Future Directions
The paper demonstrates that advanced deep learning and transformer-based models, when combined with appropriate embeddings, can achieve high accuracy in Arabic cyberbullying detection, surpassing traditional machine learning approaches. The results highlight the importance of subword-aware embeddings and domain-specific pre-training for morphologically complex languages.
Future research should focus on:
- Expanding datasets to cover more dialects and platforms (e.g., WhatsApp, Facebook).
- Exploring additional hybrid architectures (e.g., BiGRU-BERT) and ensemble methods.
- Investigating cross-lingual transfer and domain adaptation to further improve generalization.
- Addressing real-world deployment challenges, such as adversarial robustness, explainability, and privacy-preserving inference.
Conclusion
This work provides a comprehensive evaluation of deep embedding and transformer-based approaches for Arabic-language cyberbullying detection. The Bi-LSTM-FastText and Bi-LSTM-BERT (CAMeL-da) models achieve state-of-the-art performance, with 98% and 97% accuracy, respectively. The findings underscore the necessity of leveraging both subword-level and contextual embeddings for effective detection in morphologically rich, under-resourced languages. The methodologies and insights presented are directly applicable to the development of automated moderation tools for Arabic social media, with broader implications for multilingual and cross-dialectal NLP tasks.