AraBERT: Transformer-Based Model for Arabic Language Understanding
The paper by Wissam Antoun, Fady Baly, and Hazem Hajj introduces AraBERT, a transformer-based model designed specifically for Arabic to handle NLP tasks such as Sentiment Analysis (SA), Named Entity Recognition (NER), and Question Answering (QA). The work is motivated by the particular challenges Arabic poses, including its morphological richness and the scarcity of large-scale Arabic datasets.
Background and Motivation
The surge in transformer-based models has significantly improved performance on English NLP tasks, setting new benchmarks. Their application to other languages, particularly Arabic, has been limited, however, owing both to morphological and syntactic differences from English and to the lack of large training corpora. Multilingual models, despite their broad coverage, often underperform monolingual models trained on large language-specific datasets. This paper aims to bridge that gap by developing AraBERT, a BERT-based model pre-trained on a large Arabic corpus.
Methodology
AraBERT uses the BERT-base architecture: 12 encoder blocks, a hidden size of 768, and 12 attention heads, totaling approximately 110 million parameters (a configuration sketch follows the list below). The key steps in developing AraBERT include:
- Pre-training Dataset: The authors compiled a dataset of roughly 70 million sentences (≈24GB of text) from publicly available Arabic sources, including Arabic Wikipedia, the 1.5 Billion Words Arabic Corpus, and the OSIAN news corpus.
- Preprocessing: Arabic's rich morphology leads to high lexical sparsity. To address this, the authors used the Farasa segmenter to split words into stems, prefixes, and suffixes, and then trained SentencePiece on the segmented text to build a 64k-token vocabulary (see the preprocessing sketch after this list).
- Training: Pre-training followed the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives. Training ran on a TPUv2-8 pod for 1,250,000 steps, with the early steps using a shorter sequence length to speed up training.
- Fine-tuning: Fine-tuning was carried out independently for each downstream task: sequence classification for SA, token classification for NER using the IOB2 tagging format, and span prediction for QA (see the fine-tuning sketch after this list).
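The hyperparameters above can be made concrete with a short configuration sketch. This is not the authors' code; it assumes the Hugging Face transformers library and simply instantiates a BERT-base-shaped model with AraBERT's 64k-token vocabulary.

```python
from transformers import BertConfig, BertForMaskedLM

# BERT-base shape as described in the paper, with AraBERT's larger vocabulary.
config = BertConfig(
    vocab_size=64000,             # SentencePiece vocabulary size
    hidden_size=768,              # hidden dimension
    num_hidden_layers=12,         # encoder blocks
    num_attention_heads=12,       # attention heads
    intermediate_size=3072,       # standard BERT-base feed-forward size
    max_position_embeddings=512,  # maximum sequence length
)

model = BertForMaskedLM(config)
# The count comes out above the canonical ~110M BERT-base figure because the
# 64k-token vocabulary enlarges the embedding matrix.
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```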
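The preprocessing step can be sketched as follows. The farasapy and sentencepiece Python packages and the file names are assumptions for illustration; the authors' exact scripts may differ.

```python
import sentencepiece as spm
from farasa.segmenter import FarasaSegmenter  # farasapy package

segmenter = FarasaSegmenter()

# Split each word into stem, prefixes, and suffixes before subword training.
with open("arabic_corpus.txt", encoding="utf-8") as src, \
     open("arabic_corpus_segmented.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(segmenter.segment(line.strip()) + "\n")

# Train a SentencePiece model with a 64k-token vocabulary on the segmented text.
spm.SentencePieceTrainer.train(
    input="arabic_corpus_segmented.txt",
    model_prefix="arabert_sp",
    vocab_size=64000,
)
```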
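Fine-tuning for NER can be sketched as a token-classification head on top of the pre-trained encoder, trained with cross-entropy over IOB2 tags. The Hub identifier, label set, and the single toy training step below are illustrative assumptions, not the paper's setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # IOB2 scheme
model_name = "aubmindlab/bert-base-arabert"  # assumed Hugging Face Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One toy training step; real fine-tuning iterates over ANERcorp, aligns
# word-level IOB2 tags to subword tokens, and masks out special tokens.
enc = tokenizer("ولد جبران خليل جبران في بشري", return_tensors="pt")
tags = torch.zeros_like(enc["input_ids"])  # every token tagged "O" (index 0) for simplicity
loss = model(**enc, labels=tags).loss      # cross-entropy over the tag sequence
loss.backward()
optimizer.step()
```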
Evaluation
AraBERT's performance was evaluated on multiple Arabic NLP tasks, comparing its results with those of previous state-of-the-art models and Google’s multilingual BERT (mBERT). The results demonstrated AraBERT's superior performance across the board:
Sentiment Analysis
AraBERT outperformed mBERT and existing models on several datasets, including HARD, ASTD, LABR, ArSenTD-Lev, and AJGT, achieving accuracy improvements ranging from 0.5% to over 9%.
Named Entity Recognition
AraBERT set a new state-of-the-art on the ANERcorp dataset, with a macro-F1 score of 84.2, surpassing the previous best Bi-LSTM-CRF model.
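For context, entity-level macro-F1 over IOB2 tag sequences can be computed with the seqeval library, as in the toy illustration below; this is not the paper's evaluation script.

```python
from seqeval.metrics import f1_score

# Toy gold and predicted IOB2 tag sequences for two sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O", "O"]]

# Macro averaging gives each entity type (PER, LOC, ORG) equal weight.
print(f1_score(y_true, y_pred, average="macro"))
```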
Question Answering
For QA, evaluated on the ARCD dataset, AraBERT improved both the F1 and exact-match scores, demonstrating a more nuanced understanding of the language.
Implications and Future Work
AraBERT's success highlights the advantages of pre-training monolingual models on large, language-specific corpora. The results also underline the importance of language-specific preprocessing, such as the segmentation applied here, for mitigating lexical sparsity and improving model performance. AraBERT's potential applications span Arabic SA, NER, and QA, facilitating better automated text understanding, classification, and information extraction in Arabic.
Looking ahead, the authors aim to develop future versions of AraBERT that do not rely on external tokenizers and extend the model’s adaptability to various Arabic dialects. These advancements are expected to further enhance the model's capability and utility in both academic research and industrial applications involving Arabic NLP tasks.
By publicly releasing AraBERT, the authors hope to foster further research and development in Arabic NLP, establishing a new benchmark for language models in Arabic, similar to the impact BERT has had on English NLP.
Conclusion
The development of AraBERT marks a significant advancement in Arabic NLP. In contrast to general multilingual models, AraBERT's monolingual focus and large language-specific training corpus underscore the value of tailored language models for achieving optimal performance. The model's success sets a precedent for future efforts to build specialized NLP models for other underrepresented languages.