AraGPT2: Arabic Transformer Model

Updated 3 July 2026

AraGPT2 is an Arabic-centric decoder-only Transformer model built for high-fidelity natural language generation and adaptable to low-resource domains.
It is pre-trained on diverse modern and classical Arabic corpora and can be compressed after training with quantization and unstructured L₁ pruning for efficient inference.
AraGPT2 demonstrates robust performance across tasks like medical dialogue, sign language translation, and synthetic news generation, with strategies to reduce hallucinations.

AraGPT2 is a family of Arabic-centric, decoder-only Transformer LLMs architecturally derived from GPT-2, purpose-built for high-fidelity Arabic natural language generation, text understanding, and low-resource domain adaptation. Trained from scratch on diverse modern and classical Arabic corpora, AraGPT2 spans multiple parameter regimes (135 M to 1.46 B) and underpins a range of research benchmarks in text generation, medical dialogue, and sign language translation, while also serving as a foundation for compression, adaptation, and hallucination reduction strategies in low-resource NLP tasks (Antoun et al., 2020, Alshehhi et al., 25 Jul 2025, Allam et al., 12 Sep 2025, Johnny et al., 26 Jul 2025).

1. Model Architecture and Variants

AraGPT2 uses a GPT-2–style, decoder-only Transformer design composed of multi-head self-attention layers, position-wise feed-forward networks, and residual and layer-normalization paths in GROVER ordering. The standard configurations of AraGPT2-base and its larger variants are:

Variant	Layers (L)	Hidden Dim (d_model)	Heads (H)	Params
AraGPT2-base	12	768	12	~135 M
AraGPT2-medium	24	1024	16	~370 M
AraGPT2-large	36	1280	20	~792 M
AraGPT2-mega	48	1536	24	~1.46 B
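The parameter counts in the table can be roughly cross-checked from the layer count and hidden dimension alone. The sketch below is a back-of-the-envelope estimate, not the authors' accounting: it assumes a 64,000-token vocabulary, 1024 learned positions, tied input/output embeddings, and a feed-forward inner dimension of 4·d_model, and it ignores biases and LayerNorm parameters. The function name `approx_params` is hypothetical.

```python
# Rough parameter count for a GPT-2-style decoder-only Transformer.
# Assumptions (not confirmed by the table): vocab of 64,000 tokens,
# 1024 learned positions, tied embeddings, FFN inner size 4 * d_model.
# Biases and LayerNorm weights are omitted because they are comparatively tiny.

def approx_params(n_layers: int, d_model: int,
                  vocab_size: int = 64_000, n_positions: int = 1024) -> int:
    # Per block: 4*d^2 for attention (Q, K, V, output projection)
    # plus 8*d^2 for the two feed-forward matrices.
    per_block = 12 * d_model * d_model
    embeddings = (vocab_size + n_positions) * d_model
    return n_layers * per_block + embeddings

VARIANTS = {
    "AraGPT2-base": (12, 768),
    "AraGPT2-medium": (24, 1024),
    "AraGPT2-large": (36, 1280),
    "AraGPT2-mega": (48, 1536),
}

for name, (layers, width) in VARIANTS.items():
    print(f"{name}: ~{approx_params(layers, width) / 1e6:.0f} M")
```

Under these assumptions the estimates land within about 1% of the tabulated figures, which suggests the table counts the embedding matrix once.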

The tokenizer uses byte-level BPE (64 K vocabulary for classic models; 50 K for compressed variants), with explicit support for Arabic morphology, diacritics, and script-specific artifacts such as zero-width joiners. Pretraining follows standard autoregressive next-token prediction, with the objective

$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t})$

AraGPT2 models support sequences up to 1024 tokens (Antoun et al., 2020, Alshehhi et al., 25 Jul 2025, Johnny et al., 26 Jul 2025).
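As an illustration of the next-token objective, the sketch below evaluates $\mathcal{L}_{\text{LM}}$ for one sequence from per-step logits. It is a minimal pure-Python version for clarity only; the helper names `log_softmax` and `lm_loss` are hypothetical, and a real training loop would use a batched, framework-level cross-entropy.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def lm_loss(step_logits, targets):
    """Return -sum_t log P(y_t | y_<t).

    step_logits[t] holds the model's logits at position t (already conditioned
    on the prefix y_<t); targets[t] is the index of the true next token y_t.
    """
    if len(step_logits) != len(targets):
        raise ValueError("one logit vector is required per target token")
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, targets))

# Toy example: a 4-token vocabulary and a 3-token sequence.
toy_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.0, 0.0], [0.2, 0.2, 0.2, 0.2]]
print(lm_loss(toy_logits, targets=[0, 1, 3]))
```

With uniform logits every step contributes $\log V$, so the loss for $T$ tokens is $T \log V$, which is a convenient sanity check.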

2. Pretraining and Data Sources

Pretraining draws on diverse corpora: Arabic Wikipedia, newswire (OSCAR, Gigaword), web-crawled data, public forums, and social media, totaling between 77 GB (classic) and 200 GB (expanded training) of raw text. The data is aggressively filtered: documents with fewer than 3 sentences or more than 20% repetition are removed; emails, URLs, and mentions are normalized and punctuation is cleaned; diacritics are stripped except for dialectal modeling; and <|endoftext|> tokens separate documents. SentencePiece or byte-level BPE produces tokenizers adapted to Arabic morphology and script. Optimizers include large-batch LAMB, AdamW, and Adafactor (Antoun et al., 2020, Alshehhi et al., 25 Jul 2025, Allam et al., 12 Sep 2025).
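The filtering rules above can be sketched as follows. This is an illustrative approximation, not the published preprocessing code: the placeholder tokens (`[URL]`, `[EMAIL]`, `[USER]`), the sentence splitter, and the definition of "repetition" as the fraction of duplicated sentences are all assumptions made for this example.

```python
import re

# All patterns and placeholder tokens below are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")   # Arabic tashkeel marks
SENTENCE_SPLIT_RE = re.compile(r"[.!?\u061F\n]+")      # includes Arabic question mark

def normalize(text: str) -> str:
    """Replace URLs, emails, and mentions with placeholder tokens."""
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)   # emails before mentions: both contain '@'
    return MENTION_RE.sub("[USER]", text)

def strip_diacritics(text: str) -> str:
    """Remove short-vowel and related marks (skipped for dialectal modeling)."""
    return DIACRITICS_RE.sub("", text)

def split_sentences(text: str) -> list:
    return [s.strip() for s in SENTENCE_SPLIT_RE.split(text) if s.strip()]

def repetition_ratio(sentences: list) -> float:
    """Assumed definition: share of sentences that duplicate another sentence."""
    if not sentences:
        return 0.0
    return 1 - len(set(sentences)) / len(sentences)

def keep_document(text: str, min_sentences: int = 3,
                  max_repetition: float = 0.20) -> bool:
    sentences = split_sentences(normalize(text))
    return (len(sentences) >= min_sentences
            and repetition_ratio(sentences) <= max_repetition)

print(keep_document("One. Two."))                     # too short
print(keep_document("Spam. Spam. Spam. Ham. Eggs."))  # too repetitive
```

Normalizing before sentence splitting keeps the dots inside URLs and email addresses from being counted as sentence boundaries.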

3. Model Compression and Adaptation

Two post-hoc compression methods have been applied to AraGPT2:

Quantization: Post-training mapping of weights to low-bit-width (8-bit, 4-bit) integers. 8-bit quantization halves the memory footprint with negligible (<1 pp) accuracy loss; 4-bit halves it again but introduces a 1–3 pp drop. With scale $S$ and zero-point $Z$, a weight $w$ is quantized and dequantized as:

$q = \operatorname{round}(w/S) + Z$

$\hat{w} = S \cdot (q - Z)$

Unstructured L₁ pruning: Magnitude-based sparsification of up to 80% of the weights. Sparsity up to 20% costs <0.5 pp accuracy; beyond 40%, performance falls sharply. With magnitude threshold $\tau$, the pruning rule is:

$w_{ij} = \begin{cases} 0 & \text{if } |w_{ij}| < \tau \\ w_{ij} & \text{otherwise} \end{cases}$

Quantized weights substantially speed up matrix multiplications at inference; unstructured pruning, however, yields less regular memory access (Alshehhi et al., 25 Jul 2025).
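Both rules can be tried on a toy weight vector. The sketch below is a minimal pure-Python illustration under stated assumptions: the scale and zero-point are derived from the min/max of the tensor (one common calibration choice, not necessarily the one used in the cited work), and the pruning threshold $\tau$ is chosen as the magnitude quantile that yields the requested sparsity. All function names are hypothetical.

```python
def quantize(weights, bits=8):
    """Affine quantization: q = round(w / S) + Z, clipped to the signed range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    w_min, w_max = min(weights), max(weights)
    # Assumption: min/max calibration; fall back to 1.0 for constant tensors.
    scale = (w_max - w_min) / (qmax - qmin) or 1.0
    zero_point = qmin - round(w_min / scale)
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Reconstruction: w_hat = S * (q - Z)."""
    return [scale * (v - zero_point) for v in q]

def prune_by_magnitude(weights, sparsity):
    """Zero every weight with |w| < tau, with tau set from the target sparsity."""
    k = int(len(weights) * sparsity)          # number of weights to remove
    magnitudes = sorted(abs(w) for w in weights)
    tau = magnitudes[k] if k < len(weights) else float("inf")
    return [0.0 if abs(w) < tau else w for w in weights]

weights = [-0.62, -0.11, 0.0, 0.07, 0.35, 0.9]
for bits in (8, 4):
    q, s, z = quantize(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, dequantize(q, s, z)))
    print(f"{bits}-bit: scale={s:.4f}, max reconstruction error={err:.4f}")
print(prune_by_magnitude(weights, sparsity=0.5))
```

As long as no value is clipped, the reconstruction error of each weight is bounded by $S/2$, so the coarser 4-bit grid gives a proportionally larger worst-case error than the 8-bit grid.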

4. Evaluation: Language Modeling, Downstream Tasks, Hallucination

Intrinsic Metrics

AraGPT2-mega achieves a perplexity of 29.8 on held-out Arabic Wikipedia, a sharp improvement over earlier n-gram (PPL 430+) and RNN (PPL 480+) models (Antoun et al., 2020).
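Perplexity is the exponential of the mean per-token negative log-likelihood, so the figures above can be read as average losses. The sketch below shows the conversion in both directions; it assumes natural-log losses, and the function names are hypothetical.

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood), with NLLs in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def mean_nll_from_perplexity(ppl):
    """Inverse mapping: average nats per token implied by a perplexity."""
    return math.log(ppl)

# A model that is uniformly unsure among 4 tokens has perplexity 4.
print(perplexity([math.log(4)] * 10))

# Average loss implied by the perplexities quoted above.
for name, ppl in [("AraGPT2-mega", 29.8), ("n-gram", 430.0), ("RNN", 480.0)]:
    print(f"{name}: {mean_nll_from_perplexity(ppl):.2f} nats/token")
```

A perplexity of 29.8 corresponds to about 3.39 nats per token. Perplexities are only comparable across models that share a tokenization, which is worth keeping in mind when reading the n-gram and RNN baselines.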

Downstream Tasks

Zero-shot QA: On TyDiQA-arabic and ARCD, exact match (EM) is ~4% and F1 is ~14%; manual rescoring indicates ~25% correctness, since EM penalizes differences in answer phrasing.
Synthetic News Generation: Human raters confused AraGPT2-generated articles with real ones ~60% of the time; chance level for human texts was ~50%. A discriminator built on AraELECTRA achieves 98% F1 in detecting machine-generated output (Antoun et al., 2020).
Medical Chatbots (fine-tuned): Fine-tuning AraGPT2-base on 20K real and 80K semantically filtered synthetic QA pairs yields BERTScore-F1 = 72.3%. Hallucination rate drops from ~18% (real only) to ~10% (augmented). ChatGPT-4o-generated synthetic data leads to higher F1 gains and lower hallucination than Gemini-generated data (Allam et al., 12 Sep 2025).
Sign Language Recognition Integration: As the decoder in AutoSign, AraGPT2 is fine-tuned on pose-to-text pairs and receives pose embeddings that are compressed by a 1D CNN and then linearly projected. This achieves a 20.5% WER, a 6.1 pp improvement over the best prior system (Swin-MSTP), and substantially outperforms pose-CTC Transformer baselines (Johnny et al., 26 Jul 2025).
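Several of the scores in the list above (EM, token-level F1, WER) are simple string metrics. The sketch below gives simplified versions to show why a correct but differently phrased answer can score zero on EM. It assumes whitespace tokenization and omits the punctuation and article normalization that official benchmark scripts apply, so its numbers will not match official scorers exactly.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    """True only if the whitespace-normalized strings are identical."""
    return " ".join(prediction.split()) == " ".join(gold.split())

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# "in Cairo" vs. gold "Cairo": right answer, extra word.
print(exact_match("في القاهرة", "القاهرة"), token_f1("في القاهرة", "القاهرة"))
print(wer("a b c d", "a x c"))
```

In the QA example the prediction contains the gold answer plus a preposition, so EM is false while token F1 is 2/3, which mirrors the gap between automatic and manual scoring noted above.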

Benchmark Comparisons

On ArabicMMLU (zero-shot MCQ), AraGPT2 achieves ~31% (small/medium), trailing BLOOMZ-7B (41.7%), AceGPT-13B (40.3%), and Jais-13B (36.0%). On English and Indic benchmarks, AraGPT2’s monolingual focus leads to weak transfer (<25% accuracy versus >40% for multilingual models) (Alshehhi et al., 25 Jul 2025).

5. Synthetic Data Augmentation and Domain Adaptation

In low-resource tasks (e.g., Arabic medical QA), a hybrid corpus that mixes real data with synthetic data from closed-source generative models is used:

Pipeline: 20K real seed QAs expanded to 100K (20K real, 80K synthetic). ChatGPT-4o and Gemini 2.5 Pro generate 40K synthetic QAs each.
Filtering: Sentences are embedded with an Arabic-tuned Sentence-BERT model. Cosine similarity to the nearest seed QA is computed; pairs with $\cos(\mathbf{u},\mathbf{v})<0.75$ are discarded to enforce semantic/linguistic validity.
Manual Review: 500 sampled synthetic items are human-evaluated for plausibility, dialect, and fluency.
Training: Standard cross-entropy objective, 4 epochs, AdamW, learning rate $5\times10^{-5}$, FP16, batch size 8 per GPU. No curriculum schedule; real:synth ratio is 1:4 (Allam et al., 12 Sep 2025).
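The filtering step in the pipeline above amounts to a nearest-seed cosine threshold. The sketch below shows that logic on toy 2-D vectors; in the described pipeline the vectors would come from an Arabic-tuned Sentence-BERT encoder, which is not reproduced here, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_synthetic(synthetic_vecs, seed_vecs, threshold=0.75):
    """Return indices of synthetic items whose nearest seed has cos >= threshold."""
    kept = []
    for i, vec in enumerate(synthetic_vecs):
        nearest = max(cosine(vec, seed) for seed in seed_vecs)
        if nearest >= threshold:
            kept.append(i)
    return kept

# Toy embeddings standing in for sentence-encoder outputs.
seeds = [[1.0, 0.0], [0.0, 1.0]]
synthetic = [[0.9, 0.1],    # close to the first seed
             [0.6, 0.6],    # equally far from both seeds (cos ~ 0.71)
             [-1.0, 0.0]]   # opposite of the first seed
print(filter_synthetic(synthetic, seeds))
```

A brute-force nearest-seed search like this costs one similarity per (synthetic, seed) pair; at the 80K by 20K scale described above, a batched matrix product or an approximate nearest-neighbor index would be the practical choice.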

Results and Implications

Synthetic augmentation yields a +7.26 pp improvement over real-only, and +16.85 pp over base. ChatGPT-4o data yields superior F1 and lower hallucination rates relative to Gemini. Despite gains, ~10% residual hallucination necessitates human oversight; rare conditions remain underrepresented (Allam et al., 12 Sep 2025).

6. Applications and Limitations

AraGPT2 forms a foundation for Arabic content creation (journalism, dialogue systems), data augmentation, and sign language translation. Its fine-tuned variants can be deployed as lightweight, on-premise medical chatbots (latency, privacy), or as flexible NLP components in resource-constrained settings.

Limitations include:

Persistent hallucinations (~10% in medical settings, higher under compression)
Inferior zero-shot generalization relative to large multilingual/instruction-tuned models
Degraded reasoning and factual reliability at moderate to high sparsity or low-bit quantization
Underfitting in validation scenarios, and limited translation/few-shot performance (Antoun et al., 2020, Alshehhi et al., 25 Jul 2025, Allam et al., 12 Sep 2025, Johnny et al., 26 Jul 2025)

Mitigation strategies: retrieval-augmented generation, self-consistency decoding, continued pretraining on domain-specific or classical texts, RLHF, and inclusion of robust human-in-the-loop procedures.
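Of the mitigation strategies listed, self-consistency decoding is the simplest to sketch: sample several generations and keep the answer that most samples agree on. The version below is a deliberately simplified illustration that votes on whitespace-normalized answer strings; `sample_fn` is a hypothetical stand-in for a stochastic call to a fine-tuned model, and low agreement could be used as a signal to route the query to human review.

```python
from collections import Counter
from typing import Callable, Tuple

def self_consistent_answer(sample_fn: Callable[[str], str], prompt: str,
                           n_samples: int = 5) -> Tuple[str, float]:
    """Majority vote over sampled answers; returns (answer, agreement rate)."""
    answers = [" ".join(sample_fn(prompt).split()) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

# Hypothetical sampler that replays canned outputs in place of a real model.
canned = iter(["500 mg", "250 mg", "500 mg ", "500 mg", "1 g"])
answer, agreement = self_consistent_answer(lambda p: next(canned), "dose?", 5)
print(answer, agreement)
```

Exact string voting only works when answers are short and canonical; for free-form medical responses, clustering samples by semantic similarity would be a more realistic aggregation step.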

7. Future Directions

Enhancements foreseen for AraGPT2 involve:

Expansion to mixture-of-experts and larger-scale Transformer variants
Prompt engineering for zero/few-shot gains in translation, summarization, style transfer
Systematic integration with fact-checking/retrieval modules to suppress hallucinations
Pretraining on colloquial/multi-dialectal Arabic
Incorporation of advanced bias auditing and controlled generation protocols
Robust detection of machine-generated text that goes beyond ELECTRA-based discriminators

The adaptability of AraGPT2 to compressed inference, multi-modal pipelines (e.g., AutoSign), and cross-domain augmentation positions it as a flexible backbone for low-resource Arabic NLP research and deployment (Antoun et al., 2020, Alshehhi et al., 25 Jul 2025, Allam et al., 12 Sep 2025, Johnny et al., 26 Jul 2025).