AraLingBench: Evaluating Arabic LLMs

Updated 19 November 2025
  • AraLingBench is a human-annotated benchmark that diagnoses Arabic linguistic competence by testing grammar, morphology, orthography, reading comprehension, and syntax.
  • It evaluates 35 recent LLMs, revealing strong surface-level proficiency alongside significant challenges in deep grammatical and syntactic reasoning.
  • The open-sourced benchmark offers reproducible evaluations and targeted insights, driving improvements in both Arabic-centric and multilingual language models.

AraLingBench is a fully human-annotated benchmark engineered to rigorously diagnose the Arabic linguistic competence of LLMs. Targeting structural language understanding, it is organized around five core categories—grammar, morphology, orthography (spelling), reading comprehension, and syntax—with 150 expert-designed multiple-choice items. Evaluations of 35 recent LLMs show strong surface proficiency but significant limitations in deeper grammatical and syntactic reasoning, exposing a persistent gap between factual recall and authentic linguistic mastery. AraLingBench serves as both a diagnostic and developmental tool for Arabic-centric and multilingual LLMs, with all items and code released openly for reproducibility (Zbib et al., 18 Nov 2025).

1. Rationale and Scope

Prevailing Arabic NLP benchmarks—such as ArabicMMLU, EXAMS, and 3LM—primarily assess factual knowledge and subject matter expertise. These resources are limited in diagnosing whether models possess core language skills, an especially acute gap in Arabic due to its morphological richness, inflectional complexity, and multi-tiered syntactic structures. AraLingBench is motivated by the need to (a) isolate and test fundamental linguistic abilities, (b) discern memorization or superficial pattern matching from genuine structural understanding, and (c) support targeted error analysis for model improvement.

Research questions guiding AraLingBench include:

  • Whether models display balanced competence across grammar, morphology, spelling, reading comprehension, and syntax.
  • Whether these skills are inter-correlated, e.g., whether robust morphology predicts grammatical accuracy.
  • Whether general knowledge-based benchmarks (e.g., ArabicMMLU) are predictive of core linguistic abilities.
  • Whether human-annotated difficulty levels align with actual model performance (Zbib et al., 18 Nov 2025).

2. Benchmark Construction and Category Framework

AraLingBench is composed of 150 balanced, multiple-choice questions across five domains:

| Category | Core Phenomena Tested |
|---|---|
| Grammar | Agreement, case, proclitics, clitics |
| Morphology | Root-pattern derivation, inflection, plural formation, conjugation |
| Spelling | Hamza/shadda usage, affix placement, orthographic correctness |
| Reading Comprehension | Passage-level inference, lexical comprehension |
| Syntax | Phrase structure, word order (VSO/SVO), embedded clauses |

Each category contains 30 items. Questions use predominantly four-choice formats (83.3%), with a minority having three choices (16.7%). Difficulty levels are expert-annotated as Easy (33.3%), Medium (49.3%), and Hard (17.3%). Category design is formalized with explicit production rules, e.g., for morphology: $\text{Form}(r, P) = \alpha_1 c_1 \alpha_2 c_2 \alpha_3 c_3 \alpha_4 \alpha_5$, where $r = (c_1, c_2, c_3)$ is a triliteral root and $P$ the pattern. Syntax is captured in a simplified phrase structure grammar, e.g., NP → (Det) N (PP)* (Zbib et al., 18 Nov 2025).
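
To make the root-and-pattern rule concrete, the following is a minimal sketch of how a pattern template can be filled with a triliteral root, mirroring $\text{Form}(r, P)$ above; the slot encoding, the `apply_pattern` helper, and the transliterated example are illustrative assumptions rather than anything released with the benchmark.

```python
# Minimal illustration of the root-and-pattern rule Form(r, P):
# a pattern is a template with slots C1, C2, C3 that the root radicals fill.
# The slot-marker encoding below is a hypothetical choice, not the
# benchmark's internal representation.

def apply_pattern(root: tuple[str, str, str], pattern: list[str]) -> str:
    """Interleave the triliteral root r = (c1, c2, c3) into pattern P."""
    slots = {"C1": root[0], "C2": root[1], "C3": root[2]}
    return "".join(slots.get(piece, piece) for piece in pattern)

# Example: root k-t-b with the faa3il (active participle) pattern -> "kaatib"
root = ("k", "t", "b")
faa3il = ["C1", "aa", "C2", "i", "C3"]   # alpha_1 c1 alpha_2 c2 alpha_3 c3 ...
print(apply_pattern(root, faa3il))        # kaatib ("writer")
```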

3. Dataset Creation and Quality Control

The construction pipeline comprises four stages (a hypothetical item record reflecting their output is sketched after the list):

  1. Expert Item Generation: Five native-speaking linguists create candidate MCQs targeting the specified linguistic phenomena.
  2. Native-Speaker Validation: A separate group assesses clarity and challenge, filtering out trivial or overly narrow items.
  3. Senior Linguist Review: All items undergo review for ambiguity, uniqueness of correct answer, and proper categorial tagging.
  4. Difficulty Annotation: Three independent annotators label difficulty; disagreements (<5%) are resolved by committee.
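
To illustrate what this pipeline produces, here is a hypothetical record format for a single benchmark item, encoding the constraints described above (five categories, three or four choices, a single correct answer, expert difficulty labels); the field names and the `Item` class are assumptions for illustration, not the released schema.

```python
# Hypothetical record format for one AraLingBench item; field names are
# illustrative assumptions, not the released data schema.
from dataclasses import dataclass

CATEGORIES = {"grammar", "morphology", "spelling", "reading_comprehension", "syntax"}
DIFFICULTIES = {"easy", "medium", "hard"}

@dataclass
class Item:
    question: str          # Arabic stem written by an expert annotator (stage 1)
    choices: list[str]     # 3 or 4 options, vetted for clarity and challenge (stage 2)
    answer: int            # index of the single correct option (stage 3 review)
    category: str          # one of the five linguistic categories
    difficulty: str        # majority label from three annotators (stage 4)

    def validate(self) -> None:
        assert self.category in CATEGORIES and self.difficulty in DIFFICULTIES
        assert len(self.choices) in (3, 4)
        assert 0 <= self.answer < len(self.choices)
```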

Representative question types are distributed as follows:

  • Grammar: Subject–verb agreement resolution.
  • Morphology: Identification of source forms for derived words.
  • Spelling: Orthographic correctness and diacritic placement.
  • Reading Comprehension: Passage-level inference.
  • Syntax: Acceptable positions of constituents in canonical and noncanonical word order (Zbib et al., 18 Nov 2025).

4. Evaluation Protocol and Metrics

Models are evaluated in zero-shot mode: each prompt, fully in Arabic, lists options (A–D). No chain-of-thought or in-context examples are provided. The evaluation metric is standard accuracy,

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}$$

where $\hat{y}_i$ is the model's prediction and $y_i$ the gold answer, over $N$ questions per category. Thirty-five LLMs are covered, spanning Arabic-centric (e.g., Hala, Yehia, Fanar, AceGPT), Arabic–English bilingual (JAIS, ALLaM, SUHAIL), and multilingual (Qwen2.5-14B, Phi-4-Mini) models from 350M to 70B parameters (Zbib et al., 18 Nov 2025).
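
As a concrete sketch of this protocol, the snippet below builds a zero-shot multiple-choice prompt and scores letter predictions with the accuracy formula above, assuming items carry `question`, `choices`, and `answer` fields; `query_model` is a hypothetical stand-in for the actual inference call, not part of the released evaluation code.

```python
# Sketch of the zero-shot scoring loop; query_model() stands in for whatever
# API or local inference call is used and is purely hypothetical.
LETTERS = "ABCD"

def build_prompt(item: dict) -> str:
    # Zero-shot, fully Arabic prompt: question stem followed by lettered options;
    # no chain-of-thought and no in-context examples.
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
    return f"{item['question']}\n{options}\n"

def accuracy(items: list[dict], query_model) -> float:
    correct = 0
    for item in items:
        pred = query_model(build_prompt(item)).strip()[:1].upper()  # first letter
        correct += pred == LETTERS[item["answer"]]
    return correct / len(items)
```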

5. Empirical Findings

Per-Category and Overall Performance

Top models (Yehia-7B-preview, ALLaM-7B) achieve mean accuracy of 74%. Spelling and reading comprehension are most tractable (median 59%); syntax is the most challenging (median 48%, best open models ≈60%). Morphology achieves median 60%, with top models at 80%. Significant inter-model variance is present (IQR ≈ 15–20 points). Category-level boxplots confirm that only surface-level skills are robustly acquired (Zbib et al., 18 Nov 2025).

| Model | Spelling | Syntax | Morph. | Grammar | Read. Comp. | Avg |
|---|---|---|---|---|---|---|
| Yehia-7B-preview | 86.7 | 53.3 | 80.0 | 80.0 | 70.0 | 74.0 |
| ALLaM-7B-Instruct | 86.7 | 60.0 | 73.3 | 73.3 | 76.7 | 74.0 |
| Hala-350M | 36.7 | 43.3 | 30.0 | 46.7 | 36.7 | 38.7 |

Correlation and Cross-Benchmark Predictiveness

Skill correlation matrices show strong coupling of morphology and grammar ($r = 0.83$) and of grammar and spelling ($r = 0.86$); syntax correlates only weakly with the other skills ($r = 0.13$–$0.40$). AraLingBench performance correlates strongly with knowledge-oriented benchmarks (ArabicMMLU, $r = 0.88$) but not with retrieval-augmented benchmarks (e.g., ALRAGE, $r = -0.54$). High factual recall therefore does not imply deeper structural competence (Zbib et al., 18 Nov 2025).
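
Such skill correlations can be computed directly from the per-category accuracy matrix (models × categories). A minimal sketch follows, using only the three models from the table above as a toy sample, so the resulting coefficients are purely illustrative and not the paper's reported values.

```python
# Pearson correlation between per-category accuracies across models.
# Rows reuse the three models listed in the results table above; with so few
# models the coefficients are only illustrative.
import numpy as np

categories = ["spelling", "syntax", "morphology", "grammar", "reading_comprehension"]
acc = np.array([
    [86.7, 53.3, 80.0, 80.0, 70.0],   # Yehia-7B-preview
    [86.7, 60.0, 73.3, 73.3, 76.7],   # ALLaM-7B-Instruct
    [36.7, 43.3, 30.0, 46.7, 36.7],   # Hala-350M
])

corr = np.corrcoef(acc, rowvar=False)   # 5x5 skill-by-skill correlation matrix
i, j = categories.index("morphology"), categories.index("grammar")
print(f"morphology-grammar r = {corr[i, j]:.2f}")
```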

Difficulty Analysis

Human-annotated difficulty is only loosely predictive of actual LLM performance (Easy ≈58%, Medium ≈50%, Hard ≈54%), with non-monotonic performance patterns and some models outperforming on hardest items. This suggests a misalignment between human and model-perceived difficulty (Zbib et al., 18 Nov 2025).

6. Linguistic and Modeling Insights

AraLingBench demonstrates that state-of-the-art Arabic LLMs succeed on surface-level tasks, but systematically fail in complex morphosyntactic and syntactic reasoning. Morphology and grammar display high co-development, whereas syntax emerges as a distinct and persistent challenge. The data support the need for explicit inductive biases or auxiliary syntactic objectives—e.g., hierarchical attention or syntactic pretraining—to elevate syntactic competence.

Recommended extensions include:

  • Integrating explicit morphological analyzers or syntactic parsers during pretraining/fine-tuning.
  • Broader coverage of dialectal and register variation.
  • Inclusion of open-ended generation items to evaluate productive syntactic generalization (Zbib et al., 18 Nov 2025).

7. Extended Benchmarking and Dialectal/Cultural Integration

Beyond core structural evaluation, integrating dialectal and cultural competency tracks via AraDiCE expands the scope to Levantine, Egyptian, and Gulf dialects and introduces regional cultural QA. These modular extensions (dialect identification, generation, machine translation, commonsense reasoning, reading comprehension, misinformation detection, cultural knowledge) use uniform prompts, human post-editing, and dialect-specific sampling to expose LLMs' limits in dialect processing and cultural alignment. Empirical results show that Arabic-centric models (AceGPT, Jais) outperform generalist LLMs, but dialect identification, generation, and translation remain fundamentally challenging: dedicated dialect-identification models reach F1 ≈ 0.85, while open general-purpose models score far lower (e.g., Llama3 at F1 = 0.42) (Mousi et al., 17 Sep 2024).

This expanded protocol allows AraLingBench to serve as a unified, holistic diagnostic for MSA and dialectal/cultural capabilities, driving leaderboard-based evaluation and the development of dialectally robust and culturally aware Arabic LLMs (Mousi et al., 17 Sep 2024).

8. Accessibility and Reproducibility

All evaluation items, gold answers, code, and templates for zero-shot prompting are open-sourced at https://github.com/hammoudhasan/AraLingBench. These resources facilitate direct replication, model comparison, and diagnostic error analysis, establishing AraLingBench as the current reference for core Arabic linguistic capabilities in LLMs (Zbib et al., 18 Nov 2025).

