AraBART+Morph+GEC for Arabic Grammatical Correction
- The paper introduces AraBART+Morph+GEC, which combines a BART-based encoder-decoder model with detailed morphological embeddings and a grammatical error detection (GED) objective to improve Arabic error correction.
- Its proposals are filtered by a refined edit selection pipeline using logistic regression, agreement boosting, and non-maximum suppression, which reaches up to 84.64% F₀.₅ on the QALB-15 L1 benchmark within the ArbESC+ ensemble.
- Serving as a key component of the ArbESC+ ensemble, the system leverages both neural and linguistic features to set a new state-of-the-art for Arabic grammatical error correction.
AraBART+Morph+GEC is an Arabic grammatical error correction (GEC) system integrating a BART-based sequence-to-sequence architecture, explicit morphological analysis, and a parallel grammatical error detection (GED) objective. Developed as a key component of the ArbESC+ multi-system edit selection framework, it leverages both neural and linguistic features to address the challenges of morphologically rich and syntactically complex Arabic text. The system combines span-based edit proposals from independently trained variants, enabling fine-grained correction decisions within a larger ensemble strategy (Alrehili et al., 18 Nov 2025).
1. Architecture of AraBART+Morph+GEC
1.1 Base AraBART Backbone
AraBART employs the encoder–decoder “denoising” transformer originally proposed by Lewis et al. (2019), re-pretrained on extensive Arabic corpora as described in Antoun et al. (2020). Arabic-specific modifications include a BPE vocabulary (approximately 42,000 tokens), script-level adjustments for right-to-left text, and orthographic normalization. Pretraining objectives follow standard BART, encompassing masked token infilling, masked span infilling, and sentence permutation, all adapted for Arabic data.
1.2 Morphological Feature Integration
Morphological information is introduced via CAMeL Tools’ MADA+ analyzer. For each input token position, discrete features are extracted:
- POS tag
- Stem
- Root
- Additional attributes (number, gender, case, etc.)
These features are embedded as follows: each discrete feature value is mapped to a learned embedding vector, and the encoder’s input representation at each position combines the token embedding with the embeddings of that token’s morphological features. Optionally, internal layers inject morphological embeddings into the multi-head self-attention keys and values, enabling direct incorporation of morphological cues in attention computations.
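A minimal sketch of this fusion, assuming the token, positional, and per-feature embeddings are simply summed (the paper’s exact fusion mechanism and feature inventory may differ; all names below are illustrative):

```python
import torch
import torch.nn as nn

class MorphAugmentedEmbedding(nn.Module):
    """Illustrative fusion of token, positional, and morphological feature embeddings."""

    def __init__(self, vocab_size, d_model, feature_sizes, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        # One embedding table per discrete morphological feature (POS, gender, ...).
        self.morph = nn.ModuleDict(
            {name: nn.Embedding(size, d_model) for name, size in feature_sizes.items()}
        )

    def forward(self, token_ids, morph_ids):
        # token_ids: (batch, seq); morph_ids: dict mapping feature name -> (batch, seq)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        for name, table in self.morph.items():
            x = x + table(morph_ids[name])  # assumed additive fusion of morph features
        return x

emb = MorphAugmentedEmbedding(
    vocab_size=42000, d_model=768,
    feature_sizes={"pos": 40, "gender": 4, "number": 4, "case": 5},
)
```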
1.3 GEC-specific Multi-task Objectives
The model is trained for both text generation and error detection:
- Sequence generation with standard cross-entropy loss over the corrected target sequence, $\mathcal{L}_{\text{gen}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, x)$.
- Grammatical error detection (GED) with a per-token binary cross-entropy loss, $\mathcal{L}_{\text{GED}} = -\sum_{i} \big[\, g_i \log \sigma(z_i) + (1 - g_i)\log(1 - \sigma(z_i)) \,\big]$,
where $z_i$ is the GED logit for token $i$, $g_i$ is its gold label, and $\sigma$ is the sigmoid. The full objective is $\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda\,\mathcal{L}_{\text{GED}}$ with a fixed mixing weight $\lambda$.
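A compact sketch of this joint objective, assuming a padded decoder target and one binary GED logit per source token; the mixing weight value used here is an assumption for illustration:

```python
import torch.nn.functional as F

def joint_loss(gen_logits, target_ids, ged_logits, ged_labels, lam=0.5, pad_id=1):
    """Combined generation + grammatical error detection loss (illustrative)."""
    # Generation: token-level cross-entropy over the decoder vocabulary, ignoring padding.
    gen = F.cross_entropy(
        gen_logits.view(-1, gen_logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,
    )
    # GED: per-token binary cross-entropy on encoder-side error labels.
    ged = F.binary_cross_entropy_with_logits(
        ged_logits.view(-1), ged_labels.view(-1).float()
    )
    return gen + lam * ged  # lam is an assumed value; the paper's weight may differ
```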
1.4 Training Regime and Hyperparameters
- Data: QALB-2014, QALB-2015, ZAEBUC corpora (joint/separate variants)
- Optimization: AdamW, learning rate , weight decay 0.01
- Batch size: 16, mixed precision (fp16)
- Training epochs: 50, early stopping via development set
- Inference: beam search with beam size 5, max output length 100 (see the decoding sketch below)
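Assuming the fine-tuned model is exposed as a standard Hugging Face seq2seq checkpoint, these decoding settings correspond to a generation call like the following (the checkpoint path and input sentence are placeholders):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path; the actual fine-tuned checkpoints are not named in the text.
tokenizer = AutoTokenizer.from_pretrained("path/to/arabart-morph-gec")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/arabart-morph-gec")

inputs = tokenizer("ذهبت الى المدرسه", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,     # beam size reported above
    max_length=100,  # maximum output length reported above
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```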
2. Generation and Featurization of Correction Proposals
2.1 Candidate Edit Extraction
At inference, three independently trained AraBART+Morph+GEC models (corresponding to the QALB-14, QALB-15, and ZAEBUC domains) generate corrected sentences. Source-to-output alignments yield proposed span edits $(i, j, r)$, interpreted as replacements of the source tokens in span $[i, j)$ by the string $r$.
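One way to derive such span edits from a source/hypothesis pair is a token-level alignment, for example with Python’s difflib; this is a sketch under that assumption, not necessarily the alignment procedure used in the paper:

```python
from difflib import SequenceMatcher

def extract_edits(source_tokens, corrected_tokens):
    """Return span edits (i, j, replacement) that turn the source into the correction."""
    edits = []
    matcher = SequenceMatcher(a=source_tokens, b=corrected_tokens, autojunk=False)
    for op, i, j, k, l in matcher.get_opcodes():
        if op != "equal":  # covers replace, delete, and insert operations
            edits.append((i, j, " ".join(corrected_tokens[k:l])))
    return edits

src = "ذهبت الى المدرسه".split()
hyp = "ذهبت إلى المدرسة".split()
print(extract_edits(src, hyp))  # [(1, 3, 'إلى المدرسة')]
```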
2.2 Numerical Feature Representation
For each edit, several numerical features are computed (a simplified featurization sketch follows this list):
- System confidence: for a proposal from system $s$, the normalized probability mass that the system’s output beams assign to hypotheses containing the edit.
- Morphological consistency: the degree of agreement between the MADA+ analysis of the replacement string and the morphological features predicted for the corresponding position.
- Span features: the size of the replaced span and the length of the replacement string $r$.
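A simplified sketch of turning one candidate edit into these features; the confidence and morphological-agreement values are assumed to be precomputed upstream, and the exact feature definitions in the paper may differ:

```python
def edit_features(edit, system_confidence, morph_agreement):
    """Numerical features for a candidate edit (i, j, replacement) -- illustrative."""
    i, j, replacement = edit
    return {
        "confidence": system_confidence,       # beam probability mass for this edit
        "morph_consistency": morph_agreement,  # agreement with MADA+ predicted features
        "span_size": j - i,                    # number of replaced source tokens
        "replacement_len": len(replacement.split()),
    }

print(edit_features((1, 3, "إلى المدرسة"), system_confidence=0.82, morph_agreement=1.0))
```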
3. Edit Selection: Classifier and Decision Pipeline
3.1 Feature Vector Construction
A binary feature vector of dimension $S \times T$ (where $S$ is the number of systems and $T$ is the number of edit types) records which system(s) proposed the edit and of which type (insertion, deletion, substitution). Optionally, real-valued meta-features (system confidence, morphological consistency, span length) are appended.
3.2 Logistic Regression Scoring
Each candidate edit $e$ receives a raw probability score via logistic regression, $p(e) = \sigma(\mathbf{w}^{\top}\mathbf{v}_e + b)$, where $\mathbf{v}_e$ is its feature vector; the classifier is optimized with binary cross-entropy on labeled edits.
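A minimal sketch of the indicator encoding and logistic-regression scoring using scikit-learn; the dimensions, meta-features, and toy training data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

N_SYSTEMS, N_TYPES = 9, 3  # edit types: insertion, deletion, substitution

def encode(proposing_systems, edit_type, meta=()):
    """Binary (system x edit-type) indicators plus optional real-valued meta-features."""
    v = np.zeros(N_SYSTEMS * N_TYPES)
    for s in proposing_systems:
        v[s * N_TYPES + edit_type] = 1.0
    return np.concatenate([v, np.asarray(meta, dtype=float)])

# Toy labeled edits: 1 = edit matches the gold correction, 0 = spurious edit.
X = np.stack([encode({0, 2}, 2, (0.8, 1.0, 1)), encode({5}, 0, (0.3, 0.0, 1))])
y = np.array([1, 0])
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([encode({0, 2, 4}, 2, (0.9, 1.0, 1))])[:, 1])  # raw edit score
```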
3.3 Agreement Boosting and Dual-Threshold Filtering
System agreement is quantified by how many systems propose the same edit:
- Boost factor: a multiplicative factor that grows with the number of agreeing systems.
- Adjusted score: the raw logistic-regression score scaled by the boost factor.
A candidate is accepted only if both its raw score and its agreement-boosted score exceed their respective thresholds, enforcing both raw confidence and agreement.
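A sketch of this dual-threshold filter; the linear boost and the threshold values are assumptions for illustration rather than the paper’s exact settings:

```python
def accept_edit(raw_score, n_agreeing, n_systems=9,
                alpha=0.5, raw_threshold=0.5, boosted_threshold=0.8):
    """Keep an edit only if both raw and agreement-boosted scores pass (illustrative)."""
    agreement = n_agreeing / n_systems
    boost = 1.0 + alpha * agreement          # assumed linear boost in system agreement
    adjusted = min(1.0, raw_score * boost)   # capped, agreement-adjusted score
    return raw_score >= raw_threshold and adjusted >= boosted_threshold

print(accept_edit(raw_score=0.72, n_agreeing=6))  # True: confident and widely agreed
print(accept_edit(raw_score=0.72, n_agreeing=1))  # False: too little agreement
```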
3.4 Non-Maximum Suppression for Conflict Resolution
Inter-edit overlap is measured by the one-dimensional intersection-over-union (IoU) of the edited source spans,
$$\mathrm{IoU}(e_1, e_2) = \frac{|[i_1, j_1) \cap [i_2, j_2)|}{|[i_1, j_1) \cup [i_2, j_2)|}.$$
A greedy non-maximum suppression (NMS) procedure accepts the highest-scoring edits first and discards any later edit whose span overlaps an already accepted edit beyond the IoU threshold, with at most one insertion per position.
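A sketch of this greedy span-level NMS; the IoU threshold value is an assumption:

```python
def span_iou(a, b):
    """One-dimensional IoU between two half-open token spans (i, j)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else float(a == b)  # identical insertions fully overlap

def nms(edits, scores, iou_threshold=0.1):
    """Greedy NMS: keep highest-scoring edits whose spans do not overlap kept ones."""
    kept = []
    for edit in sorted(edits, key=lambda e: scores[e], reverse=True):
        if all(span_iou(edit[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(edit)
    return sorted(kept)  # reorder left to right for application

edits = [(1, 3, "إلى المدرسة"), (2, 3, "المدرسه"), (5, 5, "و")]
scores = {edits[0]: 0.9, edits[1]: 0.6, edits[2]: 0.7}
print(nms(edits, scores))  # the lower-scoring overlapping edit (2, 3, ...) is dropped
```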
4. System Combination in ArbESC+ Framework
4.1 Model Ensemble
The full ArbESC+ system integrates:
- Four sequence-to-sequence GEC models: AraT5, ByT5, mT5, AraBART
- Three AraBART+Morph+GEC models (trained on QALB-14, QALB-15, ZAEBUC)
- Two text-editing models
This ensemble yields nine candidate outputs per sentence.
4.2 Combination and Decision Pipeline
The ensemble workflow is as follows:
- Aggregate unique span edits from all nine systems.
- Encode features for each edit as described above.
- Score with logistic regression.
- Apply agreement boosting and dual-threshold filtering.
- Resolve conflicts via NMS.
- Apply the surviving edits to the source sentence sequentially, left to right (see the sketch after this list).
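A sketch of this final application step, assuming the surviving edits are non-overlapping span edits $(i, j, r)$ over the source tokens:

```python
def apply_edits(source_tokens, edits):
    """Apply non-overlapping span edits left to right to produce the corrected sentence."""
    out, cursor = [], 0
    for i, j, replacement in sorted(edits):
        out.extend(source_tokens[cursor:i])  # copy untouched tokens up to the edit
        if replacement:
            out.extend(replacement.split())  # insertion or substitution text
        cursor = j                           # skip the replaced (or deleted) span
    out.extend(source_tokens[cursor:])
    return " ".join(out)

print(apply_edits("ذهبت الى المدرسه".split(), [(1, 3, "إلى المدرسة")]))
# -> ذهبت إلى المدرسة
```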
4.3 Rationale for Micro-edit Level Combination
Micro-edit based voting enables fine-grained error correction where edits, rather than whole sentences, are the central decision unit. This enables contributions from high-confidence system components even when they disagree on overall sentence structure. Thresholding and agreement-based boosting limit spurious or low-confidence edits, while NMS prevents conflicting alterations on overlapping spans.
5. Empirical Performance and Ablative Analyses
5.1 Comparative Results
| Model | QALB-14 (F₀.₅) | QALB-15 L1 (F₀.₅) | QALB-15 L2 (F₀.₅) |
|---|---|---|---|
| AraBART+Morph+GEC (2014) | 76.20% | 78.85% | 52.00% |
| AraBART+Morph+GEC (2015) | 77.99% | 77.97% | 60.98% |
| AraBART+Morph+GEC (ZAEBUC) | 77.85% | 77.73% | 60.79% |
| ArbESC+ (all 9 combined) | 82.63% | 84.64% | 65.55% |
ArbESC+ outperforms single models by 4–6 F₀.₅ points across all benchmarks, establishing new state-of-the-art performance for Arabic GEC.
5.2 System Combination vs. Baselines
Majority voting, weighted voting, minimum Bayes risk (MBR) decoding, and standard ESC system combination are all surpassed by ArbESC+ by 1–3 F₀.₅ points on each evaluation split.
5.3 Impact of the Number of Models
Ablation results show that using only the best 3–5 models achieves F₀.₅ scores of 80.71–80.77 on QALB-14, compared with 82.63 for the full 9-model ArbESC+ system. Including all nine systems but without the selection combiner yields F₀.₅ = 80.78, indicating that the edit-level combination pipeline contributes further gains.
5.4 Threshold Sensitivity
The dual-threshold filtering is sensitive: threshold values below 0.5 admit too many low-quality edits and depress F₀.₅, whereas values above 0.9 sacrifice recall. Intermediate values up to roughly $0.8$ deliver the strongest results.
5.5 Effect of Morphological Features
AraBART+Morph+GEC’s explicit use of morphological embeddings and the parallel GED objective yields an improvement of roughly 2 F₀.₅ points over vanilla AraBART, confirming the value of integrating linguistic features into the proposal models for Arabic GEC.
6. Summary and Significance
AraBART+Morph+GEC augments the standard Arabic BART transformer with detailed morphological features and a grammatical error detection head, producing more accurate and linguistically informed error corrections. Its variants serve as black-box proposal generators within ArbESC+, whose classifier pipeline integrates proposals from nine diverse systems, leverages model agreement, filters candidates with calibrated confidence thresholds, and resolves conflicts via span-level NMS. With final F₀.₅ scores of 82.63%, 84.64%, and 65.55% on the QALB-14, QALB-15 L1, and QALB-15 L2 benchmarks, AraBART+Morph+GEC, especially within ArbESC+, sets a new state of the art for Arabic grammatical error correction and exemplifies the impact of combining neural and morphological approaches (Alrehili et al., 18 Nov 2025).