AraBART+Morph+GEC for Arabic Grammatical Correction
- The paper introduces AraBART+Morph+GEC, which combines a BART-based encoder-decoder model with detailed morphological embeddings and a grammatical error detection (GED) objective to improve Arabic error correction.
- Its proposals are filtered by a refined edit selection pipeline using logistic regression, agreement boosting, and non-maximum suppression, which reaches up to 84.64% F₀.₅ on the QALB-15 L1 benchmark within the ArbESC+ ensemble.
- Serving as a key component of the ArbESC+ ensemble, the system leverages both neural and linguistic features to set a new state-of-the-art for Arabic grammatical error correction.
AraBART+Morph+GEC is an Arabic grammatical error correction (GEC) system integrating a BART-based sequence-to-sequence architecture, explicit morphological analysis, and a parallel grammatical error detection (GED) objective. Developed as a key component of the ArbESC+ multi-system edit selection framework, it leverages both neural and linguistic features to address the challenges of morphologically rich and syntactically complex Arabic text. The system combines span-based edit proposals from independently trained variants, enabling fine-grained correction decisions within a larger ensemble strategy (Alrehili et al., 18 Nov 2025).
1. Architecture of AraBART+Morph+GEC
1.1 Base AraBART Backbone
AraBART employs the encoder–decoder “denoising” transformer originally proposed by Lewis et al. (2019), re-pretrained on extensive Arabic corpora as described in Antoun et al. (2020). Arabic-specific modifications include a BPE vocabulary (approximately 42,000 tokens), script-level adjustments for right-to-left text, and orthographic normalization. Pretraining objectives follow standard BART, encompassing masked token infilling, masked span infilling, and sentence permutation, all adapted for Arabic data.
1.2 Morphological Feature Integration
Morphological information is introduced via CAMeL Tools’ MADA+ analyzer. For each input token position, discrete features are extracted:
- POS tag
- Stem
- Root
- Additional attributes (number, gender, case, etc.)
These features are embedded as follows: each discrete feature value is mapped to a learned embedding vector, and the encoder’s input representation at each position combines the token embedding with the embeddings of that token’s morphological features. Optionally, internal layers inject morphological embeddings into the multi-head self-attention keys and values, enabling direct incorporation of morphological cues in attention computations.
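A minimal sketch of this fusion, assuming the token, positional, and per-feature embeddings are simply summed (the paper’s exact fusion mechanism and feature inventory may differ; all names below are illustrative):

```python
import torch
import torch.nn as nn

class MorphAugmentedEmbedding(nn.Module):
    """Illustrative fusion of token, positional, and morphological feature embeddings."""

    def __init__(self, vocab_size, d_model, feature_sizes, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        # One embedding table per discrete morphological feature (POS, gender, ...).
        self.morph = nn.ModuleDict(
            {name: nn.Embedding(size, d_model) for name, size in feature_sizes.items()}
        )

    def forward(self, token_ids, morph_ids):
        # token_ids: (batch, seq); morph_ids: dict mapping feature name -> (batch, seq)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        for name, table in self.morph.items():
            x = x + table(morph_ids[name])  # assumed additive fusion of morph features
        return x

emb = MorphAugmentedEmbedding(
    vocab_size=42000, d_model=768,
    feature_sizes={"pos": 40, "gender": 4, "number": 4, "case": 5},
)
```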
1.3 GEC-specific Multi-task Objectives
The model is trained for both text generation and error detection:
- Sequence generation with standard cross-entropy loss over the corrected target sequence, $\mathcal{L}_{\text{gen}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, x)$.
- Grammatical error detection (GED) with a per-token binary cross-entropy loss, $\mathcal{L}_{\text{GED}} = -\sum_{i} \big[\, g_i \log \sigma(z_i) + (1 - g_i)\log(1 - \sigma(z_i)) \,\big]$,
where $z_i$ is the GED logit for token $i$, $g_i$ is its gold label, and $\sigma$ is the sigmoid. The full objective is $\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda\,\mathcal{L}_{\text{GED}}$ with a fixed mixing weight $\lambda$.
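A compact sketch of this joint objective, assuming a padded decoder target and one binary GED logit per source token; the mixing weight value used here is an assumption for illustration:

```python
import torch.nn.functional as F

def joint_loss(gen_logits, target_ids, ged_logits, ged_labels, lam=0.5, pad_id=1):
    """Combined generation + grammatical error detection loss (illustrative)."""
    # Generation: token-level cross-entropy over the decoder vocabulary, ignoring padding.
    gen = F.cross_entropy(
        gen_logits.view(-1, gen_logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,
    )
    # GED: per-token binary cross-entropy on encoder-side error labels.
    ged = F.binary_cross_entropy_with_logits(
        ged_logits.view(-1), ged_labels.view(-1).float()
    )
    return gen + lam * ged  # lam is an assumed value; the paper's weight may differ
```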
1.4 Training Regime and Hyperparameters
- Data: QALB-2014, QALB-2015, ZAEBUC corpora (joint/separate variants)
- Optimization: AdamW, learning rate , weight decay 0.01
- Batch size: 16, mixed precision (fp16)
- Training epochs: 50, early stopping via development set
- Inference: beam search with beam size 5, max output length 100 (see the decoding sketch below)
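Assuming the fine-tuned model is exposed as a standard Hugging Face seq2seq checkpoint, these decoding settings correspond to a generation call like the following (the checkpoint path and input sentence are placeholders):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path; the actual fine-tuned checkpoints are not named in the text.
tokenizer = AutoTokenizer.from_pretrained("path/to/arabart-morph-gec")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/arabart-morph-gec")

inputs = tokenizer("ذهبت الى المدرسه", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,     # beam size reported above
    max_length=100,  # maximum output length reported above
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```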
2. Generation and Featurization of Correction Proposals
2.1 Candidate Edit Extraction
At inference, three independently trained AraBART+Morph+GEC models (corresponding to the QALB-14, QALB-15, and ZAEBUC domains) generate corrected sentences. Source-to-output alignments yield proposed span edits $(i, j, r)$, interpreted as replacements of the source tokens in span $[i, j)$ by the string $r$.
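One way to derive such span edits from a source/hypothesis pair is a token-level alignment, for example with Python’s difflib; this is a sketch under that assumption, not necessarily the alignment procedure used in the paper:

```python
from difflib import SequenceMatcher

def extract_edits(source_tokens, corrected_tokens):
    """Return span edits (i, j, replacement) that turn the source into the correction."""
    edits = []
    matcher = SequenceMatcher(a=source_tokens, b=corrected_tokens, autojunk=False)
    for op, i, j, k, l in matcher.get_opcodes():
        if op != "equal":  # covers replace, delete, and insert operations
            edits.append((i, j, " ".join(corrected_tokens[k:l])))
    return edits

src = "ذهبت الى المدرسه".split()
hyp = "ذهبت إلى المدرسة".split()
print(extract_edits(src, hyp))  # [(1, 3, 'إلى المدرسة')]
```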
2.2 Numerical Feature Representation
For each edit, several numerical features are computed (a simplified featurization sketch follows this list):
- System confidence: for a proposal from system $s$, the normalized probability mass that the system’s output beams assign to hypotheses containing the edit.
- Morphological consistency: the degree of agreement between the MADA+ analysis of the replacement string and the morphological features predicted for the corresponding position.
- Span features: the size of the replaced span and the length of the replacement string $r$.
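A simplified sketch of turning one candidate edit into these features; the confidence and morphological-agreement values are assumed to be precomputed upstream, and the exact feature definitions in the paper may differ:

```python
def edit_features(edit, system_confidence, morph_agreement):
    """Numerical features for a candidate edit (i, j, replacement) -- illustrative."""
    i, j, replacement = edit
    return {
        "confidence": system_confidence,       # beam probability mass for this edit
        "morph_consistency": morph_agreement,  # agreement with MADA+ predicted features
        "span_size": j - i,                    # number of replaced source tokens
        "replacement_len": len(replacement.split()),
    }

print(edit_features((1, 3, "إلى المدرسة"), system_confidence=0.82, morph_agreement=1.0))
```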
3. Edit Selection: Classifier and Decision Pipeline
3.1 Feature Vector Construction
A binary feature vector of dimension $S \times T$ (where $S$ is the number of systems and $T$ is the number of edit types) records which system(s) proposed the edit and of which type (insertion, deletion, substitution). Optionally, real-valued meta-features (system confidence, morphological consistency, span length) are appended.
3.2 Logistic Regression Scoring
Each candidate edit $e$ receives a raw probability score via logistic regression, $p(e) = \sigma(\mathbf{w}^{\top}\mathbf{v}_e + b)$, where $\mathbf{v}_e$ is its feature vector; the classifier is optimized with binary cross-entropy on labeled edits.
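A minimal sketch of the indicator encoding and logistic-regression scoring using scikit-learn; the dimensions, meta-features, and toy training data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

N_SYSTEMS, N_TYPES = 9, 3  # edit types: insertion, deletion, substitution

def encode(proposing_systems, edit_type, meta=()):
    """Binary (system x edit-type) indicators plus optional real-valued meta-features."""
    v = np.zeros(N_SYSTEMS * N_TYPES)
    for s in proposing_systems:
        v[s * N_TYPES + edit_type] = 1.0
    return np.concatenate([v, np.asarray(meta, dtype=float)])

# Toy labeled edits: 1 = edit matches the gold correction, 0 = spurious edit.
X = np.stack([encode({0, 2}, 2, (0.8, 1.0, 1)), encode({5}, 0, (0.3, 0.0, 1))])
y = np.array([1, 0])
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([encode({0, 2, 4}, 2, (0.9, 1.0, 1))])[:, 1])  # raw edit score
```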
3.3 Agreement Boosting and Dual-Threshold Filtering
System agreement is quantified by how many systems propose the same edit:
- Boost factor: a multiplicative factor that grows with the number of agreeing systems.
- Adjusted score: the raw logistic-regression score scaled by the boost factor.
A candidate is accepted only if both its raw score and its agreement-boosted score exceed their respective thresholds, enforcing both raw confidence and agreement.
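A sketch of this dual-threshold filter; the linear boost and the threshold values are assumptions for illustration rather than the paper’s exact settings:

```python
def accept_edit(raw_score, n_agreeing, n_systems=9,
                alpha=0.5, raw_threshold=0.5, boosted_threshold=0.8):
    """Keep an edit only if both raw and agreement-boosted scores pass (illustrative)."""
    agreement = n_agreeing / n_systems
    boost = 1.0 + alpha * agreement          # assumed linear boost in system agreement
    adjusted = min(1.0, raw_score * boost)   # capped, agreement-adjusted score
    return raw_score >= raw_threshold and adjusted >= boosted_threshold

print(accept_edit(raw_score=0.72, n_agreeing=6))  # True: confident and widely agreed
print(accept_edit(raw_score=0.72, n_agreeing=1))  # False: too little agreement
```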
3.4 Non-Maximum Suppression for Conflict Resolution
Inter-edit overlap is measured by the one-dimensional intersection-over-union (IoU) of the edited source spans,
$$\mathrm{IoU}(e_1, e_2) = \frac{|[i_1, j_1) \cap [i_2, j_2)|}{|[i_1, j_1) \cup [i_2, j_2)|}.$$
A greedy non-maximum suppression (NMS) procedure accepts the highest-scoring edits first and discards any later edit whose span overlaps an already accepted edit beyond the IoU threshold, with at most one insertion per position.
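A sketch of this greedy span-level NMS; the IoU threshold value is an assumption:

```python
def span_iou(a, b):
    """One-dimensional IoU between two half-open token spans (i, j)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else float(a == b)  # identical insertions fully overlap

def nms(edits, scores, iou_threshold=0.1):
    """Greedy NMS: keep highest-scoring edits whose spans do not overlap kept ones."""
    kept = []
    for edit in sorted(edits, key=lambda e: scores[e], reverse=True):
        if all(span_iou(edit[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(edit)
    return sorted(kept)  # reorder left to right for application

edits = [(1, 3, "إلى المدرسة"), (2, 3, "المدرسه"), (5, 5, "و")]
scores = {edits[0]: 0.9, edits[1]: 0.6, edits[2]: 0.7}
print(nms(edits, scores))  # the lower-scoring overlapping edit (2, 3, ...) is dropped
```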
4. System Combination in ArbESC+ Framework
4.1 Model Ensemble
The full ArbESC+ system integrates:
- Four sequence-to-sequence GEC models: AraT5, ByT5, mT5, AraBART
- Three AraBART+Morph+GEC models (trained on QALB-14, QALB-15, ZAEBUC)
- Two text-editing models
This ensemble yields nine candidate outputs per sentence.
4.2 Combination and Decision Pipeline
The ensemble workflow is as follows:
- Aggregate unique span edits from all nine systems.
- Encode features for each edit as described above.
- Score with logistic regression.
- Apply agreement boosting and dual-threshold filtering.
- Resolve conflicts via NMS.
- Apply the surviving edits to the source sentence sequentially, left to right (see the sketch after this list).
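A sketch of this final application step, assuming the surviving edits are non-overlapping span edits $(i, j, r)$ over the source tokens:

```python
def apply_edits(source_tokens, edits):
    """Apply non-overlapping span edits left to right to produce the corrected sentence."""
    out, cursor = [], 0
    for i, j, replacement in sorted(edits):
        out.extend(source_tokens[cursor:i])  # copy untouched tokens up to the edit
        if replacement:
            out.extend(replacement.split())  # insertion or substitution text
        cursor = j                           # skip the replaced (or deleted) span
    out.extend(source_tokens[cursor:])
    return " ".join(out)

print(apply_edits("ذهبت الى المدرسه".split(), [(1, 3, "إلى المدرسة")]))
# -> ذهبت إلى المدرسة
```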
4.3 Rationale for Micro-edit Level Combination
Micro-edit based voting enables fine-grained error correction where edits, rather than whole sentences, are the central decision unit. This enables contributions from high-confidence system components even when they disagree on overall sentence structure. Thresholding and agreement-based boosting limit spurious or low-confidence edits, while NMS prevents conflicting alterations on overlapping spans.
5. Empirical Performance and Ablative Analyses
5.1 Comparative Results
| Model | QALB-14 (F₀.₅) | QALB-15 L1 (F₀.₅) | QALB-15 L2 (F₀.₅) |
|---|---|---|---|
| AraBART+Morph+GEC (2014) | 76.20% | 78.85% | 52.00% |
| AraBART+Morph+GEC (2015) | 77.99% | 77.97% | 60.98% |
| AraBART+Morph+GEC (ZAEBUC) | 77.85% | 77.73% | 60.79% |
| ArbESC+ (all 9 combined) | 82.63% | 84.64% | 65.55% |
ArbESC+ outperforms single models by 4–6 F₀.₅ points across all benchmarks, establishing new state-of-the-art performance for Arabic GEC.
5.2 System Combination vs. Baselines
Majority voting, weighted voting, minimum Bayes risk (MBR) decoding, and standard ESC system combination are all surpassed by ArbESC+ by 1–3 F₀.₅ points on each evaluation split.
5.3 Impact of the Number of Models
Ablation results show that using only the best 3–5 models achieves F₀.₅ scores of 80.71–80.77 on QALB-14, compared with 82.63 for the full 9-model ArbESC+ system. Including all nine systems but without the selection combiner yields F₀.₅ = 80.78, indicating that the edit-level combination pipeline contributes further gains.
5.4 Threshold Sensitivity
The dual-threshold filtering is sensitive: threshold values below 0.5 admit too many low-quality edits and depress F₀.₅, whereas values above 0.9 sacrifice recall. Intermediate values up to roughly $0.8$ deliver the strongest results.
5.5 Effect of Morphological Features
AraBART+Morph+GEC’s explicit use of morphological embeddings and the parallel GED objective yields an improvement of roughly 2 F₀.₅ points over vanilla AraBART, confirming the value of integrating linguistic features into the proposal models for Arabic GEC.
6. Summary and Significance
AraBART+Morph+GEC augments the standard Arabic BART transformer with detailed morphological features and a grammatical error detection head, producing more accurate and linguistically informed error corrections. Its variants serve as black-box proposal generators within ArbESC+, whose classifier pipeline integrates proposals from nine diverse systems, leverages model agreement, filters candidates with calibrated confidence thresholds, and resolves conflicts via span-level NMS. With final F₀.₅ scores of 82.63%, 84.64%, and 65.55% on the QALB-14, QALB-15 L1, and QALB-15 L2 benchmarks, AraBART+Morph+GEC, especially within ArbESC+, sets a new state of the art for Arabic grammatical error correction and exemplifies the impact of combining neural and morphological approaches (Alrehili et al., 18 Nov 2025).