Arab Enhanced Edit Selection System (ArbESC+)
- The paper introduces ArbESC+, the first Arabic GEC framework that fuses diverse neural and edit-based models for error correction on benchmark datasets.
- It employs a dual-stage pipeline where correction proposals are generated and refined using a logistic regression classifier with consensus boosting and dual-threshold filtering.
- Evaluations demonstrate significant F0.5 score improvements over individual models, validating its effective conflict resolution and inter-system agreement techniques.
Arab Enhanced Edit Selection System (ArbESC+) is a multi-system computational framework designed for grammatical error correction (GEC) in Arabic, integrating diverse neural and edit-based correction proposals within a principled selection and conflict-resolution regime. ArbESC+ is notable for being the first Arabic GEC system to combine multiple state-of-the-art models using an explicit edit-selection classifier, consensus boosting, and span-level non-maximum suppression (NMS), achieving leading performance on the QALB-14 and QALB-15 benchmarks (Alrehili et al., 18 Nov 2025). The system aggregates edit proposals, numerically encodes system agreement, and applies learned decision rules to maximize correction accuracy while resolving overlaps.
1. Pipeline Structure and Model Integration
ArbESC+ is architected as a two-stage pipeline:
- Correction Proposal Generation: Multiple GEC systems independently generate corrected hypotheses for each input sentence. These systems comprise:
- Seq2Seq Transformers: AraT5, ByT5, mT5 (each fine-tuned jointly on QALB-14, QALB-15, and ZAEBUC corpus), and AraBART (base).
- Morphology-Aware Variant: AraBART+Morph+GEC13 (trained on QALB-14, QALB-15, and ZAEBUC), utilizing morphologically preprocessed input and grammatical-error-detection (GED) labels.
- Text Editing System: Sequence labeler using “NoPnx” and “Pnx” taggers (AraBERTv02-based) for token-level operations (Keep/Delete/Replace/Insert), specialized for punctuation vs. non-punctuation.
Each of the six systems (nine model instances total) outputs a corrected hypothesis $h_k$ for the source sentence $x$.
- Edit Extraction, Encoding, and Selection: Hypotheses are aligned to the source to extract “micro-edits” represented as tuples $(i, j, r)$, where source tokens $x_i \dots x_{j-1}$ are replaced with the string $r$. Each edit is labeled by type $t \in \{\text{insert}, \text{replace}, \text{delete}\}$. Candidate edits from all systems are aggregated to form the unified set $E$.
Edits are then encoded as feature vectors and subjected to a learning-based selection, conflict filtering, and application process.
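The paper does not include reference code; as an illustration of the alignment step, the following sketch extracts micro-edits using Python's `difflib` (the actual system may use a different alignment algorithm, and the tuple layout here is assumed):

```python
import difflib

def extract_micro_edits(source_tokens, hyp_tokens):
    """Align a hypothesis to the source and extract micro-edits (i, j, r, type).

    A micro-edit replaces source tokens x_i .. x_{j-1} with the string r.
    """
    edits = []
    matcher = difflib.SequenceMatcher(a=source_tokens, b=hyp_tokens, autojunk=False)
    for op, i, j, k, l in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged span, no edit
        replacement = " ".join(hyp_tokens[k:l])  # empty string for deletions
        etype = {"insert": "insert", "delete": "delete", "replace": "replace"}[op]
        edits.append((i, j, replacement, etype))
    return edits
```

Running the extractor over each system's hypothesis and pooling the results yields the unified candidate set described above.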
2. Numerical Feature Representation and Agreement Metrics
Each micro-edit is represented through a set of engineered features:
- One-hot Model×Type Indicators: binary features $m_{k,t} \in \{0, 1\}$, with $m_{k,t} = 1$ iff hypothesis $h_k$ proposes edit $e$ with type $t$.
- System Agreement Count: $a(e) = \sum_{k=1}^{N} \mathbb{1}[h_k \text{ proposes } e]$, quantifying the number of systems that independently suggest $e$.
- Edit Span IoU (Overlap): For two edits $e_1 = (i_1, j_1, r_1)$ and $e_2 = (i_2, j_2, r_2)$, their span intersection-over-union is $\mathrm{IoU}(e_1, e_2) = \frac{|[i_1, j_1) \cap [i_2, j_2)|}{|[i_1, j_1) \cup [i_2, j_2)|}$. This metric is central to span-based NMS.
The feature representation enables the classifier to exploit both the provenance of each edit and the inter-system agreement.
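A minimal sketch of this feature encoding, assuming edits are `(i, j, r, type)` tuples and each system's proposals are kept as a set; the exact feature layout in ArbESC+ may differ:

```python
def span_iou(e1, e2):
    """Span intersection-over-union of two edits (i, j, r, type) on the source axis."""
    inter = max(0, min(e1[1], e2[1]) - max(e1[0], e2[0]))
    union = (e1[1] - e1[0]) + (e2[1] - e2[0]) - inter
    return inter / union if union > 0 else 0.0

def encode_edit(edit, proposals, systems, types=("insert", "replace", "delete")):
    """Build the feature vector for one candidate edit.

    Features: one-hot model-by-type indicators m_{k,t}, followed by the
    agreement count a(e), i.e. how many systems propose this exact edit.
    """
    etype = edit[3]
    feats = []
    agreement = 0
    for k in systems:
        proposed = edit in proposals[k]
        if proposed:
            agreement += 1
        for t in types:
            feats.append(1.0 if proposed and etype == t else 0.0)
    feats.append(float(agreement))
    return feats
```

For example, an edit proposed by two of two systems gets agreement count 2 and two active model×type indicators.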
3. Classifier-Based Edit Selection and Decision Process
A logistic regression classifier operates on all candidate edit feature vectors $\phi(e)$ to compute raw selection probabilities $p(e) = \sigma(\mathbf{w}^\top \phi(e) + b)$, with $\mathbf{w}$ the learned parameter vector, $b$ the bias, and $\sigma$ the sigmoid function. Training minimizes binary cross-entropy loss against the corpus gold edits.
Consensus Reinforcement (“Agreement Boosting”): To reward edits proposed by multiple systems, a capped-linear boost $\mathrm{boost}(e) = \min(\alpha \cdot a(e), \beta)$ is applied, giving adjusted probabilities $\tilde{p}(e) = \min(p(e) + \mathrm{boost}(e), 1)$, where $\alpha$ and $\beta$ are hyperparameters.
Dual-threshold Filtering: Edits are retained only if both $\tilde{p}(e) \ge \tau_p$ and $a(e) \ge \tau_a$ (with thresholds $\tau_p$, $\tau_a$ set as hyperparameters), enforcing base confidence and consensus criteria.
Conflict Resolution via NMS: After filtering, surviving edits are sorted by $\tilde{p}(e)$ and greedily selected under two constraints:
- At most one insertion per position.
- No two selected edits with $\mathrm{IoU}(e_1, e_2) > \tau_{\mathrm{IoU}}$ (span overlap threshold typically $0.0$–$0.3$).
Edits are then applied left-to-right to the source to produce the final corrected output $\hat{y}$.
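The scoring, boosting, dual-threshold filtering, and NMS steps can be sketched as follows; the helper names and the hyperparameter values in the example call are illustrative, not from the paper:

```python
import math

def _iou(e1, e2):
    """Span intersection-over-union of two edits (i, j, r, type)."""
    inter = max(0, min(e1[1], e2[1]) - max(e1[0], e2[0]))
    union = (e1[1] - e1[0]) + (e2[1] - e2[0]) - inter
    return inter / union if union > 0 else 0.0

def select_edits(candidates, w, b, alpha, beta, tau_p, tau_a, tau_iou):
    """Score, boost, filter, and NMS-select candidate edits.

    candidates: list of (edit, feature_vector, agreement) tuples,
    where edit = (i, j, r, type).
    """
    survivors = []
    for edit, feats, agreement in candidates:
        p = 1.0 / (1.0 + math.exp(-(sum(wi * fi for wi, fi in zip(w, feats)) + b)))
        p_adj = min(p + min(alpha * agreement, beta), 1.0)  # capped-linear boost
        if p_adj >= tau_p and agreement >= tau_a:           # dual thresholds
            survivors.append((p_adj, edit))
    survivors.sort(key=lambda s: -s[0])                     # best-first for NMS

    selected, insert_pos = [], set()
    for _, edit in survivors:
        i, _, _, etype = edit
        if etype == "insert" and i in insert_pos:
            continue  # at most one insertion per position
        if any(_iou(edit, kept) > tau_iou for kept in selected):
            continue  # suppress overlapping spans
        selected.append(edit)
        if etype == "insert":
            insert_pos.add(i)
    return sorted(selected)  # left-to-right application order
```

With $\tau_a = 2$, an edit proposed by only one system is dropped regardless of its classifier score, while two non-overlapping high-scoring edits both survive NMS.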
| Step | Operation | Constraints |
|---|---|---|
| Edit Extraction | Align hypotheses to source, extract $(i, j, r, t)$ | None |
| Feature Encoding | Build $m_{k,t}$ (model×type one-hot), $a(e)$ | |
| Scoring | Compute $p(e)$, apply boosting | $\alpha$, $\beta$ hyperparameters |
| Dual-Threshold Filtering | $\tilde{p}(e) \ge \tau_p$, $a(e) \ge \tau_a$ | $\tau_p$, $\tau_a$ thresholds |
| NMS Conflict Elimination | IoU-based suppression, insertion uniqueness | $\tau_{\mathrm{IoU}}$ (span overlap) |
4. Evaluation Metrics and Empirical Performance
ArbESC+ is evaluated using the standard MaxMatch ($M^2$) metric with $\beta = 0.5$, defined as $F_{0.5} = \frac{(1 + 0.5^2)\, P \cdot R}{0.5^2 \cdot P + R}$, with precision $P$ = (# correct selected edits)/(# proposed edits) and recall $R$ = (# correct selected edits)/(# gold edits).
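For concreteness, the precision-weighted $F_{0.5}$ used throughout the evaluation can be computed as:

```python
def f_beta(p, r, beta=0.5):
    """F_beta from precision p and recall r; beta=0.5 weights precision twice as much."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

Because $\beta = 0.5$, a system that over-generates edits (high recall, low precision) is penalized more than a conservative one.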
ArbESC+ achieves:
- QALB-14 L1 Test: (Precision , Recall )
- QALB-15 L1 Test: (Precision , Recall )
- QALB-15 L2 Test: (Precision , Recall )
Compared to the best single model (AraT5: , , ), ArbESC+ yields absolute improvements of , , and $F_{0.5}$ points on QALB-14 L1, QALB-15 L1, and QALB-15 L2, respectively. Detailed error-type analysis across ARETA’s taxonomy (Merge, Morphology, Orthography, Punctuation, Semantic, Split, Syntax) shows ArbESC+ marginally outperforms individual systems in all categories on QALB-14 and QALB-15 L1, and is more stable on challenging L2 data (e.g., +3 points on Syntax over AraT5).
5. Conflict Handling and Support Techniques
The suite of support and conflict-resolution techniques addresses the inherent challenge of overlapping and inconsistent multi-system edits:
- Span-Level NMS: Ensures only non-overlapping span edits survive, maximizing global consistency by permitting the highest-confidence proposals and suppressing conflicting alternatives.
- Unique Insertions: At most one insertion at any position, avoiding redundant or contradictory local changes.
- Agreement Boosting: Explicitly increases preference for corrections supported by multiple systems, providing a reliability estimate based on intermodel consensus.
- Dual Thresholds: Avoids over-reliance on either model confidence or consensus alone by requiring both criteria for edit retention.
These mechanisms collectively enforce coherent selection while leveraging complementary system outputs.
6. Linguistic Error Typology and Extendability
Each edit is associated with a coarse-grained edit type $t \in \mathcal{T} = \{\text{insert}, \text{replace}, \text{delete}\}$, derived from the Bryant et al. (2017) schema and encoded within the classifier. This enables the model to, for instance, learn differential reliability across model/type pairs (e.g., replacements from one system may be favored over insertions from another). While the current system restricts $\mathcal{T}$ to three types, extension to finer linguistic categories (e.g., “morphological agreement error,” “preposition misuse,” “definiteness”) is natural by expanding $\mathcal{T}$ and the associated feature representation.
Example Corrections:
- Orthographic: AraT5 proposes the replacement “المكتبه” → “المكتبة.”
- Morphological: “ذهبت الى السوقون” → AraBART+Morph suggests replacement to “الأسواقِ.”
- Syntactic: Text-editing system inserts missing conjunction “و.”
This type tagging supports integration with detailed error taxonomies and linguistic analysis.
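Under the span-and-replacement edit representation above, the coarse type follows mechanically from the edit itself; a minimal sketch:

```python
def edit_type(i, j, replacement):
    """Derive the coarse edit type from a micro-edit (i, j, r):
    empty source span -> insert; empty replacement -> delete; otherwise replace."""
    if i == j:
        return "insert"
    if replacement == "":
        return "delete"
    return "replace"
```

A finer taxonomy would instead require a linguistic classifier over the source and replacement strings, which is where the proposed extension of $\mathcal{T}$ comes in.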
7. Comparative Significance and Contributions
ArbESC+ represents the first Arabic GEC approach to harness system combination with principled, learned edit selection and explicit conflict management. By pooling the strengths of full-sentence seq2seq models (deep syntactic and semantic repairs) with fine-grained edit-based systems (surgical token-level fixes), then employing statistical, data-driven selection and consensus-based reliability estimation, ArbESC+ establishes new performance highs for Arabic GEC, with uplift even on learner-generated L2 data (Alrehili et al., 18 Nov 2025). This framework is extendable to finer linguistic categories, supporting future research and system development in Arabic text processing.