SGECF: Self-Explainable Grammatical Feedback

Updated 18 April 2026

SGECF is a framework that extends traditional grammar correction by providing explicit error types, evidence spans, and natural language explanations.
It employs diverse methodologies—including generation, labeling, and interaction-based models—to correct errors in both written and spoken L2 contexts.
Evaluation leverages token-level metrics, human feedback ratings, and pseudo-labeling to scale and measure system performance.

Self-explainable Grammatical Error Correction Feedback (SGECF), sometimes shortened to Grammatical Error Correction Feedback, refers to the generation and evaluation of corrective, pedagogically meaningful responses in both written and spoken second language (L2) learning contexts. SGECF extends conventional grammatical error correction (GEC): in addition to correcting errors, it gives learners explicit, structured feedback (error types, evidence spans, and natural-language explanations) that supports language acquisition and self-reflection. The domain integrates methods from natural language processing, educational technology, and speech processing, with a particular focus on using large pre-trained and foundation models to generate, explain, and implicitly evaluate feedback in both written and spoken modalities.

1. Core Definitions and Objectives

SGECF focuses on holistic, actionable responses to learner grammatical errors, which comprise:

Correction: Providing the accurate form of an erroneous utterance or text.
Explanation: Identifying the reason for the error, typically as an error type from a linguistic taxonomy.
Evidence identification: Highlighting words or spans in the learner sentence that triggered the correction.

Formally, in the EXPECT experimental protocol for written GEC, the SGECF task receives as input a pair $(X, Y)$, where $X$ is the erroneous source and $Y$ its corrected counterpart. The SGECF system outputs an error-type label $c \in C$ (from a predefined category set $C$), together with evidence spans $E_X \subseteq \mathrm{tokens}(X)$ that justify the correction (Fei et al., 2023).
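
The input and output of this task can be pictured as a small record. The sketch below is illustrative only: the class name `ExplainedCorrection`, its field names, the half-open span convention, and the example sentence are assumptions made here, not the EXPECT data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExplainedCorrection:
    """One SGECF instance: the pair (X, Y) plus its explanation (hypothetical layout)."""
    source_tokens: List[str]                # X: erroneous sentence, tokenized
    corrected_tokens: List[str]             # Y: corrected sentence, tokenized
    error_type: str                         # c: one label from the category set C
    evidence_spans: List[Tuple[int, int]]   # E_X: half-open [start, end) token spans in X

    def evidence_tokens(self) -> List[str]:
        """Return the evidence words in sentence order, e.g. for highlighting."""
        return [
            token
            for start, end in self.evidence_spans
            for token in self.source_tokens[start:end]
        ]


# Toy example: the subject "She" is the evidence for a subject-verb agreement fix.
example = ExplainedCorrection(
    source_tokens="She go to school every day".split(),
    corrected_tokens="She goes to school every day".split(),
    error_type="SVA",
    evidence_spans=[(0, 1)],
)
print(example.error_type, example.evidence_tokens())
```

Storing spans as token offsets rather than strings keeps the evidence unambiguous when the same word occurs more than once in the sentence.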

In spoken GEC, including end-to-end (E2E) systems, SGECF must also handle disfluencies, ASR errors, and the lack of explicit alignment between speech and correction. Here, feedback is increasingly derived by extracting edits between a disfluent (ASR) or fluently transcribed hypothesis and the GEC-corrected hypothesis, with the edits classified using ERRANT-style schemes (Bannò et al., 2023, Qian et al., 27 May 2025, Qian et al., 23 Jun 2025).

2. Data Resources and Annotation Schemes

A central advance is the EXPECT dataset, containing over 21,000 annotated correction instances. Annotations comprise, for each (erroneous, corrected) sentence pair:

A single error-type label drawn from a 15-class taxonomy (including prepositions, verb tense, collocation, subject-verb agreement (SVA), and others).
One or more evidence token spans in the erroneous sentence.
Gold and system outputs for both raw and GEC-corrected sentences.

Cognitive granularity is enforced by grouping error types into single-word errors, inter-word morphology/syntax errors, and discourse-level errors. This annotation allows systems to be trained and evaluated not only on correction but also on explainability, that is, evidence extraction plus type assignment (Fei et al., 2023).

For spoken SGECF, annotated speech corpora remain scarce (under 80 hours of labeled L2 speech), so pseudo-labeling approaches are widely used. These automatically generate correction targets and edit annotations via cascaded pipelines that chain ASR, disfluency detection, text-based GEC, and edit extraction using MaxMatch (M²) and ERRANT (Qian et al., 27 May 2025, Qian et al., 23 Jun 2025).
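
The cascade described above can be sketched as a chain of pluggable stages. This is a minimal illustration, not the authors' pipeline: `pseudo_label`, `PseudoLabel`, and the three toy stage functions are hypothetical names, and the toy stages are trivial stand-ins for a real ASR model, disfluency tagger, and text GEC model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PseudoLabel:
    audio_id: str
    asr_text: str        # raw, possibly disfluent transcript
    fluent_text: str     # transcript after disfluency removal
    corrected_text: str  # GEC output, used as the training target


def pseudo_label(
    audio_ids: Iterable[str],
    transcribe: Callable[[str], str],
    remove_disfluencies: Callable[[str], str],
    correct_grammar: Callable[[str], str],
) -> List[PseudoLabel]:
    """Run each utterance through ASR -> disfluency removal -> GEC."""
    labels = []
    for audio_id in audio_ids:
        asr_text = transcribe(audio_id)
        fluent_text = remove_disfluencies(asr_text)
        corrected_text = correct_grammar(fluent_text)
        labels.append(PseudoLabel(audio_id, asr_text, fluent_text, corrected_text))
    return labels


# Toy stand-ins so the sketch runs without any model (assumptions, not real systems).
FILLERS = {"um", "uh", "er"}


def toy_asr(audio_id: str) -> str:
    return {"utt1": "um she go to school"}[audio_id]


def toy_disfluency_removal(text: str) -> str:
    return " ".join(word for word in text.split() if word not in FILLERS)


def toy_gec(text: str) -> str:
    return text.replace("she go ", "she goes ")


labels = pseudo_label(["utt1"], toy_asr, toy_disfluency_removal, toy_gec)
print(labels[0])
```

Keeping all three intermediate texts in the record matters for feedback: edits are later extracted between the fluent and corrected transcripts, not between the audio and the final output.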

3. System Architectures and Methodological Variants

Written SGECF

Multiple modeling paradigms are adopted:

Generation-based: Seq2seq models (e.g., BART-large) directly generate corrected sentences interleaved with special markers for evidence spans and error-type appendices (Fei et al., 2023).
Labeling-based: Formulated as span-tagging via BERT-like encoders, using concatenated embeddings of source, correction, correction indicators, and optionally syntax. Outputs are label distributions over tokens (evidence/type/O).
Interaction-based: Bi-affine models encode source and correction separately; span-to-span interaction scores and error-type distributions are predicted for each token pair. Integration of syntactic priors (dependency-projected) further boosts precision.
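
For the labeling-based variant, per-token predictions have to be turned back into evidence spans and a sentence-level error type. The sketch below assumes a hypothetical tag scheme in which every evidence token carries the error-type name and every other token carries "O"; the function name `decode_tags` and the majority-vote rule are illustrative choices, not details taken from Fei et al. (2023).

```python
from collections import Counter
from itertools import groupby
from typing import List, Optional, Tuple


def decode_tags(tags: List[str]) -> Tuple[Optional[str], List[Tuple[int, int]]]:
    """Convert per-token tags into (error_type, evidence spans).

    Spans are half-open [start, end) token offsets. The sentence-level
    error type is the most frequent non-"O" tag, or None if there is none.
    """
    spans: List[Tuple[int, int]] = []
    votes: Counter = Counter()
    index = 0
    for tag, group in groupby(tags):
        length = len(list(group))
        if tag != "O":
            spans.append((index, index + length))
            votes[tag] += length
        index += length
    error_type = votes.most_common(1)[0][0] if votes else None
    return error_type, spans


# "She go to school every day": only the subject is tagged as evidence.
print(decode_tags(["SVA", "O", "O", "O", "O", "O"]))
```

Because `groupby` merges only consecutive identical tags, two adjacent tokens with different predicted types end up as two separate spans.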

Spoken SGECF

Three main paradigms are prevalent (Bannò et al., 2023, Qian et al., 27 May 2025, Qian et al., 23 Jun 2025):

Cascaded: ASR (Whisper), external disfluency detection (BERT tagger), followed by text-based GEC (BART). Edits and evidence are derived from aligned outputs.
Partial-cascaded: Whisper fine-tuned for disfluency removal, followed by text-based GEC.
End-to-End (E2E): Whisper (small, large-v2, etc.) fine-tuned from raw audio directly to GEC-corrected transcripts, optionally conditioned (during training and inference) on prompt text representations (such as fluent transcriptions).

Automatic feedback is generated by diffing the fluent transcript against the GEC-corrected transcript and classifying the resulting edits.
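
The diff-then-classify step can be sketched with the Python standard library. Note the substitution: this uses `difflib.SequenceMatcher` as a simple stand-in for M² or ERRANT alignment, and `coarse_type` reproduces only the coarse operation prefix (M for missing, U for unnecessary, R for replacement), not ERRANT's full linguistic categories. The names `Edit`, `extract_edits`, and `coarse_type` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple


class Edit(NamedTuple):
    op: str          # "replace", "delete", or "insert"
    start: int       # half-open token span in the fluent transcript
    end: int
    original: str
    correction: str


def extract_edits(fluent: str, corrected: str) -> List[Edit]:
    """Token-level diff between a fluent transcript and its GEC correction."""
    source, target = fluent.split(), corrected.split()
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        edits.append(Edit(op, i1, i2, " ".join(source[i1:i2]), " ".join(target[j1:j2])))
    return edits


def coarse_type(edit: Edit) -> str:
    """Map a diff operation to a coarse M/U/R operation label."""
    return {"insert": "M", "delete": "U", "replace": "R"}[edit.op]


edits = extract_edits("she go to the school yesterday", "she went to school yesterday")
for edit in edits:
    print(coarse_type(edit), repr(edit.original), "->", repr(edit.correction))
```

A real system would replace both functions with a linguistically informed aligner and classifier, since a plain diff cannot tell a verb-tense error from a lexical substitution.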

Architecture Table: SGECF System Characteristics

System Type	Input	Feedback Source
Written-tagging	(X, Y)	Direct (token/label output)
E2E Spoken	Audio (+prompt)	M² edits between ASR/GEC hypotheses, ERRANT
Cascaded Spoken	Audio → Text steps	Explicit corrections and evidence from pipeline

All system types converge on extracting edits, assigning error types, and surfacing evidence to users, with E2E models currently relying more heavily on post-hoc edit extraction due to architectural opacity.

4. Evaluation Frameworks and Metrics

SGECF adopts both explicit and implicit evaluation strategies.

Explicit Metrics

Token-level precision, recall, $F_1$, and $F_{0.5}$ for evidence span extraction and type prediction. $F_{0.5}$, which weights precision twice as heavily as recall, is favored for learner-facing utility.
Sentence-level exact match: fraction of sentences for which both evidence and type exactly match gold.
Label accuracy: correct prediction of error type.
Human evaluation: L2 learners' ratings of feedback helpfulness.
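
The token-level and sentence-level metrics above can be computed as follows. This is a sketch for a single sentence; the function names are hypothetical, and a corpus-level score would normally accumulate true-positive counts over all sentences before dividing, which may differ from averaging per-sentence scores.

```python
from typing import FrozenSet, Sequence, Set, Tuple


def precision_recall_fbeta(
    gold: Set[int], pred: Set[int], beta: float = 0.5
) -> Tuple[float, float, float]:
    """Token-level P, R, and F_beta over sets of evidence token indices."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    beta_sq = beta * beta
    f_beta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_beta


Explanation = Tuple[str, FrozenSet[int]]  # (error type, evidence token indices)


def exact_match_rate(gold: Sequence[Explanation], pred: Sequence[Explanation]) -> float:
    """Fraction of sentences where both error type and evidence match gold."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# One of two gold evidence tokens found, no false positives.
p, r, f05 = precision_recall_fbeta(gold={0, 3}, pred={0})
_, _, f1 = precision_recall_fbeta(gold={0, 3}, pred={0}, beta=1.0)
print(f"P={p:.2f} R={r:.2f} F0.5={f05:.3f} F1={f1:.3f}")
```

In this toy case the prediction is precise but incomplete, so $F_{0.5}$ comes out higher than $F_1$, which illustrates how the metric rewards systems that avoid spurious evidence.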

Labeling-based SGECF models with syntax augmentation reach test $F_{0.5} \approx 69\%$ and label accuracy $\approx 81.8\%$ (Fei et al., 2023). Human raters also judged the helpfulness of both system-generated and gold explanations.

Implicit (Comparative) Evaluation

The grammatical lineup framework (Bannò et al., 2024) is a key implicit methodology. Here:

For each essay and grammatical feedback item, a lineup of essay versions is constructed, where each version has a different percentage of its errors corrected.
An LLM is prompted to match feedback to the correct essay, or vice versa, and its binary "yes/no" responses are used to compute an essay-based or a feedback-based matching score.
Discrimination accuracy is the number of correct identifications divided by the total number of items.

Key results show that feedback discrimination accuracy (manually generated feedback, non-lexical M² lineup) decreases as the foil count increases, while the confusion matrices retain a robust leading diagonal (Bannò et al., 2024).
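
One way to picture the lineup procedure in code is shown below. Everything here is an assumption made for illustration: `yes_score` stands in for whatever quantity is derived from the LLM's yes/no response, `toy_judge` is a string-matching placeholder rather than an LLM, and picking the highest-scoring candidate is just one plausible decision rule.

```python
from typing import Callable, List, Sequence, Tuple

LineupItem = Tuple[str, Sequence[str], int]  # (feedback, essay lineup, index of true essay)


def pick_from_lineup(
    feedback: str, lineup: Sequence[str], yes_score: Callable[[str, str], float]
) -> int:
    """Return the index of the essay the judge considers the best match."""
    scores = [yes_score(essay, feedback) for essay in lineup]
    return max(range(len(lineup)), key=scores.__getitem__)


def discrimination_accuracy(
    items: List[LineupItem], yes_score: Callable[[str, str], float]
) -> float:
    """Correct identifications divided by the number of lineup items."""
    correct = sum(
        pick_from_lineup(feedback, lineup, yes_score) == target
        for feedback, lineup, target in items
    )
    return correct / len(items)


def toy_judge(essay: str, feedback: str) -> float:
    """Placeholder judge: 'yes' if the quoted erroneous phrase occurs in the essay."""
    erroneous_phrase = feedback.split("'")[1]
    return 1.0 if f" {erroneous_phrase} " in f" {essay} " else 0.0


items = [
    ("Change 'she go' to 'she goes'", ["she go to school", "she goes to school"], 0),
    ("Change 'a apple' to 'an apple'", ["I ate an apple", "I ate a apple"], 1),
]
print(discrimination_accuracy(items, toy_judge))
```

Adding more foils to each lineup makes the task harder for any judge, which is consistent with accuracy falling as the foil count grows.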

Spoken Feedback Evaluation

WER/TER: Standard word error rate/translation edit rate against corrected references.
Feedback $F_{0.5}$ (ERRANT-classified edits): Evaluates edit extraction for feedback rather than only overall transcript correction.
Precision/Recall: For the number and accuracy of the edits surfaced as feedback.
Reference alignment and filtering: Procedures to prune spurious edits caused by ASR or transcription noise, including confidence-based edit filtering with a tuned threshold (Qian et al., 23 Jun 2025).
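
Confidence-based edit filtering can be sketched as a threshold sweep on a development set. The source of the confidence values is not specified here, so the `confidence` field is an assumption (it could, for instance, come from decoder probabilities), and `ScoredEdit`, `filter_edits`, `edit_f_beta`, and `best_threshold` are hypothetical names.

```python
from typing import List, NamedTuple, Sequence, Set, Tuple

EditKey = Tuple[Tuple[int, int], str]  # ((start, end), correction)


class ScoredEdit(NamedTuple):
    span: Tuple[int, int]
    correction: str
    confidence: float  # assumed to be in [0, 1]


def filter_edits(edits: Sequence[ScoredEdit], threshold: float) -> List[ScoredEdit]:
    """Keep only edits whose confidence reaches the threshold."""
    return [edit for edit in edits if edit.confidence >= threshold]


def edit_f_beta(edits: Sequence[ScoredEdit], gold: Set[EditKey], beta: float = 0.5) -> float:
    """F_beta of predicted edits against gold edits, matched on span and correction."""
    predicted = {(edit.span, edit.correction) for edit in edits}
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def best_threshold(
    edits: Sequence[ScoredEdit], gold: Set[EditKey], candidates: Sequence[float]
) -> float:
    """Pick the candidate threshold that maximizes F_0.5 on this data."""
    return max(candidates, key=lambda t: edit_f_beta(filter_edits(edits, t), gold))


scored = [
    ScoredEdit((1, 2), "went", 0.9),  # correct edit
    ScoredEdit((3, 4), "", 0.8),      # correct deletion
    ScoredEdit((5, 6), "a", 0.3),     # spurious edit, e.g. caused by an ASR error
]
gold_edits = {((1, 2), "went"), ((3, 4), "")}
print(best_threshold(scored, gold_edits, [0.0, 0.5, 0.95]))
```

Too low a threshold lets spurious edits through and hurts precision, while too high a threshold discards correct feedback, so the threshold is best chosen on held-out data.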

5. Scaling, Prompting, and Data Augmentation

A persistent limitation for SGECF, especially in the spoken domain, is the scarcity of labeled data. To expand coverage and training efficacy:

Pseudo-labeling: Automatically generating large-scale training data (raw audio → ASR → fluent transcript → GEC correction) using strong existing models for each subtask. This substantially increases the amount of labeled training data and boosts E2E system performance (Qian et al., 27 May 2025, Qian et al., 23 Jun 2025).
Prompt-based conditioning: Integrating fluent transcripts as additional context ("prompt text") to the decoder in E2E models. Prompting consistently yields lower WER and higher feedback $F_{0.5}$, especially for higher-proficiency test-takers (Qian et al., 27 May 2025).
Model scaling: Larger Whisper variants (large-v2) outperform smaller ones and surpass strong cascaded baselines for both correction and feedback extraction.
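
WER, the metric used above to compare prompted, scaled, and cascaded systems, is the word-level edit distance between hypothesis and reference divided by the reference length. A minimal reference implementation is sketched below; the function name is hypothetical, and published results typically apply text normalization before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution (goes -> go) and one insertion (the) against 4 reference words.
print(word_error_rate("she goes to school", "she go to the school"))
```

Because the reference here is the corrected transcript, a lower WER indicates output closer to the grammatically corrected target, not merely more accurate recognition.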

A plausible implication is that as foundation models scale and conditioning techniques mature, E2E SGECF will further absorb formerly cascaded components, eventually enabling direct, interpretable feedback generation from speech with minimal external tagging or alignment.

6. Key Findings, Limitations, and Future Directions

Key findings:

SGECF systems trained with expectation-aligned evidence and type supervision offer strong performance and demonstrable L2 learner utility (Fei et al., 2023).
Feedback generation quality benefits substantially from explicit GEC input and chain-of-thought style prompting, especially in lineup-based evaluation (Bannò et al., 2024).
For spoken SGECF, pseudo-labeling, prompt-based training, and model scaling yield WER that matches or improves on strong cascaded baselines on the Linguaskill (LNG) and Speak & Improve (S&I) data, with feedback $F_{0.5}$ measured after alignment and filtering (Qian et al., 27 May 2025, Qian et al., 23 Jun 2025).
The grammatical lineup (implicit evaluation) is data-efficient and annotation-agnostic, facilitating flexible benchmarking.

Limitations:

Joint correction-and-explanation seq2seq architectures remain underexplored; current pipelines often separate correction from explainability.
Evidence extraction degrades on longer, more complex sentences and depends critically on parsing accuracy.
Spoken SGECF requires output alignment and feedback confidence estimation to limit the propagation of ASR errors into learner feedback.

Future research may pursue integrated correction+explanation objectives, improved syntactic/semantic span representations, richer error-type taxonomies (including pragmatic/discourse-level phenomena), foundation model adaptation for multi-task tagging, and evaluation metrics that more precisely reflect learner-perceived feedback utility and the specificities of spoken L2 production (Fei et al., 2023, Bannò et al., 2023, Bannò et al., 2024, Qian et al., 27 May 2025, Qian et al., 23 Jun 2025).

7. Practical Implications and Applications

SGECF enables principled, scalable, and learner-adaptive grammar feedback in educational technology, including:

Automated essay scoring and CALL writing platforms;
Speaking proficiency systems, e.g., Linguaskill and Speak & Improve, supporting dynamic corrective feedback;
Teacher-assistive tools that surface actionable error explanations and error type distributions for summative and formative assessment.

The field is converging on E2E architectures supported by prompt-based contextualization, edit confidence modeling, and lineup-based evaluation for robust, transparent, and pedagogically aligned grammatical feedback generation.