FEANEL: Fine-grained Error Analysis Benchmark
- The paper introduces a comprehensive framework for error analysis in K–12 English writing using a 29-category taxonomy and severity scale.
- It employs a rigorous three-stage annotation protocol achieving strong inter-annotator agreement (Cohen’s κ ≈ 0.82 for error types).
- Evaluation of 16 LLMs reveals significant performance gaps versus human annotators, highlighting the need for enhanced feedback mechanisms in educational contexts.
The Fine-grained Error Analysis for English Learners (FEANEL) Benchmark is a rigorously constructed resource designed to evaluate large language models' (LLMs') capacity for detailed error analysis of K–12 English writing. FEANEL systematically addresses the annotation of linguistic errors in student essays, supplying a structured taxonomy, a severity rubric, and a comprehensive evaluation framework. The benchmark reveals both the progress and the current limitations of LLMs in delivering nuanced, pedagogically relevant feedback for English language learners (Ye et al., 28 Nov 2025).
1. Benchmark Construction and Dataset Properties
FEANEL comprises 1,000 anonymized essays equally partitioned between two educational cohorts: 500 elementary-school essays (ages 9–11, sourced from global EFL learners via online platforms) and 500 secondary-school essays (ages 12–18, drawn from the Chinese EFL TECCL corpus). Essay prompts were curated to ensure age-appropriateness and topical familiarity in domains such as family, school, hobbies, and travel. Filtration procedures eliminated off-topic, excessively brief, error-free, or privacy-compromising submissions. The post-cleaning corpus exhibits marked differences in lexical density and error frequency: elementary essays average 48.2 words and 6.01 edits per essay, whereas secondary essays average 127.1 words and 11.34 edits per essay, yielding a total of 8,674 error analyses (3,003 elementary, 5,671 secondary).
| Subset | Avg. Words | Avg. Edits | Total Edits |
|---|---|---|---|
| Elementary (500) | 48.2 | 6.01 | 3,003 |
| Secondary (500) | 127.1 | 11.34 | 5,671 |
Expert annotation followed a three-stage protocol:
- Error detection & correction: Minimal edits applied per GEC guidelines, with CLEME used for span extraction.
- Taxonomy development: An initial pilot on ~200 essays led two senior educators to construct a 29-category, part-of-speech-based taxonomy, with top-down prioritization for multi-error scenarios.
- Error analysis: Each error span was independently labeled by two annotators for error type, severity, and a free-text explanation, with disagreements resolved by a third reviewer.
Annotation reliability on a held-out set yielded Cohen's κ ≈ 0.82 for error type, with strong agreement likewise reported for severity, indicating high inter-annotator reliability.
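As a point of reference, agreement between two annotators' type labels can be computed with Cohen's κ as implemented in scikit-learn; the label lists below are hypothetical, and the snippet is a minimal sketch rather than the benchmark's own adjudication tooling.

```python
# Minimal sketch: inter-annotator agreement on error-type labels.
# The label lists are hypothetical; FEANEL's adjudication pipeline is not reproduced.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Spelling", "Determiner", "Verb tense", "Preposition", "Spelling"]
annotator_b = ["Spelling", "Determiner", "Verb choice", "Preposition", "Spelling"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values around 0.8 and above indicate strong agreement
```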
2. Error Taxonomy and Severity Rubric
FEANEL’s taxonomy delineates 29 mutually exclusive error types organized into three linguistic strata:
- Single-word errors: Case, space, spelling, contraction.
- Inter-word (morpho-syntactic) errors: Determiner, number, preposition, auxiliary, adjective, adverb, noun number/possessive, general noun/pronoun, subject-verb agreement, nonfinite form, verb voice/tense/choice, part-of-speech confusion, conjunction, relative word, sentence structure, word order, word redundancy.
- Discourse-level errors: Punctuation, format, and sentence redundancy.
In multi-error spans, the highest-ranking label takes priority according to categorical precedence rules (one possible implementation is sketched below).
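One plausible way to implement such precedence is a rank lookup over candidate labels; the ordering shown here is illustrative and does not reproduce the paper's actual precedence list.

```python
# Illustrative sketch: choose the primary error type for a multi-error span.
# PRECEDENCE is a hypothetical ordering (highest priority first); FEANEL's
# actual categorical precedence rules may differ.
PRECEDENCE = [
    "Sentence structure", "Word order", "Verb tense", "Subject-verb agreement",
    "Determiner", "Preposition", "Spelling", "Case", "Space",
]
RANK = {label: i for i, label in enumerate(PRECEDENCE)}

def primary_label(candidates: list[str]) -> str:
    """Return the candidate with the highest precedence (lowest rank index)."""
    return min(candidates, key=lambda label: RANK.get(label, len(PRECEDENCE)))

print(primary_label(["Spelling", "Subject-verb agreement"]))  # -> Subject-verb agreement
```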
Severity is rated on a 1–5 scale reflecting communicative impact:
| Score | Descriptor | Description |
|---|---|---|
| 1 | Trivial | Minor slips; no effect on understanding |
| 2 | Minor | Simple subject–verb or tense errors; meaning intact |
| 3 | Moderate | Clause misuse; meaning slightly unclear |
| 4 | Serious | Multiple co-occurring errors; highly confusing |
| 5 | Extremely serious | Sentence unintelligible |
The mean absolute error (MAE) is employed for severity, calculated as

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\, s_i - \hat{s}_i \,\right|,$$

where $s_i$ is the ground-truth severity of the $i$-th error, $\hat{s}_i$ the model prediction, and $N$ the number of evaluated errors.
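As a toy illustration of this calculation (the scores below are hypothetical):

```python
# Toy illustration of the severity MAE; the ratings are hypothetical.
import numpy as np

gold = np.array([2, 3, 1, 4])       # expert severity ratings on the 1-5 scale
pred = np.array([2, 4, 1, 2])       # model-predicted severities
mae = np.mean(np.abs(gold - pred))  # (0 + 1 + 0 + 2) / 4 = 0.75
print(mae)
```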
3. Evaluation Protocol and Metrics
Sixteen state-of-the-art LLMs (including GPT-4o, Gemini-2.5-pro, DeepSeek-R1, Claude-3.7 variants, Grok-3, Qwen-3, Llama-3, and Mistral-Small) were assessed under two prompting regimes:
- Zero-shot-naive: Provides only the task prompt and the taxonomy's category names.
- One-shot-detailed: Adds full definitions of all 29 error types, rubric details, and a worked example.
Prompts required fixed output ordering: Severity → Type → Explanation, in a standardized JSON schema.
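The paper's exact prompt wording and schema are not reproduced here; the sketch below shows what a conforming response for a single error span might look like under the required Severity → Type → Explanation ordering, with hypothetical field names and wording.

```python
# Hypothetical example of a schema-conforming model output for one error span;
# field names and content are illustrative, not FEANEL's exact specification.
import json

response = {
    "severity": 2,
    "type": "Subject-verb agreement",
    "explanation": "The subject 'He' is third-person singular, so the verb "
                   "should be 'likes' rather than 'like'.",
}
print(json.dumps(response, indent=2))
```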
Core metrics included:
- Error classification: Accuracy (exact match of the predicted error type) and Macro-F1 (unweighted mean of per-type F1, which stresses rare categories); a computation sketch follows this list.
- Severity rating: MAE.
- Error explanation: BLEU, METEOR, and ROUGE-L, computed against expert-generated explanations.
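A minimal computation of these metrics with standard libraries might look as follows; the inputs are hypothetical, and the paper's exact tokenization and smoothing settings for the text-overlap metrics are not specified here.

```python
# Minimal metrics sketch on hypothetical predictions; this is not the
# benchmark's official evaluation script.
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error
from rouge_score import rouge_scorer  # pip install rouge-score

gold_types = ["Spelling", "Determiner", "Verb tense", "Preposition"]
pred_types = ["Spelling", "Determiner", "Verb choice", "Preposition"]
gold_sev, pred_sev = [1, 2, 3, 2], [1, 2, 2, 2]

print("Accuracy:", accuracy_score(gold_types, pred_types))
print("Macro-F1:", f1_score(gold_types, pred_types, average="macro"))
print("Severity MAE:", mean_absolute_error(gold_sev, pred_sev))

# ROUGE-L between an expert explanation and a model explanation.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score("Use 'likes' with a singular subject.",
                     "The singular subject requires 'likes'.")
print("ROUGE-L F1:", score["rougeL"].fmeasure)
```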
4. Experimental Findings
All evaluated LLMs underperformed human annotators (human reference: ≈80% Accuracy, ≈76% Macro-F1, BLEU ≈ 5.2), with notable sensitivity to prompt enrichment.
- Zero-shot-naive: Average Accuracy ≈63%, Macro-F1 ≈68%, MAE ≈0.80, BLEU ≈1.5, ROUGE-L ≈25.
- One-shot-detailed: Average Accuracy ≈67%, Macro-F1 ≈71%, MAE ≈0.86, BLEU ≈2.4, ROUGE-L ≈28.
Best results by subtask:
| Subtask | Best Model | Result (one-shot-detailed) |
|---|---|---|
| Classification | Gemini-2.5-pro | Accuracy 72%, Macro-F1 76% |
| Severity rating | Claude-3.7 | Lowest MAE in the one-shot setting |
| Explanation | Gemini-2.5-pro | BLEU 4.29, ROUGE-L 31.36 |
Elementary essays proved 2–6 points harder for classification, attributed to more compound/multi-span edits. Across all groups, Macro-F1 trailed simple accuracy by 10–15 points, highlighting persistent deficits in handling tail categories.
Analysis revealed strong model performance on frequent, localizable errors (Case, Space, Spelling >90% Accuracy), contrasted by poor detection on rarer or structurally complex types (Contraction, Number, POS Confusion, Sentence Structure, Format <40–50%). Order of subtask outputs modulated score trade-offs: producing explanations before type/severity labels benefited classification in zero-shot settings but reduced explanation quality, with the converse holding in one-shot.
A representative case illustrated three recurring failure modes:
- Assigning only the most salient label in multi-error scenarios, omitting secondary errors.
- Misapplication of nuanced taxonomy distinctions.
- Formatting inconsistencies, impairing automated downstream processing.
5. Key Limitations and Error Analysis
Despite notable advances, state-of-the-art LLMs exhibit several shortcomings on FEANEL:
- Consistently lagging human accuracy and F1, especially for discourse-level and rare error categories.
- Sensitivity to prompt completeness and output order, with fluctuating task interdependence.
- Frequent inability to provide concise, adaptive, or pedagogy-aware explanations.
- Failures in reliably distinguishing overlapping taxonomy categories in complex spans.
Structural misclassifications and inaccurate severity calibration undermine pedagogical utility, particularly when feedback is incorporated into classroom analytics or tutoring systems.
6. Pedagogical Implications and Prospects for Advancement
Findings emphasize that effective automated feedback in language learning contexts requires:
- Rich, definition-inclusive prompts; taxonomy names alone are inadequate for eliciting fine-grained LLM performance in educational tasks.
- Enhanced explanation capabilities that offer both precision and adaptability, suggesting a need for systems embracing learner-aware, iterative feedback mechanisms.
- Strict fidelity to standardized taxonomy and formatting for interoperability with digital education pipelines.
Prospective research directions articulated include:
- Extending FEANEL’s coverage to encompass broader age groups, diverse genres, and additional languages (development of multilingual error taxonomies).
- Advancing hybrid and alignment strategies for addressing rare class detection, such as targeted fine-tuning or symbolic-LLM integration.
- Implementing human-in-the-loop evaluations focusing on subjective feedback assessment, readability, and measurable learning gains beyond automated metrics.
- Exploring multimodal learning scenarios and adaptive scaffolding, wherein model outputs evolve responsively to student revision cycles.
By delivering a comprehensive dataset, a fine-grained error taxonomy, and rigorous benchmark settings, FEANEL establishes a foundation for research targeting the current shortcomings of LLMs in educational feedback, facilitating progress toward contextually aware, effective writing support for K–12 English learners (Ye et al., 28 Nov 2025).