
Automated Essay Scoring (AES)

Updated 21 February 2026
  • Automated Essay Scoring (AES) is a computational method that evaluates and grades essays using statistical, machine learning, and deep learning techniques.
  • AES has evolved from rule-based and feature-engineered systems to advanced transformer architectures that automatically extract semantic and structural features.
  • Recent AES systems emphasize score alignment, fairness, and adaptive feedback generation by leveraging multimodal data and hybrid modeling strategies.

Automated Essay Scoring (AES) is the application of computational techniques—statistical, machine learning, and more recently deep learning—to evaluate and assign scores to essays written in natural language. AES frameworks aim to match the reliability and validity of human raters, providing scalability, consistency, and operational efficiency in large-scale educational assessments, language proficiency testing, and formative classroom feedback.

1. Conceptual Foundations and Evolution

AES research originated with rule-based and feature-engineered systems focusing on surface metrics such as essay length, word frequency, and grammar error rate. ETS's e-rater and Vantage Learning's IntelliMetric exemplify early proprietary AES engines. These systems leveraged manually engineered linguistic, syntactic, and discourse features predictive of writing quality, often within regression or classification frameworks such as SVMs and Random Forests (Vajjala, 2016; Nigam, 2017).

The past decade has seen a transition towards deep learning: models leveraging CNNs or RNNs extract semantic and sequential features automatically, reducing the need for manual feature design (Tashu et al., 2022; Chakravarty, 17 Aug 2025). Transformer architectures now dominate state-of-the-art AES, exhibiting strong performance, efficient scaling, and adaptability across multiple languages and rating rubrics (Ormerod et al., 2021; Ludwig et al., 2021; Matsuoka, 2023).

2. Core Methodologies and Architectures

AES systems can be characterized along two primary axes: feature-engineered versus end-to-end deep learning, and holistic versus multi-dimensional (trait-level) scoring.

Feature-Engineered Models: These extract features such as lexical diversity (type–token ratio (TTR), measure of textual lexical diversity (MTLD)), syntactic complexity (clause length, nominal structures), discourse coherence (entity-grid transitions, coreference chains), and error rates (spelling/grammar density) (Vajjala, 2016; Nigam, 2017). These are fed into linear regressors, SVMs, or tree-based models (Marinho et al., 2021).
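As a concrete illustration, the sketch below builds a tiny feature-engineered scorer: a handful of hand-computed surface features feed a support-vector regressor. The specific features, toy data, and 1–6 scale are illustrative assumptions, not the feature set of any cited system.

```python
# Minimal feature-engineered AES sketch: hand-crafted surface features -> SVR.
# The features and training data here are illustrative placeholders.
import re
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def surface_features(essay: str) -> list[float]:
    tokens = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_tokens = max(len(tokens), 1)
    ttr = len(set(tokens)) / n_tokens                 # lexical diversity (TTR)
    avg_sent_len = n_tokens / max(len(sentences), 1)  # crude complexity proxy
    long_word_rate = sum(len(t) > 6 for t in tokens) / n_tokens
    return [float(len(tokens)), ttr, avg_sent_len, long_word_rate]

# Toy training set: (essay, holistic score on a 1-6 scale).
train = [
    ("The cat sat. It was nice.", 2.0),
    ("Renewable energy adoption, while costly upfront, reduces long-term "
     "emissions and insulates economies from fuel price volatility.", 5.0),
]
X = [surface_features(e) for e, _ in train]
y = [s for _, s in train]

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X, y)
print(model.predict([surface_features("Solar power helps the planet.")]))
```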

Deep and Transformer-based Models:

  • CNN-RNN Fusion: CNNs extract n-gram or local semantic features, which are then contextualized by RNNs or bidirectional GRUs (BiGRUs) to capture long-range dependencies (Tashu et al., 2022).
  • Transformers: Self-attention mechanisms enable global receptive fields, crucial for modeling discourse-level structure and prompt relevance. Variants like Longformer, RoBERTa, and BERTimbau (for Portuguese) have been successfully fine-tuned for AES tasks (Chakravarty, 17 Aug 2025; Matsuoka, 2023); a minimal fine-tuning sketch follows this list.
  • Multitask/Multi-head Models: Systems now often predict per-dimension/trait scores (e.g., cohesion, grammar, vocabulary, argumentation) via joint heads, sometimes using joint loss formulations combining cross-entropy, regression, and contrastive constraints (Sun et al., 2024; Do et al., 2023).
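To make the transformer route concrete, here is a minimal fine-tuning sketch using the Hugging Face transformers API: an encoder with a single regression head trained against continuous holistic scores. The model choice (roberta-base), data, and hyperparameters are illustrative.

```python
# Minimal sketch: fine-tune a transformer encoder for holistic AES as regression.
# With num_labels=1 and float labels, the Hugging Face sequence-classification
# head computes an MSE loss, so the model predicts a continuous score.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

essays = ["First toy essay about renewable energy...", "Second toy essay..."]
scores = torch.tensor([[5.0], [3.0]])  # holistic gold scores as floats

batch = tokenizer(essays, padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # toy loop; real training iterates over many batches
    out = model(**batch, labels=scores)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.squeeze(-1)  # continuous predictions
print(preds)  # typically clipped/rounded back onto the rubric scale
```

Trait-level variants replace the single head with one regression output per rubric dimension and sum the per-trait losses.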

Score Alignment and Calibration: Post-hoc score alignment improves congruence between predicted and true score distributions, especially valuable when models systematically over- or under-predict extremes (Choi et al., 2 Feb 2026).
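One simple post-hoc alignment scheme is quantile mapping: remap predictions so their empirical distribution matches the human score distribution on a reference set. The sketch below illustrates that general idea; it is an assumption, not necessarily the specific alignment technique of Choi et al. (2 Feb 2026).

```python
# Hedged sketch of post-hoc score alignment via quantile mapping: predicted
# scores are remapped so their empirical CDF matches the human score
# distribution on a reference set.
import numpy as np

def quantile_align(preds: np.ndarray, reference_scores: np.ndarray) -> np.ndarray:
    # Rank each prediction among all predictions (empirical CDF value)...
    cdf = (np.argsort(np.argsort(preds)) + 0.5) / len(preds)
    # ...then read off the human score at the same quantile.
    return np.quantile(reference_scores, cdf)

rng = np.random.default_rng(0)
human = rng.integers(1, 7, size=500).astype(float)           # rubric scale 1-6
preds = 3.5 + 0.6 * (human - 3.5) + rng.normal(0, 0.4, 500)  # compressed extremes

aligned = quantile_align(preds, human)
print(preds.min(), preds.max())      # under-covers the 1-6 range
print(aligned.min(), aligned.max())  # spans the full human range
```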

Hybrid and Multi-Agent Approaches: Methods such as the CAFES framework orchestrate multiple agents (initial scorer, feedback pool manager, reflective scorer) using multimodal LLMs (MLLMs) to combine fast trait-level assessments, evidence-based feedback, and iterative refinement, with empirical gains in QWK and human alignment (Su et al., 20 May 2025).
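A hedged sketch of such an agent loop appears below; the roles mirror the description above, but call_mllm() is a hypothetical stand-in for a real MLLM client, and the prompts and aggregation do not reproduce CAFES itself.

```python
# Hedged sketch of a CAFES-style multi-agent scoring loop (initial scorer ->
# feedback pool manager -> reflective scorer). call_mllm() is a hypothetical
# placeholder for a real multimodal LLM API.
from dataclasses import dataclass, field

def call_mllm(prompt: str) -> str:
    raise NotImplementedError("plug in your multimodal LLM client here")

@dataclass
class FeedbackPool:
    items: list[str] = field(default_factory=list)

    def add(self, feedback: str) -> None:
        self.items.append(feedback)

    def summary(self) -> str:
        return "\n".join(self.items)

def cafes_style_score(essay: str, traits: list[str]) -> str:
    pool = FeedbackPool()
    # 1) Initial scorer: fast trait-level assessment.
    initial = call_mllm(f"Score these traits {traits} for the essay:\n{essay}")
    # 2) Feedback pool manager: collect evidence-grounded critiques per trait.
    for trait in traits:
        pool.add(call_mllm(f"Give evidence-based feedback on '{trait}':\n{essay}"))
    # 3) Reflective scorer: revise the initial scores in light of the feedback.
    return call_mllm(
        f"Initial scores:\n{initial}\nFeedback:\n{pool.summary()}\n"
        "Revise the trait scores and justify each change."
    )
```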

Cross-Prompt and Generalizability: Recent advances target robustness to novel prompts by incorporating prompt-aware encoders, topic coherence features, and trait-similarity losses, supporting zero-shot or few-shot transfer (Do et al., 2023; Azurmendi et al., 9 Dec 2025).
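As one illustrative construction of a trait-similarity objective (an assumption here, not the exact loss of Do et al., 2023), an auxiliary term can push the correlation structure of the model's predicted trait scores toward the correlations observed in human ratings:

```python
# Hedged sketch of a trait-similarity auxiliary loss: match the model's
# predicted-trait correlation matrix to the correlations in human trait
# ratings. Illustrative construction, not the exact loss of Do et al. (2023).
import torch

def corr_matrix(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, n_traits) -> (n_traits, n_traits) Pearson correlations
    x = (x - x.mean(0)) / (x.std(0) + 1e-8)
    return (x.T @ x) / (x.shape[0] - 1)

def trait_similarity_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    return torch.mean((corr_matrix(pred) - corr_matrix(gold)) ** 2)

pred = torch.randn(32, 4, requires_grad=True)  # 4 trait predictions per essay
gold = torch.randn(32, 4)                      # human trait scores
loss = torch.nn.functional.mse_loss(pred, gold) + 0.1 * trait_similarity_loss(pred, gold)
loss.backward()
```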

3. Datasets, Annotation Schemes, and Multilingual Adaptation

Benchmark AES corpora range from English-centric datasets (ASAP, PERSUADE) to recent resources in Portuguese (Essay-BR, ENEM), Basque (HABE C1), and Arabic (AR-AES, plus a synthetic CEFR-annotated corpus), among others (Marinho et al., 2021; Matsuoka, 2023; Azurmendi et al., 9 Dec 2025; Ghazawi et al., 2024; Qwaider et al., 22 Mar 2025).

Key annotation dimensions include:

  • Holistic Scores: A single ordinal score for overall essay quality.
  • Analytic/Dimension-Level Scores: Multiple trait scores per essay, reflecting rubric dimensions such as grammar, coherence, argumentation, and task relevance (Su et al., 17 Feb 2025; Sun et al., 2024).
  • Multimodal Annotations: EssayJudge introduces visual prompts (charts, diagrams) to support research on multimodal AES (Su et al., 17 Feb 2025).

Human annotation protocols generally rely on expert raters, with double marking, inter-rater agreement analysis (Cohen's κ, QWK), and consensus arbitration for large disagreements (Marinho et al., 2021; Su et al., 17 Feb 2025).

Multilingual AES research introduces challenges such as feature sparsity (arising from rich inflectional morphology), flexible word order, the prevalence of passive constructions, and scoring scale normalization. Transformer-based encoders pretrained on language-specific corpora (e.g., BERTimbau for Portuguese, CAMeLBERT-MSA for Arabic) enable direct modeling without translation or hand-engineered adaptation (Matsuoka, 2023; Qwaider et al., 22 Mar 2025).

4. Evaluation Metrics and Empirical Results

The dominant evaluation metric in AES is Quadratic Weighted Kappa (QWK), which robustly summarizes agreement between model and human scores while penalizing large errors (Yang et al., 2024; Marinho et al., 2021).

Common metrics:

  • QWK: $\kappa = 1 - \frac{\sum_{i,j} w_{ij}\,O_{ij}}{\sum_{i,j} w_{ij}\,E_{ij}}$, with quadratic weights $w_{ij}$, where $O$ is the observed and $E$ the expected (chance) score-pair matrix.
  • Pearson's r: linear correlation coefficient between model and gold scores.
  • F1: harmonic mean of precision and recall (classification setting).
  • MAE/RMSE: mean absolute or root mean squared error against gold scores.
  • Demographic parity / equalized odds: fairness measures based on subgroup error statistics.
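QWK can be computed directly with scikit-learn's weighted Cohen's kappa; the scores below are toy values on a 1–6 rubric.

```python
# Quadratic Weighted Kappa via scikit-learn's weighted Cohen's kappa.
# Toy integer scores; real evaluations use held-out essays.
from sklearn.metrics import cohen_kappa_score

human = [2, 3, 4, 4, 5, 6, 3, 2, 5, 4]
model = [2, 3, 4, 5, 5, 5, 3, 3, 5, 4]

qwk = cohen_kappa_score(human, model, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 = perfect agreement, 0 = chance-level
```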

Empirical QWKs exceeding 0.80 are typical for state-of-the-art transformer or CNN-GRU models scored holistically or on trait dimensions, with human-level agreement approaching 0.85–0.90 on domain-constrained tasks (Sun et al., 2024; Matsuoka, 2023; Tashu et al., 2022; Su et al., 20 May 2025). Error analysis reveals trait-level and prompt-specific weaknesses, particularly in discourse coherence, argument clarity, and generalizability across unseen prompts or genres (Su et al., 17 Feb 2025; Azurmendi et al., 9 Dec 2025).

Recent studies emphasize not only accuracy but also fairness—with certain models (notably prompt-specific transformers) exhibiting demographic bias amplification, especially regarding economic status, while cross-prompt and feature-engineered SVMs often yield fairer predictions (Yang et al., 2024).

5. Feedback Generation and Interpretability

Modern AES systems are increasingly designed to produce actionable formative feedback beyond scalar scores. Techniques include:

  • Intermediate Annotation: Automated tagging of argumentative components (claims, evidence, rebuttals) and error types (grammar, spelling, cohesion) improves both scoring performance and feedback quality (Ormerod, 28 May 2025).
  • Explanatory Output: Multi-agent systems (e.g., CAFES) yield explicit per-trait feedback rationales supporting human alignment (Su et al., 20 May 2025).
  • Pedagogical Evaluation: Fine-tuned generative LMs (e.g., Latxa 70B for Basque) supply criterion-based feedback and error exemplars, with fidelity and extraction accuracy validated by experts (Azurmendi et al., 9 Dec 2025).

Empirical studies confirm that annotative augmentations improve QWK and exact agreement rates, and support individualized interventions in language learning (Ormerod, 28 May 2025; Azurmendi et al., 9 Dec 2025).

6. Frontiers: Multimodality, Context Enrichment, and Robustness

AES has moved beyond text-only settings. Recent frameworks handle multimodal prompts (images, diagrams) using MLLMs, zero-shot rubric prompting, and source grounding (Su et al., 17 Feb 2025; Su et al., 20 May 2025). Contextual enrichment—incorporating margin-ranking losses, prompt and document structure, and discourse markers as token-level cues—improves ordinal discrimination and generalization without architectural changes (Chakravarty, 17 Aug 2025; Hardy, 2021).
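The margin-ranking idea maps directly onto PyTorch's built-in MarginRankingLoss: for a pair of essays with different gold scores, the better essay's predicted score must exceed the worse one's by a margin. The pairing strategy and margin value below are illustrative assumptions.

```python
# Sketch of a margin-ranking auxiliary objective for AES: the better essay in
# each pair must be scored at least `margin` higher than the worse one.
import torch

ranker = torch.nn.MarginRankingLoss(margin=0.5)

pred_a = torch.tensor([4.2, 3.1], requires_grad=True)  # model scores, essays A
pred_b = torch.tensor([3.9, 3.6], requires_grad=True)  # model scores, essays B
# target = +1 where gold(A) > gold(B), -1 where gold(A) < gold(B)
target = torch.tensor([1.0, -1.0])

loss = ranker(pred_a, pred_b, target)
loss.backward()  # combined in practice with the main regression loss
print(loss.item())
```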

Robustness to adversarial samples (sentence permutation, prompt irrelevance) is substantially improved by blending deep encoders with explicit coherence and relevance scorers, hybridizing neural and feature-based approaches (Liu et al., 2019). Synthetic data generation and controlled error injection, especially in low-resource languages, have proven effective for data augmentation and scaling (Qwaider et al., 22 Mar 2025).
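A hedged sketch of controlled error injection: corrupt a clean essay at a target rate with simple character-level errors (deletion, transposition, duplication) to synthesize lower-scoring variants. The operations and rates are illustrative, not those of Qwaider et al. (22 Mar 2025).

```python
# Hedged sketch of controlled error injection for AES data augmentation.
import random

def inject_errors(text: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(["drop", "swap", "double"])
            if op == "drop":
                i += 1  # delete the character
                continue
            if op == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                out += [chars[i + 1], chars[i]]  # transpose adjacent letters
                i += 2
                continue
            out += [chars[i], chars[i]]  # duplicate the character
            i += 1
            continue
        out.append(chars[i])
        i += 1
    return "".join(out)

clean = "Renewable energy reduces emissions and stabilises costs."
print(inject_errors(clean, rate=0.15))  # degraded variant for a lower score band
```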

7. Limitations, Fairness, and Emerging Directions

Key limitations in AES remain:

  • Discourse-level Weakness: Even leading MLLMs lag human raters in capturing argument structure, in scoring reliably across essay lengths, and in cross-modal reasoning (Su et al., 17 Feb 2025).
  • Fairness and Bias: Prompt-specific models can encode demographic bias present in training data. Explicit fairness diagnostics and, where necessary, in-processing or post-processing bias mitigation are essential for deployment in high-stakes contexts (Yang et al., 2024).
  • Generalizability: Cross-prompt transfer lags prompt-specific scoring in QWK but often improves fairness. Recent advances in attention mechanisms, trait-similarity losses, and trait-adaptive fine-tuning offer scalable solutions (Do et al., 2023).

Future research is directed toward:

  • Better cross-modal and discourse modeling (integrating graph neural networks, argument mining)
  • Explainable, criterion-aligned feedback generation
  • Wider multilingual coverage and adaptive scoring rubrics
  • Human–AI collaboration interfaces to calibrate, review, and trust model outputs at scale.

References

  • (Vajjala, 2016) Vajjala, S. (2016). Automated assessment of non-native learner essays: Investigating the role of linguistic features.
  • (Nigam, 2017) Nigam, A. (2017). Exploring Automated Essay Scoring for Nonnative English Speakers.
  • (Liu et al., 2019) Xia, F. et al. (2019). Automated Essay Scoring based on Two-Stage Learning.
  • (Ormerod et al., 2021) Farag, Y. et al. (2021). Automated essay scoring using efficient transformer-based LLMs.
  • (Marinho et al., 2021) Anchieta, R. et al. (2021). Essay-BR: a Brazilian Corpus of Essays.
  • (Ludwig et al., 2021) Ludwig, L. et al. (2021). Automated Essay Scoring Using Transformer Models.
  • (Hardy, 2021) West-Torrance, B. et al. (2021). Toward Educator-focused Automated Scoring Systems for Reading and Writing.
  • (Tashu et al., 2022) Parikh, G. et al. (2022). Deep Learning Architecture for Automatic Essay Scoring.
  • (Do et al., 2023) Lee, M. et al. (2023). Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring.
  • (Matsuoka, 2023) Souza, F. et al. (2023). Automatic Essay Scoring in a Brazilian Scenario.
  • (Yang et al., 2024) Yang, Y. et al. (2024). Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability.
  • (Sun et al., 2024) Ma, B. et al. (2024). Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression.
  • (Ghazawi et al., 2024) Alharbi, M. et al. (2024). Automated essay scoring in Arabic: a dataset and analysis of a BERT-based system.
  • (Su et al., 17 Feb 2025) Chen, J. et al. (2025). EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal LLMs.
  • (Qwaider et al., 22 Mar 2025) Qwaider, M. et al. (2025). Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection.
  • (Su et al., 20 May 2025) Liu, Z. et al. (2025). CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring.
  • (Ormerod, 28 May 2025) Li, Z. et al. (2025). Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems.
  • (Chakravarty, 17 Aug 2025) Alves, J. et al. (2025). Empirical Analysis of the Effect of Context in the Task of Automated Essay Scoring in Transformer-Based Models.
  • (Azurmendi et al., 9 Dec 2025) Aritzeta, A. et al. (2025). Automatic Essay Scoring and Feedback Generation in Basque Language Learning.
  • (Choi et al., 2 Feb 2026) Wang, S. et al. (2026). Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning, Score Alignment, and Self-Training.