AES: Automated Essay Scoring System

Updated 23 December 2025
  • Automatic Essay Scoring (AES) is a machine learning system that automates essay evaluation by replicating human grading rubrics.
  • It employs transformer-based architectures like BERTimbau to process large-scale text inputs without manual feature engineering.
  • AES reduces grading bias and turnaround times in high-stakes assessments by ensuring consistent, scalable, and objective scoring.

Automatic Essay Scoring (AES) is a machine learning paradigm designed to automate the evaluation of student essays, aligning as closely as possible with human grading criteria and optimizing for consistency, efficiency, and scalability in large-scale assessment contexts. AES addresses critical challenges in educational systems, including the logistical and financial strain of human grading, inter-rater variability, and slow result delivery. In the context of Brazilian Portuguese and the Exame Nacional do Ensino Médio (ENEM), AES targets the efficient, reliable, and scalable grading of millions of essays across multiple analytic dimensions (Matsuoka, 2023).

1. Problem Definition and Motivating Constraints

The ENEM’s written component is a substantial assessment challenge, requiring each candidate’s essay to be independently evaluated by trained human graders on five analytic axes, or “competencies.” The manual process inflicts considerable logistical and financial burdens by necessitating a national network of well-coordinated annotators. Furthermore, inter-rater disagreement and subjective variation undermine fairness and consistency, potentially impacting student futures. Protracted turnaround times result from manual throughput bottlenecks. AES, by directly emulating the grading rubric and decision structure of human experts, is positioned to deliver significant cost reductions, increased fairness by reducing idiosyncratic bias, and drastically faster response times (Matsuoka, 2023).

2. Dataset and Preprocessing Pipeline

The core resource for Portuguese-language AES is the publicly released Essay-BR corpus, containing 6,577 student essays written in response to 151 authentic ENEM prompts and annotated per the official five-competence ENEM rubric (Matsuoka, 2023). Each essay receives granular scores (0–200, in increments of 40) per competency. The score distribution is skewed toward mid-to-high grades, with modal scores typically in the range of 120–160 points. Preprocessing proceeds as follows:

  • Prompts are concatenated with essay text using a <SEP> marker to emphasize topic relevance.
  • Tokenization employs the BERTimbau WordPiece tokenizer, generating input IDs, attention masks, and token-type IDs.
  • No manual feature engineering (syntactic, POS, or error-based) is performed; the model relies exclusively on transformer-derived representations.

This preprocessing aligns the input format with BERTimbau’s pretraining setup, maximizing the potential for transfer of contextual and syntactic representations (Matsuoka, 2023).
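A minimal sketch of this tokenization step, assuming the Hugging Face transformers library and the public neuralmind/bert-base-portuguese-cased BERTimbau checkpoint; the truncation and padding settings are illustrative, and the sentence-pair encoding shown here realizes the prompt/essay separator with BERT’s standard [SEP] token:

```python
# Illustrative preprocessing sketch; settings are assumptions, not the paper's exact pipeline.
from transformers import AutoTokenizer

# BERTimbau Base checkpoint on the Hugging Face Hub (assumed here).
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

def encode_example(prompt: str, essay: str, max_length: int = 512):
    """Encode prompt and essay as a sentence pair so the tokenizer inserts the
    separator token and produces input IDs, attention mask, and token-type IDs."""
    return tokenizer(
        prompt,
        essay,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )

features = encode_example("Tema da redação...", "Texto do estudante...")
# features["input_ids"], features["attention_mask"], features["token_type_ids"]
```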

3. Neural Model Architecture and Loss Function

The principal scoring model—BERT_ENEM_Regression—extends the 12-layer, 110M-parameter BERTimbau Base architecture. Input essays are processed through the standard transformer pipeline:

  • Input embeddings (including special tokens) are propagated through 12 transformer blocks.
  • Dropout (p=0.3) regularizes the output.
  • A final dense layer of shape (768, 5) projects the pooled embeddings to five output heads corresponding to competencies C1–C5.

There is no explicit engineering of hand-crafted or surface features; all aspects of argumentative quality, grammar, organization, and topic relevance are modeled via BERT’s deep contextual representations. The model is trained for five epochs (batch size 16, AdamW optimizer) under a standard mean squared error (MSE) regression objective summed across the five competence outputs:

$$\text{Loss}_\text{MSE} = \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^5 \left(y_{i,k} - \hat y_{i,k}\right)^2$$

where $y_{i,k}$ is the gold score and $\hat y_{i,k}$ is the predicted score for competence $k$ of essay $i$. Weight decay via AdamW imposes $\lambda\|\theta\|^2$ regularization (Matsuoka, 2023).
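The architecture and training objective above can be sketched as follows; this is an illustrative reconstruction rather than the author’s code, and the learning rate and weight-decay coefficient are assumed values not stated in the text:

```python
# Sketch of BERT_ENEM_Regression: BERTimbau Base + dropout + a 5-way regression head.
import torch
from torch import nn
from transformers import AutoModel

class BertEnemRegression(nn.Module):
    def __init__(self, model_name: str = "neuralmind/bert-base-portuguese-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)  # 12-layer, 110M-parameter encoder
        self.dropout = nn.Dropout(p=0.3)                   # regularization as described
        self.head = nn.Linear(768, 5)                      # one output per competence C1-C5

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        pooled = out.pooler_output                         # pooled [CLS] representation
        return self.head(self.dropout(pooled))             # shape: (batch, 5)

model = BertEnemRegression()
# lr and weight_decay are assumptions; the paper specifies AdamW, 5 epochs, batch size 16.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# nn.MSELoss averages over the batch and the five outputs, which is proportional
# to the summed-MSE objective given above.
loss_fn = nn.MSELoss()
```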

4. Evaluation Metrics and Performance Benchmarks

System evaluation employs two primary metrics:

  • Quadratic Weighted Kappa (QWK): Quantifies agreement between the model and human raters, penalizing divergences in proportion to their squared distance on the rating scale. Formally, $\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, o_{i,j}}{\sum_{i,j} w_{i,j}\, e_{i,j}}$ with $w_{i,j} = \frac{(i-j)^2}{(K-1)^2}$, where $o_{i,j}$ and $e_{i,j}$ are the observed and expected co-occurrence counts for score classes $i, j$ and $K$ is the number of score classes.
  • Root Mean Squared Error (RMSE): The square root of the average squared error across all predicted competence scores (a minimal computation sketch for both metrics follows this list).
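A minimal computation sketch for both metrics, assuming NumPy and scikit-learn; mapping the 0–200 scores (in steps of 40) to six ordinal classes is one reasonable convention and may differ from the paper’s evaluation script:

```python
# Illustrative metric computation; not the paper's evaluation code.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def to_classes(scores):
    """Map ENEM competence scores (0, 40, ..., 200) to ordinal classes 0..5."""
    return (np.asarray(scores) // 40).astype(int)

def qwk(y_true, y_pred):
    """Quadratic Weighted Kappa between gold and predicted score classes."""
    return cohen_kappa_score(to_classes(y_true), to_classes(y_pred), weights="quadratic")

def rmse(y_true, y_pred):
    """Root mean squared error over predicted competence scores."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Toy example for a single competence:
gold = [120, 160, 80, 200]
pred = [120, 120, 80, 160]
print(qwk(gold, pred), rmse(gold, pred))
```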

BERT_ENEM_Regression achieves QWK scores of 0.74 (C1), 0.78 (C2), 0.76 (C3), 0.84 (C4), and 0.79 (C5), with an overall QWK of 0.79. The corresponding per-competence RMSE ranges from 21.77 to 34.03, with an aggregate RMSE of 90.96. Compared to previous Portuguese AES systems (e.g., Amorim & Veloso 2017, Fonseca et al. 2018), this represents approximately a 0.26 absolute QWK increase and an almost 50% reduction in RMSE. Human–human QWK on similar scoring tasks is typically 0.80–0.85, indicating that the system approaches professional-level consistency (Matsuoka, 2023).

5. Error Analysis and Competency Coverage

Analysis by competency reveals that C2–C5, which span discourse organization and argumentation, are modeled with superior accuracy, likely owing to the transformer’s context modeling and the alignment of BERT’s pretraining objectives with argumentative structure detection. Competence 1 (formal written norm) performance is slightly lower, probably because the granularity of subword tokenization can mask fine-grained orthographic or syntactic errors. The skew of the training corpus toward mid- and high-range scores biases the model’s outputs away from accurate discrimination of poorly performing essays (Matsuoka, 2023).

Automated scoring confers greater uniformity, reducing variance introduced by transient grader effects, fatigue, or bias—a property that enhances scalability and fairness of high-stakes examination systems.

6. Limitations and Prospective Directions

Key limitations identified by the authors include:

  • Data Imbalance: Underrepresentation of low-scoring essays impedes bottom-end discrimination.
  • Grammatical Error Sensitivity: BERT’s feature learning, in the absence of explicit error signals, may overlook subtle syntactic mistakes.

Future work should seek to:

  • Augment datasets for balanced competence distributions.
  • Integrate explicit error-detection (e.g., grammar-checker submodules) into end-to-end architectures.
  • Investigate ordinal regression and other loss formulations that reflect the discrete nature of the scoring scale (one illustrative formulation is sketched after this list).
  • Explore multilingual or bilingual AES frameworks, broadening applicability beyond the Portuguese ENEM context (Matsuoka, 2023).
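As one illustration of the ordinal direction (a sketch under assumptions, not a method from the paper), the six discrete score classes per competence could be encoded as cumulative binary targets in the style of CORAL-type ordinal regression and trained with binary cross-entropy:

```python
# Hypothetical ordinal encoding sketch; the paper does not prescribe this formulation.
import torch

def to_cumulative_targets(score_classes: torch.Tensor, num_classes: int = 6) -> torch.Tensor:
    """Map integer classes 0..5 of shape (batch,) to (batch, num_classes - 1) targets,
    where target[:, j] = 1 if the class exceeds threshold j."""
    thresholds = torch.arange(num_classes - 1)  # cut-points 0..4
    return (score_classes.unsqueeze(1) > thresholds).float()

# One logit per cut-point replaces the single regression output per competence.
logits = torch.randn(4, 5)
targets = to_cumulative_targets(torch.tensor([0, 2, 3, 5]))
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
```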

7. Implications for Large-Scale Assessment and Educational Policy

BERT-based AES architectures tuned for high-stakes, multi-competence domains such as the Brazilian ENEM represent a scalable, transparent, and cost-effective alternative to manual marking. By matching or approaching human raters in consistency and reducing inherent subjective bias, AES supports both efficiency and equity in national-level educational evaluation. The system’s reliance on open-source data and straightforward transformer adaptation makes it readily transferable, establishing a template for future deployments in similarly scaled academic settings (Matsuoka, 2023).
