Cross-Prompt Essay Scoring
- Cross-Prompt Essay Scoring is an automated approach that generalizes evaluation across unseen prompts using invariant, prompt-agnostic features.
- It employs comparative judgment techniques, LLM-based analyses, and hybrid feature engineering to enhance robustness and fairness.
- Evaluated primarily via QWK alongside fairness metrics, the approach addresses domain shift while reducing bias in diverse educational contexts.
Cross-Prompt Essay Scoring refers to the automated evaluation of essays written in response to prompts not seen during training, aiming to generalize performance across diverse topics, genres, and rubrics without requiring new prompt-specific labels or model retraining. It addresses a central challenge in large-scale educational assessment: deploying scoring systems that maintain accuracy, robustness, and fairness across varied and novel writing tasks.
1. Theoretical Foundations and Motivation
Cross-prompt essay scoring was motivated by the impracticality of fine-tuning models for each new prompt, given the combinatorial diversity of essay topics and rubrics in real educational contexts. Foundational statistical models for relative judgment, such as the Thurstone and Bradley–Terry frameworks, have been adapted for this setting, modeling each essay with a latent quality parameter θ so that the probability that Essay A is judged superior to Essay B is P(A ≻ B) = exp(θ_A) / (exp(θ_A) + exp(θ_B)) (Kim et al., 2024). This comparative framework underlies modern approaches that eschew absolute, prompt-specific scoring in favor of prompt-agnostic, pairwise, or feature-based methods.
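The Bradley–Terry parameters can be estimated directly from pairwise outcomes. Below is a minimal sketch, assuming comparisons are already available as (winner, loser) index pairs; the function name, learning rate, and iteration count are illustrative choices, not taken from the cited work.

```python
import numpy as np

def fit_bradley_terry(pairs, n_essays, lr=0.05, n_iter=2000):
    """Estimate latent quality parameters theta from pairwise outcomes.

    pairs: list of (winner_idx, loser_idx) tuples, one per comparison.
    Returns theta such that P(i beats j) = sigmoid(theta[i] - theta[j]).
    """
    theta = np.zeros(n_essays)
    winners = np.array([w for w, _ in pairs])
    losers = np.array([l for _, l in pairs])
    for _ in range(n_iter):
        # Predicted win probability for each recorded comparison.
        p_win = 1.0 / (1.0 + np.exp(-(theta[winners] - theta[losers])))
        # Gradient of the pairwise log-likelihood: (1 - p) pushes the
        # observed winner up and the loser down by the same amount.
        grad = np.zeros(n_essays)
        np.add.at(grad, winners, 1.0 - p_win)
        np.add.at(grad, losers, -(1.0 - p_win))
        theta += lr * grad
        theta -= theta.mean()  # fix the mean at zero for identifiability
    return theta
```

Gradient ascent on the pairwise log-likelihood is only one standard estimator; minorization-maximization updates or a logistic-regression reformulation work equally well.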
The task's significance also connects to practical constraints: prompt-specific AES models are vulnerable to domain shift, are susceptible to demographic and rubric biases, and often generalize poorly—especially for under-resourced prompts, minority populations, or traits insufficiently represented during model development (Yang et al., 2024). Thus, research has focused on methods to extract, learn, or transfer features and inductive biases that are invariant to prompt, topic, or genre while preserving discriminative power for both holistic and analytic traits.
2. Core Methodological Approaches
Comparative Judgment (CJ)
CJ leverages pairwise, zero-shot essay comparisons using LLMs (notably GPT-4), prompting the model to select the superior essay without needing explicit trait scoring, calibration, or tuning to the prompt (Kim et al., 2024). After compiling all pairwise outcomes, a Bradley–Terry model is fit to estimate continuous essay quality scores, which are linearly transformed to rubric scales. This method is prompt-agnostic, cognitively aligned with human rater strategies, and shown to outperform direct rubric-based prompting in QWK (+18.9% for GPT-4) (Kim et al., 2024).
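Putting the pieces together, the CJ pipeline reduces to: collect LLM pairwise judgments, fit a Bradley–Terry model, and linearly rescale latent quality to the rubric. The sketch below assumes a caller-supplied compare(a, b) function standing in for the zero-shot LLM judgment; exhaustive pairing and min-max rescaling are simplifications of the procedure described in Kim et al. (2024).

```python
import itertools

def cj_score_essays(essays, compare, rubric_min=1, rubric_max=6):
    """CJ pipeline sketch: LLM pairwise judgments -> Bradley-Terry fit ->
    linear rescaling of latent quality onto the rubric range.

    compare(a, b) stands in for a zero-shot LLM call returning True when
    essay `a` is judged superior to essay `b` (hypothetical interface).
    """
    pairs = []
    for i, j in itertools.combinations(range(len(essays)), 2):
        winner, loser = (i, j) if compare(essays[i], essays[j]) else (j, i)
        pairs.append((winner, loser))
    theta = fit_bradley_terry(pairs, len(essays))  # sketch shown earlier
    lo, hi = theta.min(), theta.max()
    # Linear transform of the latent scale onto the rubric scale.
    return rubric_min + (theta - lo) / (hi - lo + 1e-9) * (rubric_max - rubric_min)
```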
Prompt-/Trait-invariant Feature Engineering
Early cross-prompt AES systems focused on replacing prompt-sensitive lexical and semantic features with syntactic patterns (e.g., POS n-grams), classic readability and complexity metrics, and prompt-agnostic linguistic features such as sentence length, clause density, and sentiment (Ridley et al., 2020, Cozma et al., 2018). Models such as PAES and SVM(Reduced) operate without access to target-prompt essays or labels, achieving robust generalization at the cost of a modest QWK drop (~9%) relative to prompt-specific fine-tuned models (Yang et al., 2024, Ridley et al., 2020).
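As an illustration of this feature family, the sketch below computes a few prompt-agnostic statistics (sentence length, lexical diversity, a crude readability estimate) with the standard library only; real systems such as PAES add POS n-gram, clause-density, and sentiment features on top.

```python
import re

def prompt_agnostic_features(essay: str) -> dict:
    """Illustrative prompt-agnostic features of the kind used by PAES-style
    systems; syllables are approximated by vowel groups."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay.lower())
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in words)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    return {
        "mean_sentence_length": n_words / n_sent,
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        # Flesch Reading Ease computed from the approximate syllable counts.
        "flesch_reading_ease": 206.835 - 1.015 * (n_words / n_sent)
                               - 84.6 * (syllables / n_words),
    }
```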
Prompt-aware and Multi-trait Deep Architectures
Recent neural approaches explicitly model essay–prompt interactions, enabling prompt-adherence measurement and trait-wise generalization:
- Prompt- and Trait Relation-aware Models: ProTACT encodes both prompt and essay via attention mechanisms, fusing them with topic coherence features generated via topic modeling (LDA) and enforcing trait-similarity constraints at the loss level (Do et al., 2023). This design directly encodes prompt text at inference, boosting prompt-relevant traits by up to +0.051 QWK.
- Grammar-aware Architectures: GAPS integrates original and grammar-corrected essay texts (via GEC models) in a dual-stream, knowledge-sharing architecture, focusing the model on syntactic, prompt-invariant dimensions (Do et al., 12 Feb 2025). It achieves notable improvements, particularly on the conventions and fluency traits and under low-resource prompts (QWK +0.02 on prompt 7).
- Trait-Specific Pipelines: TRATES uses LLMs for rubric-driven feature extraction (e.g., generating trait-specific questions and scoring them via prompt-injected LLMs), fuses these with generic and prompt-metadata features, and applies lightweight regressors per trait. It outperforms the previous SOTA by +0.020 QWK and establishes feature importance rankings, with LLM-derived trait features contributing most (–7.6 QWK under ablation) (Eltanbouly et al., 20 May 2025); a sketch of this per-trait recipe follows the list.
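A minimal sketch of the per-trait recipe referenced in the TRATES item above, assuming the LLM answers to rubric-derived questions have already been collected as numeric feature matrices; the function name and the choice of ridge regression are illustrative, not the authors' exact configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_trait_regressors(llm_trait_feats, generic_feats, trait_scores):
    """Fit one lightweight regressor per trait on concatenated features.

    llm_trait_feats: dict trait -> (n_essays, k) array of LLM answers to
        rubric-derived questions (assumed precomputed).
    generic_feats: (n_essays, m) array of prompt-agnostic features.
    trait_scores: dict trait -> (n_essays,) gold scores from source prompts.
    """
    models = {}
    for trait, scores in trait_scores.items():
        X = np.hstack([llm_trait_feats[trait], generic_feats])
        models[trait] = Ridge(alpha=1.0).fit(X, scores)
    return models
```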
Prompt- and Topic-aware Soft Prompting/Adversarial Alignment
Methods such as ATOP combine adversarial alignment (Topic Discriminator + Gradient Reversal Layer) and learnable shared/specific soft prompts to jointly elicit topic-invariant and topic-specific essay features (Zhang et al., 8 Aug 2025). Pseudo-labeling via neighbor-based classification on the target domain supports topic-sensitive adaptation, yielding further improvements over contrastive (PMAES) and meta-learned (PLAES) baselines (+3.8% QWK).
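The adversarial component rests on a standard gradient reversal layer (GRL): the topic discriminator learns to identify the source topic, while reversed gradients push the shared encoder toward topic-invariant representations. The PyTorch sketch below shows the generic mechanism, not ATOP's exact implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the
    backward pass, so the encoder is trained to fool the topic discriminator."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: topic_logits = topic_discriminator(grad_reverse(essay_repr))
# The scoring head consumes essay_repr directly; only the discriminator's
# gradient is reversed, encouraging topic-invariant essay representations.
```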
LLM-based Approaches and Hybrid Prompt Engineering
Zero-shot and few-shot scoring via instruction-tuned LLMs (e.g., ChatGPT, Claude, open-source Llama variants) are competitive on source-dependent tasks but underperform SOTA in generalization and trait stability; prompt engineering (e.g., supplying rubrics, worked examples, or chain-of-thought rationales) critically affects results (Mansour et al., 2024, Hou et al., 13 Feb 2025). Hybrid approaches inject explanatory, high-correlation linguistic features (e.g., unique word count, lemma count, hard words) as explicit tokens within the scoring prompt, leading to measurable gains (up to +0.038 QWK for Mistral-7B) even in out-of-domain scenarios (Hou et al., 13 Feb 2025).
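A hedged sketch of the feature-injection idea: compute a handful of high-correlation statistics and spell them out as explicit tokens ahead of the scoring instruction. The wording, the hard-word heuristic, and the omission of lemma counts (which would need a lemmatizer) are simplifications, not the cited papers' templates.

```python
def build_feature_augmented_prompt(essay: str, rubric: str) -> str:
    """Inject explicit linguistic statistics into an LLM scoring prompt."""
    words = essay.split()
    unique_words = {w.lower().strip(".,;:!?") for w in words}
    hard_words = [w for w in unique_words if len(w) >= 8]  # crude proxy
    return (
        f"Rubric:\n{rubric}\n\n"
        f"Essay statistics: {len(words)} tokens, {len(unique_words)} unique "
        f"words, {len(hard_words)} hard words.\n\n"
        f"Essay:\n{essay}\n\n"
        "Using the rubric and the statistics above, output a single integer score."
    )
```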
Activation Probing and Model Merging
Recent work has shifted toward exploiting the internal activations of LLMs as prompt-agnostic scoring features (Chi et al., 22 Dec 2025). Probing attention-head activations with linear regression recovers trait and holistic scores with high fidelity (QWK 0.654–0.709), with mid-layer heads encoding the most generalizable signals. Source-free adaptation via model merging, in which task vectors are learned for individually fine-tuned models and composed as linear combinations regularized by prior-encoded mutual information over score distributions, enables privacy-preserving, efficient adaptation to new prompts, outperforming joint training and other merging baselines (Lee et al., 24 May 2025).
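Conceptually, the probing step is a linear read-out over frozen activations. The sketch below assumes mean-pooled activations from a chosen mid-layer head have already been extracted into arrays; the ridge probe and rounding scheme are illustrative rather than the cited paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import cohen_kappa_score

def probe_activations(train_acts, train_scores, test_acts, test_scores):
    """Linear probe over frozen LLM activations (e.g., mean-pooled outputs
    of one mid-layer attention head) used as prompt-agnostic features."""
    probe = Ridge(alpha=1.0).fit(train_acts, train_scores)
    preds = np.rint(probe.predict(test_acts)).astype(int)
    qwk = cohen_kappa_score(test_scores, preds, weights="quadratic")
    return probe, qwk
```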
3. Evaluation Protocols and Metrics
Cross-prompt AES is evaluated almost exclusively via the Quadratic Weighted Kappa (QWK) metric, which measures chance-corrected agreement with human scores on ordinal scales and penalizes larger disagreements quadratically, making it comparatively robust to scale differences and class imbalance. Both prompt-wise and trait-wise QWK are standard, often averaged over multiple traits. Ancillary metrics include Mean Absolute Error (MAE), Pearson correlation, and fairness measures such as Overall Score Difference (OSD), Conditional Score Difference (CSD), and Mean Absolute Error Difference (MAED) across demographic subgroups (Yang et al., 2024).
Experimental designs uniformly employ leave-one-prompt-out (LOPO) or k-fold cross-validation over all prompts. High-stakes deployment requires not only high QWK but also demonstrable fairness, low bias across demographic and economic strata, and interpretable feature contributions or scoring rationales.
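A minimal sketch of the LOPO protocol with per-prompt QWK, assuming essays and gold scores are grouped by prompt and that train_fn/predict_fn wrap whatever cross-prompt model is under evaluation (an illustrative interface, not a standard one).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def lopo_evaluate(essays_by_prompt, scores_by_prompt, train_fn, predict_fn):
    """Leave-one-prompt-out: hold out each prompt entirely, train on the
    rest, and report per-prompt and mean QWK."""
    qwks = {}
    prompts = list(essays_by_prompt)
    for held_out in prompts:
        train_essays = [e for p in prompts if p != held_out
                        for e in essays_by_prompt[p]]
        train_scores = [s for p in prompts if p != held_out
                        for s in scores_by_prompt[p]]
        model = train_fn(train_essays, train_scores)
        preds = predict_fn(model, essays_by_prompt[held_out])
        qwks[held_out] = cohen_kappa_score(
            scores_by_prompt[held_out],
            np.rint(preds).astype(int),
            weights="quadratic",
        )
    return qwks, float(np.mean(list(qwks.values())))
```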
Notably, cross-prompt models exhibit a modest decrease in predictive accuracy relative to prompt-specific models (∼9% QWK gap), but yield lower demographic bias and higher generalizability across unseen prompts (Yang et al., 2024).
4. Key Results and Practical Insights
Empirical findings indicate:
- CJ via LLMs outperforms direct rubric-based prompting, achieving QWK up to 0.776 on fine-grained scales (CJ_F with GPT-4), without any prompt-specific tuning (Kim et al., 2024).
- ATOP’s adversarial prompt-tuning surpasses all prior meta-learned or pseudo-labeling methods, achieving multi-trait QWK = 0.594 (Zhang et al., 8 Aug 2025).
- PAES and SVM(Reduced) baselines maintain robust performance (QWK 0.686 and 0.744, respectively), remain stable even as score distributions vary, and exhibit smaller fairness gaps than more complex neural models (Ridley et al., 2020, Yang et al., 2024).
- Trait-focused models (TRATES, ProTACT, GAPS) deliver consistent improvement on granular or prompt-independent traits—especially conventions, sentence fluency, and organization—often closing trait-level QWK gaps by 0.03–0.06 points (Eltanbouly et al., 20 May 2025, Do et al., 2023, Do et al., 12 Feb 2025).
- Hybrid LLM + feature approaches, without retraining, yield significant benefits over pure instruction prompting in cross-prompt and cross-dataset scenarios (Hou et al., 13 Feb 2025).
- Novel source-free model-merging via Prior-encoded Information Maximization (PIM) matches or exceeds joint-train approaches, is privacy-compliant, and is more computationally efficient (Lee et al., 24 May 2025).
- Multi-agent, dialectical LLM systems (RES) exploit prompt-content and trait-specific rubric construction, achieving up to +34.86% QWK improvement over vanilla zero-shot LLM scoring across ASAP prompts (Jang et al., 18 Sep 2025).
A plausible implication is that the leading edge in cross-prompt AES has shifted from hand-engineered domain adaptation toward architectures that natively encode prompt context, exploit rich trait decompositions, or extract generalizable representations from the interiors of LLMs, sometimes surpassing the performance of explicit feature-based models.
5. Limitations, Bias, and Fairness Considerations
Despite methodological advances, fundamental challenges remain:
- Scalability: Techniques requiring pairwise comparisons or many LLM inferences (e.g., CJ, RES roundtables) are nontrivial to deploy at scale (Kim et al., 2024, Jang et al., 18 Sep 2025).
- Score Anchoring: Mapping latent or relative quality scores to absolute scales (e.g., via linear scaling of estimates or task vector merging) may introduce bias if source–target distributions are mismatched, particularly for prompts with skewed or extreme label distributions (Kim et al., 2024, Lee et al., 24 May 2025).
- Feature Relevance: Prompt-agnostic features, while general, risk discarding prompt-relevant signals essential for some traits (e.g., prompt adherence, content). Models addressing this via prompt-attention, topic-aware soft prompts, or rubric-question generation have seen the greatest empirical advances (Do et al., 2023, Zhang et al., 8 Aug 2025, Eltanbouly et al., 20 May 2025).
- Fairness: Cross-prompt AES often reduces demographic bias relative to prompt-specific models, but residual disparities by economic status or English-learner status persist (Yang et al., 2024). Model complexity can exacerbate overfitting to spurious prompt-induced patterns unless regularized with prompt-generalization mechanisms.
- Interpretability and Trust: Many high-performing LLM models lack interpretable rationales or trait-level decomposability. Recent advances in trait-driven, question-based, or activation-probed models offer improved transparency.
6. Future Directions and Open Problems
Key areas for future research include:
- Efficient Comparative Judgment by active sampling (Adaptive Comparative Judgment, ACJ) and hybrid human–AI ranking procedures to minimize the number of required pairwise comparisons (Kim et al., 2024).
- Unified Representations for Diverse Traits: Leveraging internal activations and head-specific “perspectives” within large LLMs (activation probing) to synthesize multi-trait, prompt-adaptive regressors with minimal fine-tuning (Chi et al., 22 Dec 2025).
- Composable Model Adaptation: Extending linear task-vector model merging (PIM) to non-linear or subspace methods, incorporating pseudo-labels or lightweight prompt-tuning for extreme domain shifts (Lee et al., 24 May 2025).
- Robustness and Calibration: Enhancing reliability under severe prompt, trait, or demographic distributional changes, possibly by integrating uncertainty measures or human-in-the-loop override protocols (Yang et al., 2024, Do et al., 2024).
- Scalable Inference and Open-Source LLMs: Continued advancement of open-source, instruction-tuned models capable of robust cross-prompt performance without proprietary API dependence (Hou et al., 13 Feb 2025, Chi et al., 22 Dec 2025).
- Feedback Generation and Explanations: Extending models to not only output reliable scores but also generate actionable, trait-specific feedback, closely aligned with human rater explanations (Mansour et al., 2024, Jang et al., 18 Sep 2025).
A plausible future convergence is a modular framework combining activation-probed LLMs, learned trait and prompt adapters, and efficient comparative or hybrid scoring strategies to balance generalizability, interpretability, and operational cost across high-diversity educational settings.
References
- "Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition" (Kim et al., 2024)
- "Adversarial Topic-aware Prompt-tuning for Cross-topic Automated Essay Scoring" (Zhang et al., 8 Aug 2025)
- "TRATES: Trait-Specific Rubric-Assisted Cross-Prompt Essay Scoring" (Eltanbouly et al., 20 May 2025)
- "Activate as Features: Probing LLMs for Generalizable Essay Scoring Representations" (Chi et al., 22 Dec 2025)
- "Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring" (Ridley et al., 2020)
- "Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring" (Do et al., 12 Feb 2025)
- "Can LLMs Automatically Score Proficiency of Written Essays?" (Mansour et al., 2024)
- "Improve LLM-based Automatic Essay Scoring with Linguistic Features" (Hou et al., 13 Feb 2025)
- "Composable Cross-prompt Essay Scoring by Merging Models" (Lee et al., 24 May 2025)
- "Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability" (Yang et al., 2024)
- "Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring" (Do et al., 2023)
- "Automated essay scoring with string kernels and word embeddings" (Cozma et al., 2018)
- "Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards" (Do et al., 2024)
- "LLM Agents at the Roundtable: A Multi-Perspective and Dialectical Reasoning Framework for Essay Scoring" (Jang et al., 18 Sep 2025)