Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

Published 7 Jun 2026 in cs.SE, cs.AI, and cs.CL | (2606.08400v1)

Abstract: Graduate-level research reading report assessment creates a substantial labor burden for educators. While LLMs hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly regarding grading consistency, the lack of which represents a primary obstacle to educational fairness. This paper proposes a human-aligned LLM-assisted grading workflow and presents a case study based on 180 student submissions from a graduate advanced software engineering course. We evaluate two mainstream LLMs, Grok and GPT, in terms of grading consistency and alignment with human scores. We find LLMs exhibit distinct levels of intra-model consistency and significant inter-model grading inconsistencies, while simple ensemble approaches cannot improve alignment with human evaluation. Critically, continuous interaction history drives systematic drift in models' grading standards away from human expert scores. Our findings demonstrate LLMs' potential in reducing grading workload for educators in graduate education, while highlighting that indiscriminate LLM grading may introduce systemic unfairness, suggesting that specific operational practices are required to mitigate such disparities.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper establishes an LLM-assisted grading pipeline that benchmarks Grok and GPT models to mirror manual evaluations in graduate software engineering courses.
The paper demonstrates that sequential interaction histories cause significant grading drift, undermining consistency and introducing covert bias in LLM outputs.
The paper finds that isolating grading sessions and careful model selection are essential strategies for mitigating systemic inequities in automated assessments.

Impact of Model Architectures and Interaction Histories on LLM-Based Grading in Advanced Software Engineering Courses

Introduction

The automation of academic assessment through LLMs has potential to reshape graduate education workflows, particularly within domains where high-order cognitive evaluation, such as critical literature analysis, is pedagogically central. This study addresses the pressing challenge of ensuring consistency and fairness in automated grading by LLMs, which has direct ramifications for educational equity.

Methodological Framework

The paper establishes an LLM-assisted grading pipeline designed to closely parallel established manual practices in graduate-level Software Engineering courses. The workflow is carefully engineered: instructors input curated research papers into the LLM to generate summaries, define assignment requirements, and feed these, along with student reading reports, to the model for evaluation. Assessment output comprises paper identification, letter grade assignment, and constructive feedback.

Two LLMs—Grok (Grok-4.1-Fast) and GPT (GPT-oss-120b)—are benchmarked on a dataset of 180 authentic student reading report submissions. Human annotations, provided by multiple Computer Science doctoral candidates, serve as reference baselines. Grading consistency is analyzed within and across models, and effects of sequential grading histories are systematically interrogated. The evaluation protocol employs both distributional analyses (Wilcoxon signed-rank test) and reliability measurement (Intraclass Correlation Coefficient, ICC), complemented by Hit@k metrics for assessing model proficiency at identifying low-quality submissions.

Empirical Findings

Grading Consistency and Model Disparities

The results indicate pronounced intra- and inter-model inconsistencies:

Within-Model Consistency: Grok demonstrates moderate ranking consistency across repeated runs, whereas GPT's results display poor stability, corroborating prior findings on LLM volatility in evaluative tasks.
Inter-Model Divergence: ICC values for Grok versus GPT are consistently in the "poor" regime, even with identical prompts and requirements. This signifies that the grading rubrics implicitly constructed by different underlying architectures are fundamentally misaligned, precluding model interchangeability in high-stakes grading contexts.
Inaccuracy of Simple Ensembles: Naïve ensemble averaging does not consistently increase alignment with human grader identification of lower-quality work, and may on occasion degrade result reliability for critical identification tasks (as seen in Hit@20% metrics).
Identification of Low-Quality Submissions: Grok consistently outperforms GPT in identifying human-flagged lower-tier assignments, achieving up to a 55.17% hit rate at the 40% ranking threshold.

Impact of Grading Histories

Interaction history is identified as a critical confounder in LLM grading:

History-Induced Drift: The inclusion of sequential history in batch grading produces statistically significant shifts in model output distributions, independent of submission ordering. This demonstrates a systematic drift in grading standards, attributable to context accumulation. The effect is robust and reproducible: pairwise Wilcoxon tests (p < 0.001) confirm these shifts.
Ranking Volatility: The ICC between independent (no history) and sequential modes drops precipitously, indicating poor reproducibility of student rankings when history is preserved across submissions. Identical assignments may receive different scores solely based on their order in the grading queue, introducing covert bias.
Hit@k Degradation: Grading in isolation (fresh session per assignment) achieves the highest alignment with human graders in identifying bottom-tier work, whereas both ascending and descending sequential grading dilute this effectiveness.

Auxiliary Observations

Hallucination Absence: The latest LLMs did not produce hallucinated paper identifications on this domain-specific task, and sampled outputs contained no factual errors, suggesting a maturity for reading report assessment use cases.
Implementation Caveats: Occasional interaction instability (API anomalies, missing responses) necessitated response validity checks and conservative sample sizing.

Practical and Theoretical Implications

The findings substantiate the claim that model architecture choice and inferential context management decisively condition both the fairness and reproducibility of LLM-assisted grading. Indiscriminate application of LLMs for high-stakes educational assessment is liable to inject systemic inequities, especially if session history or model selection is left uncontrolled.

From a practical standpoint, rigorous cross-model comparison should be a prerequisite to deployment. Batch grading with persistent context must be avoided unless robust mitigation strategies (e.g., systematic context reset, explicit prompt designs instructing history suppression) are validated. The potential for bias propagation through interaction history calls for explicit audit trails and possibly, counterfactual simulation of alternative grading sequences.

Theoretically, the observed history effects underscore the unsolved challenge of context management in transformer-based models within decision-critical automation. This motivates future work on context-aware optimization and the development of AI agents explicitly trained for fair summative assessment, perhaps by leveraging histories as structured signals rather than sources of drift.

Conclusion

This study delivers an authoritative analysis of LLM-based grading workflows for graduate-level research reading reports. Strong empirical evidence is provided for substantial intra- and inter-model inconsistencies in grading, and for systematic history-induced grading drift that undermines assessment fairness. The paper recommends model- and task-specific validation and isolation of grading sessions to limit bias. Full automation of grading mandates rigorous control of LLM-specific biases and careful workflow design to ensure educational equity. Future work should expand experimental scope, investigate demographic bias, and engineer context-aware LLM agents capable of robust, fair assessment.

Markdown Report Issue