A systematic comparison of Large Language Models for automated assignment assessment in programming education: Exploring the importance of architecture and vendor (2509.26483v1)

Published 30 Sep 2025 in cs.CY

Abstract: This study presents the first large-scale, side-by-side comparison of contemporary LLMs in the automated grading of programming assignments. Drawing on over 6,000 student submissions collected across four years of an introductory programming course, we systematically analysed the distribution of grades, differences in mean scores and variability reflecting stricter or more lenient grading, and the consistency and clustering of grading patterns across models. Eighteen publicly available models were evaluated: Anthropic (claude-3-5-haiku, claude-opus-4-1, claude-sonnet-4); Deepseek (deepseek-chat, deepseek-reasoner); Google (gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash-lite, gemini-2.5-flash, gemini-2.5-pro); and OpenAI (gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt-4o-mini, gpt-4o, gpt-5-mini, gpt-5-nano, gpt-5). Statistical tests, correlation and clustering analyses revealed clear, systematic differences between and within vendor families, with "mini" and "nano" variants consistently underperforming their full-scale counterparts. All models displayed high agreement with the model consensus, as measured by the intraclass correlation coefficient, but only moderate agreement with human teachers' grades, indicating a persistent gap between automated and human assessment. These findings underscore that the choice of model for educational deployment is not neutral and should be guided by pedagogical goals, transparent reporting of evaluation metrics, and ongoing human oversight to ensure accuracy, fairness and relevance.

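To make the abstract's agreement and clustering analyses concrete, here is a minimal Python sketch of that kind of pipeline. This is a hypothetical reconstruction, not the authors' code: the input file `grades.csv`, its layout (one row per submission, one numeric column per model), the leave-one-out mean as the "consensus", the ICC(2,1) variant, and the number of clusters are all assumptions, since the abstract does not specify these details.

```python
# Hypothetical sketch of the abstract's agreement and clustering analysis.
# Assumed input: "grades.csv" with one row per submission and one numeric
# column per model (file name and layout are illustrative, not from the paper).
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

def icc_2_1(x, y):
    """ICC(2,1): two-way random effects, single rater, absolute agreement.
    The paper does not state which ICC variant it used; this is one
    common choice."""
    ratings = np.column_stack([x, y])                # n subjects x 2 raters
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ratings.mean(axis=1).var(ddof=1)   # between-subjects MS
    ms_cols = n * ratings.mean(axis=0).var(ddof=1)   # between-raters MS
    resid = (ratings
             - ratings.mean(axis=1, keepdims=True)
             - ratings.mean(axis=0, keepdims=True)
             + grand)
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

grades = pd.read_csv("grades.csv")                   # submissions x models

# Agreement of each model with the leave-one-out consensus grade
# (here: the mean of all other models' grades per submission).
for model in grades.columns:
    consensus = grades.drop(columns=model).mean(axis=1)
    print(f"{model}: ICC vs consensus = {icc_2_1(grades[model], consensus):.3f}")

# Cluster models by grading pattern: 1 - Spearman rho as a distance,
# average-linkage hierarchical clustering (cluster count assumed).
rho = grades.corr(method="spearman").values
condensed = 1.0 - rho[np.triu_indices_from(rho, k=1)]
clusters = fcluster(linkage(condensed, method="average"),
                    t=3, criterion="maxclust")
print(dict(zip(grades.columns, clusters)))
```

Using a leave-one-out consensus is a deliberate choice in this sketch: comparing each model against a mean that includes its own grades would inflate the measured agreement.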