- The paper demonstrates the feasibility of using GPT-4 to grade handwritten university mathematics exams, achieving an average accuracy of roughly 62% against human benchmarks.
- The paper employs several OCR methods and two grading rubric variants to convert handwritten answers to digital text, and assesses grading reliability with metrics such as Krippendorff's alpha.
- The paper outlines practical implications for reducing grading workloads and scaling assessments while highlighting the need for robust confidence measures and human oversight.
 
 
An Exploration of AI-Assisted Grading in University-Level Mathematics
The paper, "AI-assisted Automated Short Answer Grading of Handwritten University Level Mathematics Exams," examines the efficacy of employing AI to augment grading processes within the domain of higher education mathematics. The paper pivots around assessing the viability of LLMs, specifically GPT-4, for automating the grading of handwritten exam responses, aiming to alleviate the demanding workload traditionally associated with grading mathematics tasks that involve semi-open or short-answer formats.
The investigation begins by acknowledging the constraints of current grading systems, which often default to closed-format assessments due to resource limitations and thereby neglect more meaningful assessment types such as short-answer questions. Through this lens, the paper explores whether GPT-4 can reliably grade such responses after handwritten submissions are first converted to digital text with Optical Character Recognition (OCR) tools such as Mathpix and GPT-4V (vision-enabled GPT-4).
Methodology and Findings
The researchers adopted several OCR approaches to digitize handwritten submissions into machine-readable form, trialing both pre-extracted answer boxes and full-page processing to determine the conditions for accurate transcription. Success rates varied: pre-extraction often missed partial answers when students wrote outside the boxed areas, whereas whole-page OCR was more comprehensive but prone to misinterpreting non-answer content because of extraneous graphical elements.
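To make the full-page conversion step concrete, the sketch below sends a scanned exam page to a vision-capable GPT-4 model via the OpenAI Python client and asks for a LaTeX transcription. The model name, prompt wording, and file handling are assumptions for illustration, not the paper's exact pipeline (which also trialed Mathpix and pre-extracted answer boxes).

```python
# Minimal sketch of full-page OCR with a vision-capable GPT-4 model.
# Model name, prompt, and I/O details are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ocr_exam_page(image_path: str) -> str:
    """Return a LaTeX transcription of one handwritten exam page."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Transcribe the handwritten mathematics on this exam page "
                          "into LaTeX. Ignore printed question text and stray marks.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```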
Grading rubrics were then scrutinized for AI adaptability. Two main configurations were implemented: the original rubric, mirroring human grading expectations, and an itemized version that decomposed multi-criteria judgments into more granular binary decisions. The team evaluated these approaches across the different OCR-derived datasets, estimating reliability by sampling repeated grading runs and using the standard deviation of the resulting grades as a confidence measure.
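One way to read the sampling-and-confidence idea is as repeated grading runs whose spread serves as a confidence signal. The sketch below is a simplified interpretation rather than the authors' implementation; `grade_once` is a hypothetical placeholder for a single GPT-4 call that scores an answer against an itemized rubric.

```python
# Sketch: repeat the grading call and use the spread of scores as a confidence proxy.
# `grade_once` is a hypothetical stand-in for one GPT-4 grading request.
import statistics
from typing import Callable

def grade_with_confidence(
    answer: str,
    rubric_items: list[str],
    grade_once: Callable[[str, list[str]], float],
    n_samples: int = 5,
) -> tuple[float, float]:
    """Grade an answer n_samples times; return (mean score, standard deviation)."""
    scores = [grade_once(answer, rubric_items) for _ in range(n_samples)]
    mean_score = statistics.mean(scores)
    spread = statistics.stdev(scores) if n_samples > 1 else 0.0
    return mean_score, spread
```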
Comparing AI-generated grades with established human-assigned ground truths revealed an average accuracy of around 62%, with Krippendorff's alpha indicating moderate inter-rater reliability between AI and human judgments. However, the analyses highlighted that AI performance was contingent on precise, finely specified grading rules that made each expected outcome explicit.
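Agreement with the human ground truth can be quantified along the lines of the sketch below, which assumes the third-party `krippendorff` package; the exact-match notion of accuracy and the chosen level of measurement are assumptions, not necessarily those used in the paper.

```python
# Sketch: accuracy and Krippendorff's alpha between human and AI grades.
# Assumes the third-party `krippendorff` package (pip install krippendorff).
import krippendorff

def agreement_metrics(human_grades: list[float], ai_grades: list[float]) -> dict:
    """Exact-match accuracy plus Krippendorff's alpha over the two raters."""
    accuracy = sum(h == a for h, a in zip(human_grades, ai_grades)) / len(human_grades)
    alpha = krippendorff.alpha(
        reliability_data=[human_grades, ai_grades],
        level_of_measurement="interval",  # assumption; match the actual grade scale
    )
    return {"accuracy": accuracy, "krippendorff_alpha": alpha}
```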
Implications and Future Directions
The paper illuminates several practical implications for AI integration in educational settings, particularly within STEM disciplines. Firstly, AI assistance in grading is shown to be feasible, but it cannot yet be trusted without human oversight. The development of robust confidence measures for AI decisions therefore remains an urgent frontier: grading systems must be able to flag potentially unreliable outcomes that warrant human intervention, thereby reducing the high false-positive rates observed in some scenarios.
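Such a confidence measure could be operationalized as a simple triage rule, as in the sketch below; the threshold is arbitrary and purely illustrative, and a deployed system would calibrate it against observed error rates.

```python
# Sketch: route low-confidence AI grades to a human grader.
# The threshold is illustrative; calibrate it against observed error rates.
def needs_human_review(score_spread: float, threshold: float = 0.5) -> bool:
    """Flag a grading outcome when repeated runs disagree too much."""
    return score_spread > threshold

# Example, reusing grade_with_confidence from the earlier sketch:
# mean, spread = grade_with_confidence(answer, rubric_items, grade_once)
# if needs_human_review(spread):
#     escalate_to_grader(answer, mean)  # hypothetical escalation hook
```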
Theoretically, the research contributes to the ongoing discourse on how well LLMs transfer across domains without task-specific customization. While the paper demonstrates potential, it also underscores the nuanced requirements for effective deployment in educational environments, such as cultural and contextual adaptation in multi-language settings.
Conclusion
The exploration of AI-assisted grading in this context reflects incremental progress toward a more automated, scalable assessment framework. Yet challenges in the current methodology, particularly in handwriting recognition and confidence assessment, invite further refinement. Enabling AI to deliver evaluations reliable enough for high-stakes use without full human oversight remains the ultimate objective. Future advances in AI customization and performance prediction will be vital to realizing the full potential of such systems in academia and to shaping the future landscape of educational assessment technology.