Introduction to NERIF and GPT-4V
In the field of educational technology, there has been significant progress toward automating time-consuming tasks such as scoring student work. One particularly challenging area has been the evaluation of student-drawn scientific models, which are crucial for assessing students' understanding of scientific phenomena. Recent advances in AI, specifically the development of GPT-4V, a multimodal large language model capable of interpreting images, offer novel possibilities for scoring these drawn models efficiently.
The Study of NERIF
Researchers at the University of Georgia conducted an innovative study to explore this potential. They introduced NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning), a method for coaching GPT-4V to evaluate student-drawn models with minimal human input. The study drew on a dataset of 900 student-drawn models that had previously been scored by human experts and that represented varying levels of proficiency under the given scoring rubrics. GPT-4V's assessments were then compared against the consensus scores of the human experts to measure accuracy.
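To make the workflow concrete, here is a minimal sketch of how a NERIF-style prompt, combining a rubric, notes, and a student drawing, might be sent to a vision model through the OpenAI Chat Completions API. The rubric and notes below are illustrative placeholders, not the authors' actual materials, and the model identifier and file name are assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder rubric and notes -- the paper's actual NERIF materials are
# more detailed and task-specific; these strings are assumptions.
RUBRIC = (
    "Score the drawn model as Beginning, Developing, or Proficient.\n"
    "Proficient: the drawing shows the full mechanism of the phenomenon."
)
NOTES = "Instructional Notes: attend to labeled arrows; ignore artistic quality."

def encode_image(path: str) -> str:
    """Base64-encode a student drawing so it can be sent inline."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def score_drawing(image_path: str, system_prompt: str = RUBRIC + "\n" + NOTES) -> str:
    """Ask the vision model for a rubric-based score and its reasoning."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name; substitute your deployment
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Score this student-drawn model and explain your reasoning."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
                ],
            },
        ],
    )
    return response.choices[0].message.content

print(score_drawing("student_model_001.png"))  # hypothetical file name
```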
GPT-4V's Performance in Scoring Student Models
The results illuminated both the capabilities and the limitations of GPT-4V in educational assessment. On average, GPT-4V scored the models with moderate accuracy, and its accuracy tended to drop for more proficient models. This suggests that complex student work poses a greater challenge and points to the need for further refinement of such AI systems. Interestingly, even when GPT-4V assigned incorrect scores, its outputs were often still interpretable by science content experts, hinting at GPT-4V's potential for use as an assistive tool in educational settings.
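As a concrete illustration of the kind of per-level accuracy analysis described here, the snippet below computes agreement between human consensus labels and model scores, broken down by proficiency level. The labels are invented for demonstration and are not the study's data.

```python
from collections import defaultdict

# Invented example labels -- NOT the study's data.
human = ["Beginning", "Developing", "Proficient", "Proficient", "Beginning", "Developing"]
model = ["Beginning", "Proficient", "Proficient", "Developing", "Beginning", "Developing"]

correct = defaultdict(int)
total = defaultdict(int)
for h, m in zip(human, model):
    total[h] += 1
    correct[h] += int(h == m)

for level in ("Beginning", "Developing", "Proficient"):
    acc = correct[level] / total[level] if total[level] else float("nan")
    print(f"{level}: accuracy {acc:.2f} over {total[level]} drawings")
```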
Insights and Future Directions
Through qualitative analysis, the researchers identified key behaviors of GPT-4V. It demonstrated the ability to decipher and analyze visual information against predetermined rubrics and then articulate its reasoning in natural language, a significant departure from the less transparent scoring methods of traditional systems. Furthermore, the paper highlighted the influence of 'Instructional Notes' within the NERIF method, which gave GPT-4V beneficial direction and resulted in improved performance.
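One way to probe the contribution of the Instructional Notes, sketched under the same assumptions as the hypothetical score_drawing helper above, is a simple ablation: score the same drawing with the rubric alone and with the rubric plus notes, then compare the outputs (or, at scale, the agreement with human labels).

```python
def compare_prompts(image_path: str) -> None:
    """Crude ablation: score one drawing with and without the Instructional Notes."""
    variants = {
        "rubric only": RUBRIC,
        "rubric + notes": RUBRIC + "\n" + NOTES,
    }
    for name, prompt in variants.items():
        print(f"--- {name} ---")
        print(score_drawing(image_path, system_prompt=prompt))

compare_prompts("student_model_001.png")  # hypothetical file name
```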
The paper underscores both the promise and the challenges of employing GPT-4V within science education. The capacity of such AI to interpret complex images and provide feedback could revolutionize the assessment landscape. However, the acknowledged gaps in scoring accuracy indicate a need for ongoing research and development to harness GPT-4V's capabilities effectively.
As AI continues to evolve, it is anticipated that updates and broader access to GPT-4V's API will address current limitations and improve its precision, reliability, and utility. For educators and researchers, the continued integration of AI like GPT-4V into education presents transformative opportunities to reduce workload and enhance the feedback given to students.
In conclusion, the NERIF method as applied to GPT-4V for educational assessments represents an exciting advancement yet calls for mindful consideration and continued innovation to ensure that its application complements the complex demands of scoring student-drawn models in science education.