
Exploring Generative AI assisted feedback writing for students' written responses to a physics conceptual question with prompt engineering and few-shot learning (2311.06180v2)

Published 10 Nov 2023 in physics.ed-ph

Abstract: Instructor feedback plays a critical role in students' development of conceptual understanding and reasoning skills. However, grading student written responses and providing personalized feedback can take a substantial amount of time. In this study, we explore using GPT-3.5 to write feedback on student written responses to conceptual questions with prompt engineering and few-shot learning techniques. In stage one, we used a small portion (n=20) of the student responses on one conceptual question to iteratively train GPT. Four of the responses, paired with human-written feedback, were included in the prompt as examples for GPT. We tasked GPT with generating feedback on the other 16 responses, and we refined the prompt after several iterations. In stage two, we gave four student researchers the 16 responses as well as two versions of feedback, one written by the authors and the other by GPT. Students were asked to rate the correctness and usefulness of each feedback and to indicate which one was generated by GPT. The results showed that students tended to rate the feedback written by humans and by GPT equally on correctness, but they all rated the feedback by GPT as more useful. Additionally, the success rates of identifying GPT's feedback were low, ranging from 0.1 to 0.6. In stage three, we tasked GPT with generating feedback on the rest of the student responses (n=65). The feedback was rated by four instructors based on the extent of modification needed if they were to give the feedback to students. All the instructors rated approximately 70% of the feedback statements as needing only minor or no modification. This study demonstrated the feasibility of using Generative AI as an assistant for generating feedback on student written responses with only a relatively small number of examples. AI assistance can be one solution for substantially reducing the time spent on grading student written responses.

Citations (9)

Summary

  • The paper demonstrates using GPT-3.5 with prompt engineering and few-shot learning to generate feedback on physics students' written responses, finding that roughly 70% of the generated feedback needed only minor or no edits and that students rated it as more useful than human-written feedback.
  • Instructor evaluation showed approximately 70% of the AI-generated feedback required minimal or no modification, significantly reducing potential grading effort.
  • Student evaluators rated the generative AI feedback as consistently more useful than human-generated feedback and had difficulty identifying whether feedback was AI-generated.

The paper investigates the feasibility of employing GPT-3.5, using prompt engineering and few-shot learning techniques, to generate instructor-like feedback on student written responses to a physics conceptual question. The work is situated in the context of physics education research, where providing personalized, constructive feedback is crucial yet time-consuming, particularly in large-enrollment courses.

The paper is organized into three distinct stages:

  • Stage 1 – Prompt Engineering and Feedback Generation:
    • A set of 20 student responses was initially categorized into four distinct groups based on the correctness of the conclusion and explanation.
    • Four representative response–feedback pairs (one from each category) were manually selected and used as in-context examples within a carefully engineered prompt.
    • The prompt included extensive contextual information on common student misconceptions (e.g., the misinterpretation of force transmission and division), relevant Newtonian mechanics principles, and explicit feedback guidelines.
    • Iterative refinements were made to the prompt until GPT-3.5 consistently produced feedback without blatantly replicating documented misconceptions.
    • Using this prompt, GPT generated feedback for the remaining 16 student responses. The generated outputs were initially vetted by the authors for correctness.
  • Stage 2 – Student Researcher Evaluation:
    • Four student researchers (one graduate and three undergraduates with significant physics background) evaluated paired feedback (human-generated vs. GPT-generated) for the same 16 responses.
    • The evaluation criteria included the perceived scientific correctness and usefulness of the feedback, as well as the ability to distinguish AI-generated text.
    • While ratings for correctness did not exhibit a consistent bias toward either source, all student researchers rated the GPT-generated feedback as more useful. This difference in perceived usefulness is attributed to the more elaborated and context-responsive nature of the GPT output, particularly for responses with either complete explanations or missing reasoning.
  • Stage 3 – Instructor Evaluation:
    • The refined prompt was then applied to generate feedback for 65 additional student responses.
    • Four physics instructors assessed these outputs using a 0–3 rating scale based on the extent of modifications required before the feedback could be disseminated to students.
    • Approximately 70% (ranging from 68% to 78%) of the feedback required only minor or no modifications, and the average rating across instructors was 2.06. This suggests that the AI-generated feedback was largely acceptable and required minimal human intervention.

The method leverages few-shot learning by embedding domain-specific examples into the prompt, thus enabling GPT-3.5 to appropriately address common student preconceptions in Newtonian mechanics without extensive fine-tuning. The paper demonstrates that even with a small number of pre-labeled examples, a large language model (LLM) can be conditioned to deliver detailed and pedagogically sound feedback.
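
As an illustration only, the sketch below shows how such a few-shot prompt might be assembled and sent to GPT-3.5. The question text, misconception notes, guidelines, and example pairs are placeholders rather than the paper's actual prompt, and the call assumes the current OpenAI Python client (`gpt-3.5-turbo` via the chat completions endpoint); the authors do not publish their exact prompt or code.

```python
# Hypothetical sketch of a few-shot feedback prompt; the prompt content,
# example pairs, and guidelines used in the paper are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You are a physics instructor writing feedback on student
responses to a conceptual Newtonian mechanics question.
Question: <conceptual question text>
Common misconceptions: <e.g., force being "transmitted" or "divided" across objects>
Feedback guidelines: address both the conclusion and the reasoning, be encouraging,
and ask a probing question when the explanation is missing or incomplete."""

# One worked response-feedback pair per response category (placeholders).
FEW_SHOT_EXAMPLES = [
    ("<correct conclusion, correct explanation>", "<instructor-written feedback>"),
    ("<correct conclusion, flawed explanation>", "<instructor-written feedback>"),
    ("<incorrect conclusion, partial explanation>", "<instructor-written feedback>"),
    ("<incorrect conclusion, no explanation>", "<instructor-written feedback>"),
]

def generate_feedback(student_response: str) -> str:
    """Build a few-shot chat prompt and ask GPT-3.5 for feedback on one response."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for response, feedback in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": f"Student response: {response}"})
        messages.append({"role": "assistant", "content": feedback})
    messages.append({"role": "user", "content": f"Student response: {student_response}"})

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.7,
    )
    return completion.choices[0].message.content

print(generate_feedback("The force is split equally between the two boxes, so ..."))
```

Pairing each example student response with its human-written feedback as alternating user/assistant turns is one straightforward way to realize the few-shot conditioning the paper describes; the iterative prompt refinement in Stage 1 would then amount to editing the system prompt and example pairs until the outputs stop echoing the documented misconceptions.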

Key numerical and technical observations include:

  • Feedback Correctness:
    • Student researchers rated both human and GPT-generated feedback similarly in terms of correctness.
  • Feedback Usefulness:
    • GPT-generated feedback consistently received higher usefulness scores across all student evaluators. For instances where the GPT output was unanimously rated more useful, the added detail and probing questions were noted as advantageous.
  • Detection Rate:
    • Evaluators' ability to correctly identify AI-generated feedback was low, with success rates ranging from 0.1 to 0.6, indicating that the outputs were largely perceived as human-like.
  • Instructor Assessment:
    • Instructor ratings indicate a reduction in grading effort, with about 70% of the AI-generated responses deemed nearly or entirely ready for student use.
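
To make the rating arithmetic concrete, the short sketch below aggregates a hypothetical set of instructor ratings on the 0–3 scale, assuming 3 means no modification needed and 2 means minor modification; the rating values are invented for illustration and are not the study's data.

```python
# Aggregate one instructor's ratings of GPT-generated feedback on the 0-3 scale.
# Assumed rubric: 3 = no modification, 2 = minor, 1 = major, 0 = not usable.
# These rating values are hypothetical placeholders, not the study's data.
ratings = [3, 2, 2, 1, 3, 2, 0, 3, 2, 2]

minor_or_none = sum(r >= 2 for r in ratings) / len(ratings)
average_rating = sum(ratings) / len(ratings)

print(f"Minor or no modification needed: {minor_or_none:.0%}")  # 80% for this sample
print(f"Average rating: {average_rating:.2f}")                  # 2.00 for this sample
```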

Limitations noted in the paper include the reliance on a single conceptual question from an introductory physics course, the manual categorization of student responses (which may not scale efficiently), and the relatively small sample sizes of both student and instructor evaluators. Furthermore, the paper acknowledges issues related to potential model hallucinations and biases inherent in generative AI systems, recommending a human-in-the-loop approach for final feedback dissemination.

In summary, the paper offers a technical exploration of leveraging prompt engineering and few-shot learning with GPT-3.5 to generate meaningful, context-aware feedback on student written responses to a physics conceptual question. The results highlight strong potential for reducing grading workload while maintaining high consistency and perceived usefulness in feedback delivery, albeit with cautions regarding scalability and the need for further validation across more diverse student responses and educational contexts.