Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course (2407.05216v2)

Published 7 Jul 2024 in cs.CL

Abstract: Using LLMs for automatic evaluation has become an important evaluation method in NLP research. However, it is unclear whether these LLM-based evaluators can be applied in real-world classrooms to assess student assignments. This empirical report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students. Based on student responses, we find that LLM-based assignment evaluators are generally acceptable to students when students have free access to these LLM-based evaluators. However, students also noted that the LLM sometimes fails to adhere to the evaluation instructions. Additionally, we observe that students can easily manipulate the LLM-based evaluator to output specific strings, allowing them to achieve high scores without meeting the assignment rubric. Based on student feedback and our experience, we provide several recommendations for integrating LLM-based evaluators into future classrooms. Our observation also highlights potential directions for improving LLM-based evaluators, including their instruction-following ability and vulnerability to prompt hacking.

Summary

The paper evaluates GPT-4's impact on automating assignment grading, reducing manual workload in a large course.
It details a methodology that adjusts prompts to ensure content coherence, factual accuracy, and language fluency across diverse assignments.
The paper highlights significant challenges, including a 47% incidence of prompt hacking, underscoring the need for robust safeguards.

LLM as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

This paper presents an empirical paper on the application of a LLM for automated assignment evaluation in a university-level generative AI course. The authors employed GPT-4 as an evaluation assistant across six assignments in a course enrolled by over 1,000 students at National Taiwan University. This report captures a comprehensive overview of the deployment process, student feedback, challenges encountered, particularly with prompt hacking, and strategic recommendations for future deployments.

Deployment and Methodology

The course, titled "Introduction to Generative AI," integrated LLM-based evaluation assistants (LLM TAs) to manage grading tasks efficiently. The LLM TAs were designed to evaluate diverse assignments, including summarization tasks, supervised fine-tuning of models, and safety assessments of generative AI outputs. Assignments were evaluated on several established criteria: content completeness, factual accuracy, language fluency, content coherence, and avoidance of repetition.

GPT-4 was selected as the LLM TA due to its demonstrated proficiency in various NLP tasks, as evidenced by prior works. Students could access the LLM TA free of charge through the DaVinci platform, powered by MediaTek, enabling them to receive real-time feedback and refine their submissions iteratively by regenerating evaluation results until they were satisfied with their scores.

Student Feedback and Acceptance

Feedback was collected via a survey at the end of the semester, focusing on the acceptability of LLM TAs, the problems encountered during their use, and the incidence of prompt hacking among students. Key findings from the student survey are:

Acceptability: LLM TAs were generally well-received, with 75% of the students finding them acceptable when made freely accessible. Students preferred having the capability to interact with LLM TAs directly, which allowed them to iteratively improve their submissions.
Problems Encountered: The major issues reported included LLM TAs not following the specified output format (51.3%) and failing to adhere to evaluation criteria, resulting in disproportionately high or low scores (33.7%).
Prompt Hacking: Nearly 47% of the students reported attempting prompt hacking to manipulate the LLM TA into giving them higher scores. Instances of prompt hacking ranged from direct instructions to complex manipulations, such as embedding new evaluation criteria within the submission.

Prompt Hacking Vulnerability

The paper highlights extensive examples and analyses of prompt hacking attempts, illustrating the significant vulnerability of LLM TAs to adversarial prompts. Examples include:

Directly instructing the LLM TA to give a perfect score.
Embedding additional evaluation criteria within the submission.
Using manipulative prompts to trick the LLM into decoding specific ASCII sequences or generating and evaluating its own essay.

Despite refining evaluation prompts to mitigate prompt hacking, the measures were found insufficient as students creatively bypassed safeguards, underscoring the challenge of making LLM-based evaluation secure.

Practical and Theoretical Implications

The research has several practical implications:

Enhanced Feedback Loop: LLM TAs can significantly enhance the feedback loop in educational settings by providing instant evaluations, allowing students to iteratively improve their work.
Scalability: The ability to deploy LLM TAs at scale could drastically reduce workload for educators, particularly in large courses.
Prompt Hacking Mitigation: The paper highlights the need for more sophisticated defenses against prompt hacking, possibly involving multiple layers of automatic checks and human review.

Theoretically, the results call for a deeper understanding of how LLM-based evaluators can be aligned more closely with human evaluators, addressing discrepancies and enhancing the robustness of automated evaluations.

Future Directions

While the paper acknowledges the current state-of-the-art in LLM technology, it also opens avenues for future research:

Improved Evaluation Prompts: Refining prompt designs to better withstand adversarial inputs.
Hybrid Evaluation Models: Combining LLM evaluations with human review to balance scalability with accuracy.
Advanced Detection Mechanisms: Developing enhanced mechanisms for automatic detection and mitigation of prompt hacking attempts.

Conclusion

This comprehensive evaluation of LLM TAs in a real-world educational context reveals both the potential and challenges of integrating advanced NLP models into classroom settings. While LLMs like GPT-4 offer valuable capabilities, their vulnerability to prompt hacking and occasional failure to adhere strictly to evaluation criteria suggest that further refinements are necessary. The insights drawn from this paper provide a foundational understanding that future research and educational practices can build upon, optimizing the blend of AI and human oversight in academic assessments.

PDF Markdown

Related Papers

Tweets

https://twitter.com/dcml0714/status/1810500057386647883

https://twitter.com/fly51fly/status/1812244446169428197

https://twitter.com/knishimae0531/status/1812339903944089665

https://twitter.com/GptMaestro/status/1811043571115643254

YouTube

Show All Videos