
Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation (2404.15845v1)

Published 24 Apr 2024 in cs.CL

Abstract: Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. LLMs have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback ultimately remains low.


Summary

  • The paper demonstrates that prompt patterns, in particular persona-based ones, can improve automated essay scoring (AES) performance.
  • It shows that one-shot in-context learning slightly outperforms few-shot prompting, supporting both precise feedback and improved scoring.
  • It finds that generating feedback before scoring appears to push the model toward deeper semantic analysis of the essay, yielding more informed scores.

Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Introduction

The paper explores prompt-based methods for leveraging LLMs to handle Automated Essay Scoring (AES) and feedback generation jointly. It investigates the effectiveness of several prompting strategies in zero-shot and few-shot settings, hypothesizing that solving AES can yield insights that enhance feedback generation, and vice versa.

Methodology

The authors experiment with various prompt patterns and task instruction types to assess their influence on model performance. The prompts are designed around different personas such as a teacher's assistant and an educational researcher to provide context and possibly affect the model's output characteristics.

  • Prompt Patterns: A base pattern and several persona patterns are tested to see how imposing different roles on the model influences performance.
  • Task Instructions: To explore the interaction between scoring and feedback, the authors rotate through instructions that prioritize scoring, feedback, or both in varied sequences (both dimensions are sketched in the example after this list).
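A minimal sketch of how these two prompt dimensions might be combined. The persona texts, rubric scale, and ordering labels below are illustrative stand-ins, not the paper's exact prompts:

```python
# Two prompt dimensions: persona patterns and task-instruction orderings.
# All wording here is hypothetical; the paper's actual prompts may differ.

PERSONAS = {
    "base": "",
    "teacher_assistant": "You are a teacher's assistant who grades student essays.",
    "educational_researcher": "You are an educational researcher studying essay writing.",
}

TASK_ORDERS = {
    "score_only": "Assign the essay a holistic score from 1 to 6.",
    "feedback_only": "Write constructive feedback on the essay.",
    "score_then_feedback": (
        "First assign the essay a holistic score from 1 to 6, "
        "then write constructive feedback explaining the score."
    ),
    "feedback_then_score": (
        "First write constructive feedback on the essay, "
        "then assign a holistic score from 1 to 6 informed by your feedback."
    ),
}

def build_prompt(essay: str, persona: str, task_order: str) -> str:
    """Combine a persona pattern with a task-instruction ordering."""
    parts = [PERSONAS[persona], TASK_ORDERS[task_order], f"Essay:\n{essay}"]
    return "\n\n".join(p for p in parts if p)  # drop the empty base persona

print(build_prompt("My summer vacation was ...",
                   "educational_researcher", "feedback_then_score"))
```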

A substantial part of the methodology revolves around in-context learning, where the LLM is given zero, one, or several scored example essays, each accompanied by scoring reasoning, with the aim of improving response quality by teaching through examples (sketched below).
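One way this could look in practice, assuming a small pool of scored demonstrations; the example essays, scores, and field names are made up for illustration:

```python
# Zero-/one-/few-shot prompt assembly: prepend n scored demonstrations,
# each with a short scoring rationale. All example data is hypothetical.

EXAMPLES = [
    {"essay": "Example essay A ...", "score": 4,
     "rationale": "Clear thesis but weak paragraph transitions."},
    {"essay": "Example essay B ...", "score": 2,
     "rationale": "Frequent grammar errors obscure the argument."},
]

def with_in_context_examples(prompt: str, n_shots: int) -> str:
    """Prepend n_shots scored demonstrations (0 = zero-shot)."""
    demos = [
        f"Essay:\n{ex['essay']}\nScore: {ex['score']}\nReasoning: {ex['rationale']}"
        for ex in EXAMPLES[:n_shots]
    ]
    return "\n\n".join(demos + [prompt])

print(with_in_context_examples("Essay:\nMy summer vacation was ...", 1))
```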

Results and Discussion

In terms of AES, the paper finds that certain prompt patterns like the "educational researcher" tend to yield slightly better scoring performance. In-context learning shows promise, particularly with one-shot examples, which slightly outperform the more complex few-shot setting.
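Scoring agreement in AES is conventionally measured with quadratic weighted kappa (QWK); assuming the predicted and reference scores are integer holistic scores, it can be computed with scikit-learn. The score lists below are fabricated for illustration:

```python
# Quadratic weighted kappa (QWK), the standard AES agreement metric.
from sklearn.metrics import cohen_kappa_score

gold = [4, 3, 5, 2, 4, 3]  # human reference scores (made-up data)
pred = [4, 3, 4, 2, 5, 3]  # scores parsed from the LLM's output

qwk = cohen_kappa_score(gold, pred, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```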

For feedback generation, the best results are obtained when the model focuses solely on generating feedback without the burden of scoring. The feedback quality is judged based on its helpfulness, which is assessed both automatically using LLMs and manually through human evaluation. The manual evaluations indicate that clear and precise feedback, which directly addresses and explains essay issues, is deemed most helpful by the evaluators.
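A sketch of how the automatic helpfulness assessment could be set up, with a second LLM acting as judge. Here `call_llm` is a hypothetical stand-in for whatever chat-completion client is used, and the 1-to-5 scale and prompt wording are assumptions, not the paper's exact protocol:

```python
# LLM-as-judge helpfulness rating. `call_llm` is a hypothetical callable
# that takes a prompt string and returns the model's reply as a string.

JUDGE_TEMPLATE = """You are evaluating feedback given on a student essay.

Essay:
{essay}

Feedback:
{feedback}

Rate how helpful this feedback would be for the student, on a scale
from 1 (not helpful) to 5 (very helpful). Answer with the number only."""

def rate_helpfulness(essay: str, feedback: str, call_llm) -> int:
    """Ask a judge LLM for a 1-5 helpfulness rating and parse the reply."""
    reply = call_llm(JUDGE_TEMPLATE.format(essay=essay, feedback=feedback))
    return int(reply.strip().split()[0])
```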

Interestingly, strategies in which feedback generation precedes scoring seem to yield better results than those where scoring comes first. This suggests that formulating feedback forces the model into deeper semantic processing of the text, which in turn supports more informed scoring.

Implications and Future Work

The integration of AES with feedback generation signifies a substantial step forward in educational applications of NLP, highlighting a dual utility where scoring systems are not only evaluative but also formative. These findings have practical implications for developing more holistic educational tools that assist learning by providing both qualitative insights and quantitative evaluations.

Theoretically, the paper presents an interesting case for sequential processing of related NLP tasks, showing that the order in which tasks are executed could affect the performance of LLMs. Future research could explore this sequential interaction further, perhaps integrating more complex multitask learning frameworks or investigating the effects of simultaneous task processing using more advanced model architectures.

Challenges and Limitations

The reliance on detailed rubrics and the need for example-based in-context learning could limit the application of these methods in scenarios where such resources are scarce. Moreover, real-world application of the generated feedback and its reception by actual students remain to be tested.

The paper opens up several avenues for future exploration, including the refinement of feedback generation methods to improve clarity and usefulness, and adapting the techniques to broader educational contexts where detailed scoring rubrics may not be available.
