
Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation (2404.15845v1)

Published 24 Apr 2024 in cs.CL

Abstract: Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. LLMs have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback ultimately remains low.


Summary

  • The paper demonstrates that prompt patterns, in particular persona-based ones, can improve automated essay scoring (AES) performance.
  • It shows that one-shot in-context learning slightly outperforms few-shot prompting, supporting both precise feedback and improved scoring.
  • It finds that generating feedback before scoring appears to push the model toward deeper semantic analysis of the essay, yielding more informed scores.

Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Introduction

The paper explores prompt-based methods for leveraging LLMs to handle Automated Essay Scoring (AES) and feedback generation jointly. It investigates the effectiveness of several prompting strategies in zero-shot and few-shot settings, hypothesizing that solving AES can yield insights that enhance feedback generation, and vice versa.

Methodology

The authors experiment with various prompt patterns and task instruction types to assess their influence on model performance. The prompts are designed around different personas such as a teacher's assistant and an educational researcher to provide context and possibly affect the model's output characteristics.

  • Prompt Patterns: A base pattern and several persona patterns are tested to see how imposing different roles on the model influences performance.
  • Task Instructions: To explore the interaction between scoring and feedback, the authors rotate through instructions that prioritize scoring, feedback, or both in varied sequences (both dimensions are sketched in the example after this list).
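A minimal sketch of how these two prompt dimensions might be combined. The persona texts, rubric scale, and ordering labels below are illustrative stand-ins, not the paper's exact prompts:

```python
# Two prompt dimensions: persona patterns and task-instruction orderings.
# All wording here is hypothetical; the paper's actual prompts may differ.

PERSONAS = {
    "base": "",
    "teacher_assistant": "You are a teacher's assistant who grades student essays.",
    "educational_researcher": "You are an educational researcher studying essay writing.",
}

TASK_ORDERS = {
    "score_only": "Assign the essay a holistic score from 1 to 6.",
    "feedback_only": "Write constructive feedback on the essay.",
    "score_then_feedback": (
        "First assign the essay a holistic score from 1 to 6, "
        "then write constructive feedback explaining the score."
    ),
    "feedback_then_score": (
        "First write constructive feedback on the essay, "
        "then assign a holistic score from 1 to 6 informed by your feedback."
    ),
}

def build_prompt(essay: str, persona: str, task_order: str) -> str:
    """Combine a persona pattern with a task-instruction ordering."""
    parts = [PERSONAS[persona], TASK_ORDERS[task_order], f"Essay:\n{essay}"]
    return "\n\n".join(p for p in parts if p)  # drop the empty base persona

print(build_prompt("My summer vacation was ...",
                   "educational_researcher", "feedback_then_score"))
```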

A substantial part of the methodology revolves around in-context learning, where the LLM is given zero, one, or several scored example essays, each accompanied by scoring reasoning, with the aim of improving response quality by teaching through examples (sketched below).
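One way this could look in practice, assuming a small pool of scored demonstrations; the example essays, scores, and field names are made up for illustration:

```python
# Zero-/one-/few-shot prompt assembly: prepend n scored demonstrations,
# each with a short scoring rationale. All example data is hypothetical.

EXAMPLES = [
    {"essay": "Example essay A ...", "score": 4,
     "rationale": "Clear thesis but weak paragraph transitions."},
    {"essay": "Example essay B ...", "score": 2,
     "rationale": "Frequent grammar errors obscure the argument."},
]

def with_in_context_examples(prompt: str, n_shots: int) -> str:
    """Prepend n_shots scored demonstrations (0 = zero-shot)."""
    demos = [
        f"Essay:\n{ex['essay']}\nScore: {ex['score']}\nReasoning: {ex['rationale']}"
        for ex in EXAMPLES[:n_shots]
    ]
    return "\n\n".join(demos + [prompt])

print(with_in_context_examples("Essay:\nMy summer vacation was ...", 1))
```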

Results and Discussion

In terms of AES, the paper finds that certain prompt patterns like the "educational researcher" tend to yield slightly better scoring performance. In-context learning shows promise, particularly with one-shot examples, which slightly outperform the more complex few-shot setting.
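Scoring agreement in AES is conventionally measured with quadratic weighted kappa (QWK); assuming the predicted and reference scores are integer holistic scores, it can be computed with scikit-learn. The score lists below are fabricated for illustration:

```python
# Quadratic weighted kappa (QWK), the standard AES agreement metric.
from sklearn.metrics import cohen_kappa_score

gold = [4, 3, 5, 2, 4, 3]  # human reference scores (made-up data)
pred = [4, 3, 4, 2, 5, 3]  # scores parsed from the LLM's output

qwk = cohen_kappa_score(gold, pred, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```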

For feedback generation, the best results are obtained when the model focuses solely on generating feedback without the burden of scoring. The feedback quality is judged based on its helpfulness, which is assessed both automatically using LLMs and manually through human evaluation. The manual evaluations indicate that clear and precise feedback, which directly addresses and explains essay issues, is deemed most helpful by the evaluators.
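A sketch of how the automatic helpfulness assessment could be set up, with a second LLM acting as judge. Here `call_llm` is a hypothetical stand-in for whatever chat-completion client is used, and the 1-to-5 scale and prompt wording are assumptions, not the paper's exact protocol:

```python
# LLM-as-judge helpfulness rating. `call_llm` is a hypothetical callable
# that takes a prompt string and returns the model's reply as a string.

JUDGE_TEMPLATE = """You are evaluating feedback given on a student essay.

Essay:
{essay}

Feedback:
{feedback}

Rate how helpful this feedback would be for the student, on a scale
from 1 (not helpful) to 5 (very helpful). Answer with the number only."""

def rate_helpfulness(essay: str, feedback: str, call_llm) -> int:
    """Ask a judge LLM for a 1-5 helpfulness rating and parse the reply."""
    reply = call_llm(JUDGE_TEMPLATE.format(essay=essay, feedback=feedback))
    return int(reply.strip().split()[0])
```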

Interestingly, strategies in which feedback generation precedes scoring seem to yield better results than those where scoring comes first. This suggests that formulating feedback forces the model into deeper semantic processing of the text, which in turn supports more informed scoring.

Implications and Future Work

The integration of AES with feedback generation signifies a substantial step forward in educational applications of NLP, highlighting a dual utility where scoring systems are not only evaluative but also formative. These findings have practical implications for developing more holistic educational tools that assist learning by providing both qualitative insights and quantitative evaluations.

Theoretically, the paper presents an interesting case for sequential processing of related NLP tasks, showing that the order in which tasks are executed could affect the performance of LLMs. Future research could explore this sequential interaction further, perhaps integrating more complex multitask learning frameworks or investigating the effects of simultaneous task processing using more advanced model architectures.

Challenges and Limitations

The reliance on detailed rubrics and the need for example-based in-context learning could limit the application of these methods in scenarios where such resources are scarce. Moreover, real-world application of the generated feedback and its reception by actual students remain to be tested.

The paper opens up several avenues for future exploration, including the refinement of feedback generation methods to improve clarity and usefulness, and adapting the techniques to broader educational contexts where detailed scoring rubrics may not be available.
