Can Large Language Models Automatically Score Proficiency of Written Essays? (2403.06149v2)

Published 10 Mar 2024 in cs.CL and cs.AI

Abstract: Although many methods have been proposed to address the problem of automated essay scoring (AES) over the last 50 years, their effectiveness still leaves much to be desired. LLMs are transformer-based models that demonstrate extraordinary capabilities on a variety of tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to determine whether these models can perform this task and, if so, how their performance compares with state-of-the-art (SOTA) models at two levels, holistically and per individual writing trait. We employed prompt-engineering tactics to design four different prompts that bring out the models' maximum potential on this task. Our experiments on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends heavily on the model and the nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the gap between the two LLMs and SOTA models in prediction quality, the LLMs provide feedback that can enhance the quality of the essays, which can potentially help both teachers and students.
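AES work on the ASAP dataset is conventionally evaluated with quadratic weighted kappa (QWK), which measures agreement between model-assigned and human-assigned integer scores while penalizing large disagreements more heavily. The abstract does not name the metric explicitly, so this is a minimal, self-contained sketch of the standard QWK computation, not the authors' code:

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """QWK between two lists of integer ratings on [min_rating, max_rating]."""
    n = max_rating - min_rating + 1
    total = len(y_true)
    # Observed agreement matrix: O[i][j] counts (true=i, predicted=j) pairs.
    O = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        O[t - min_rating][p - min_rating] += 1
    # Marginal rating histograms, used to form the expected matrix.
    hist_t = Counter(t - min_rating for t in y_true)
    hist_p = Counter(p - min_rating for p in y_pred)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic disagreement weight.
            w = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_t[i] * hist_p[j] / total
            num += w * O[i][j]
            den += w * expected
    return 1.0 - num / den
```

Perfect agreement yields 1.0, chance-level agreement yields roughly 0.0, and systematic disagreement is negative, which is why QWK is preferred over plain accuracy for ordinal essay scores.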

Authors (4)
  1. Watheq Mansour (2 papers)
  2. Salam Albatarni (3 papers)
  3. Sohaila Eltanbouly (3 papers)
  4. Tamer Elsayed (22 papers)
Citations (7)