Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs (2401.06431v2)

Published 12 Jan 2024 in cs.CL and cs.AI

Abstract: Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of LLMs, including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.
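
One way to read the dual-process framing in the abstract is as a two-stage pipeline: a fast pass that produces an initial score, followed by a slower, reflective pass that reviews that score, generates feedback, and flags likely misgrading. The sketch below (in Python) is only a hedged illustration of that reading; the function names and the revision rule are assumptions, not the paper's implementation.

def dual_process_grade(essay, fast_score, reflective_review):
    """Two-stage grading: a quick scoring pass, then a reflective review.

    fast_score(essay) -> int: hypothetical cheap LLM call returning a score.
    reflective_review(essay, score) -> dict: hypothetical slower LLM call that
    may return "revised_score" and "feedback" keys.
    """
    initial = fast_score(essay)
    review = reflective_review(essay, initial)
    final = review.get("revised_score", initial)
    return {
        "initial_score": initial,
        "final_score": final,
        "feedback": review.get("feedback", ""),
        # A large revision between the two passes can flag potential misgrading.
        "flagged_for_review": abs(final - initial) >= 2,
    }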

Introduction

Educational institutions around the globe are constantly seeking innovative ways to provide timely and personalized feedback to learners, particularly in language education. As reliance on automated tools to supplement language learning grows, Automated Essay Scoring (AES) systems have garnered significant attention. Developing and deploying such systems is especially important in contexts with high student-to-teacher ratios, where individual feedback from educators becomes a logistical challenge. This focus has led to the exploration of LLMs as tools for AES, with their capabilities assessed against those of human instructors and traditional AES methodologies.

Enhancing AES with LLMs

LLMs such as GPT-4 and fine-tuned GPT-3.5 have made substantial strides in essay grading. While they do not surpass conventional state-of-the-art grading models in raw accuracy, they exhibit notable consistency, generalizability, and, critically, interpretability. An AES system powered by these LLMs can offer detailed explanations for its scores, a feature that commonly available AES tools often lack. Particularly where grading criteria are complex, such as evaluating the logical structure of an essay, LLMs prove adept at understanding and adhering to intricate guidelines.
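
To make the scoring-with-explanations idea concrete, here is a minimal sketch of prompting an LLM to grade an essay against a rubric and return a score plus a short rationale. The rubric text, the JSON reply format, and the call_llm placeholder are illustrative assumptions, not the paper's actual prompts or code.

import json

RUBRIC = """Score the essay from 0 to 10.
Criteria: task response, coherence and logical structure, lexical range, and grammatical accuracy."""

def build_grading_prompt(essay: str) -> str:
    # Combine the rubric, the essay, and an instruction to answer in JSON.
    return (
        f"{RUBRIC}\n\n"
        f"Essay:\n{essay}\n\n"
        'Reply with JSON: {"score": <integer>, "explanation": "<one short paragraph>"}'
    )

def grade_essay(essay: str, call_llm) -> dict:
    """Ask the model for a score plus an explanation and parse the reply.

    call_llm is a placeholder for whatever chat-completion client is in use;
    it takes a prompt string and returns the model's text reply.
    """
    reply = call_llm(build_grading_prompt(essay))
    result = json.loads(reply)
    # Clamp to the rubric's range in case the model drifts outside it.
    result["score"] = max(0, min(10, int(result["score"])))
    return result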

Human-AI Collaborative Grading

Human evaluation experiments complementing this research highlight the collaborative strengths of AI and human graders. The paper reports that LLM-generated feedback can significantly improve the grading accuracy of novices, bringing their performance close to that of expert graders, and that experts maintain greater scoring consistency and efficiency with the AI's assistance; the gains are most pronounced for essays on which the model has lower confidence. This finding is pivotal because it illustrates that AI-generated feedback does not merely replace the human element but enhances it, promoting a synergy that could redefine educational assessment.
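
As a rough illustration of that collaboration, the sketch below keeps high-confidence machine scores and routes low-confidence ones to human graders. The confidence estimate (agreement across repeated gradings), the threshold, and every name here are illustrative assumptions rather than the paper's actual design.

from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class GradedEssay:
    essay_id: str
    model_scores: list  # scores from several sampled gradings of the same essay

    @property
    def score(self) -> float:
        # Final machine score: the mean of the sampled gradings.
        return mean(self.model_scores)

    @property
    def confidence(self) -> float:
        # Lower spread across samples means higher confidence (1.0 = perfect agreement).
        return 1.0 / (1.0 + pstdev(self.model_scores))

def triage(essays, threshold=0.5):
    """Accept high-confidence machine scores; queue the rest for human review."""
    auto_scored, needs_human = [], []
    for essay in essays:
        (auto_scored if essay.confidence >= threshold else needs_human).append(essay)
    return auto_scored, needs_human

For example, an essay graded [7, 7, 7] across samples would be accepted automatically, while one graded [4, 8, 6] would be sent to a human reviewer.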

Conclusion and Future Directions

In conclusion, the research positions LLMs as formidable allies in language education and, specifically, in automated essay scoring. Integrating these tools makes the grading process more effective while supporting educators and learners in a more personalized way. It also opens a new dialogue on the future of educational technology, where the boundaries of AI assistance continue to expand, offering a nuanced model of support for both students and teachers.

As the field of LLMs continues to evolve, the possibilities for refashioning educational tools and methodologies are vast. Further investigation is warranted to explore and understand the full scope of LLMs' abilities and to refine their collaborative roles within diverse educational settings. This research paves the way for future studies aimed at unraveling the nuanced dynamics of human-AI interactions and their implications for pedagogy and learning experiences.

Authors (7)
  1. Changrong Xiao
  2. Wenxing Ma
  3. Sean Xin Xu
  4. Kunpeng Zhang
  5. Yufang Wang
  6. Qi Fu
  7. Qingping Song