LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction (2310.05191v2)
Abstract: In English as a Foreign Language (EFL) writing education, an LLM-as-a-tutor can assist students by providing real-time feedback on their essays. However, assessing LLM-as-a-tutor is challenging because the standards for educational use differ from those for general use. To bridge this gap, we integrate pedagogical principles into the assessment of student-LLM interaction. First, we explore how LLMs can function as English tutors that provide effective essay feedback tailored to students. Second, we propose three metrics, designed specifically for EFL writing education and emphasizing pedagogical aspects, to evaluate LLM-as-a-tutor. In this process, EFL experts evaluate the quality and characteristics of the feedback from LLM-as-a-tutor, while EFL learners assess their learning outcomes from interacting with it. This approach lays the groundwork for developing LLM-as-a-tutor systems tailored to the needs of EFL learners, advancing the effectiveness of writing education in this context.