
LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction (2310.05191v2)

Published 8 Oct 2023 in cs.CL

Abstract: In the context of English as a Foreign Language (EFL) writing education, LLM-as-a-tutor can assist students by providing real-time feedback on their essays. However, challenges arise in assessing LLM-as-a-tutor due to differing standards between educational and general use cases. To bridge this gap, we integrate pedagogical principles to assess student-LLM interaction. First, we explore how LLMs can function as English tutors, providing effective essay feedback tailored to students. Second, we propose three metrics to evaluate LLM-as-a-tutor specifically designed for EFL writing education, emphasizing pedagogical aspects. In this process, EFL experts evaluate the feedback from LLM-as-a-tutor regarding quality and characteristics. On the other hand, EFL learners assess their learning outcomes from interaction with LLM-as-a-tutor. This approach lays the groundwork for developing LLMs-as-a-tutor tailored to the needs of EFL learners, advancing the effectiveness of writing education in this context.


Summary

  • The paper presents a novel AES pipeline, FABRIC, that leverages the DREsS rubric-based dataset to enable more precise essay scoring.
  • It employs the CASE augmentation strategy, which injects deliberate errors into well-written essays to create training data, improving model accuracy by 45.44%.
  • Using EssayCoT prompting, the approach generates specific, constructive feedback that advances AI-driven educational assessments.

Automated Scoring and Feedback Generation for Essays: The FABRIC Approach

The paper "FABRIC: Automated Scoring and Feedback Generation for Essays" presents a methodologically rigorous approach to automated essay scoring (AES), addressing the limitations of existing frameworks that focus predominantly on holistic scores. The authors, a team of researchers from KAIST, introduce a comprehensive pipeline for generating detailed scores and feedback that leverages NLP techniques and LLMs. The key contributions are a novel rubric-based dataset (DREsS), a data augmentation strategy for improving model performance (CASE), and enhanced feedback generation through a new prompting method (EssayCoT).

Core Components of FABRIC

1. DREsS Dataset: A significant innovation of this work is the Dataset for Rubric-based Essay Scoring (DREsS). The dataset distinguishes itself by scoring essays on three key rubrics (content, organization, and language) developed in collaboration with English education experts. It comprises 1,782 essays written by EFL learners, together with existing datasets restructured into a standardized format for consistent scoring criteria; a hypothetical sample layout appears in the first sketch after this list. DREsS lays the groundwork for training and testing rubric-based AES systems with a high degree of specificity and pedagogical relevance.

2. CASE - Corruption-based Augmentation Strategy for Essays: To enhance the robustness of AES models, the authors propose CASE, a data augmentation method that corrupts well-written essays with deliberate errors. CASE improves model accuracy by 45.44%, addressing one of the primary concerns with previous AES models: variability and limited generalizability. The method applies a distinct corruption strategy tailored to each rubric, validated through empirical analysis demonstrating significant performance improvements on DREsS; the general idea is illustrated in the first sketch after this list.

3. EssayCoT - Essay Chain-of-Thought Prompting: Addressing the feedback generation challenge, EssayCoT extends the Chain-of-Thought prompting paradigm to incorporate essay scoring. Unlike previous approaches that require extensive human-written exemplars, EssayCoT uses the rubric scores predicted by the AES model to guide feedback generation, significantly enhancing the feedback's relevance and usefulness as assessed by domain experts. This application of CoT marks a notable advance in leveraging LLMs for educational purposes, allowing more precise and constructive feedback tailored to each essay; the second sketch below shows one way such a prompt could be composed.
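
To make rubric-targeted corruption concrete, here is a minimal Python sketch. The sample schema, the 5-point scale, the specific corruption rules, and the score penalty are all illustrative assumptions, not the paper's actual implementation; the paper specifies its own corruption strategies per rubric.

```python
import random

# Hypothetical DREsS-style sample: one essay scored on the three rubrics.
# Field names and the 5-point scale are assumptions for illustration.
sample = {
    "essay": "Online classes changed how students learn. "
             "They offer flexibility. "
             "Teachers adapt materials for remote settings.",
    "scores": {"content": 4.5, "organization": 4.0, "language": 4.5},
}

def corrupt_organization(sentences, n_swaps=2):
    """Swap sentence positions to degrade organization while leaving
    content and language mostly intact."""
    out = sentences[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def corrupt_content(sentences, off_topic_pool, n_replace=1):
    """Replace sentences with off-topic ones to degrade content relevance."""
    out = sentences[:]
    for i in random.sample(range(len(out)), n_replace):
        out[i] = random.choice(off_topic_pool)
    return out

# Build a synthetic training pair: corrupted essay plus a lowered rubric
# score. The size of the score penalty is an arbitrary placeholder.
sentences = sample["essay"].split(". ")
augmented = {
    "essay": ". ".join(corrupt_organization(sentences)),
    "scores": {**sample["scores"],
               "organization": max(1.0, sample["scores"]["organization"] - 1.5)},
}
```

A language-rubric corruption (for example, injecting grammatical errors of the kinds found in learner corpora) would follow the same pattern: perturb one quality dimension, then lower the corresponding rubric score.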
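
Second, a minimal sketch of score-conditioned prompting in the spirit of EssayCoT: the AES model's predicted rubric scores are placed in the prompt ahead of the feedback request, so the LLM grounds its feedback in the scores rather than starting from scratch. The prompt wording and score format are assumptions; the paper's actual prompt may differ.

```python
def build_essaycot_prompt(essay, scores):
    """Condition the feedback request on predicted rubric scores
    (EssayCoT-style). The wording here is illustrative only."""
    rubric_lines = "\n".join(f"- {name}: {value}" for name, value in scores.items())
    return (
        "You are an English writing tutor for EFL learners.\n\n"
        f"Essay:\n{essay}\n\n"
        f"Predicted rubric scores:\n{rubric_lines}\n\n"
        "Given these scores, identify the essay's main weaknesses in content, "
        "organization, and language, then provide specific, constructive "
        "suggestions for improvement."
    )

prompt = build_essaycot_prompt(
    "Online classes changed how students learn. They offer flexibility.",
    {"content": 4.5, "organization": 4.0, "language": 4.5},
)
print(prompt)
```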

Experimental and Theoretical Implications

Quantitative evaluation of the FABRIC pipeline shows strong alignment with rubric-based scoring and feedback objectives. The quadratic weighted kappa (QWK) scores achieved across various augmentation settings demonstrate the effectiveness of the proposed methods. Beyond the immediate improvements in essay evaluation, the work opens several avenues for further exploration in AI-driven educational assistance. By integrating AES models that go beyond traditional holistic scoring with feedback mechanisms that offer granular insight into specific writing qualities, FABRIC sets a new benchmark for automated language assessment tools.
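
For reference, QWK is the standard agreement metric in AES and can be computed with scikit-learn's cohen_kappa_score. The scores below are made-up labels for illustration only, not results from the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic weighting penalizes large disagreements between human and
# model scores more heavily than near-misses. Labels must be discrete,
# so continuous rubric scores are typically binned to integers first.
human_scores = [4, 3, 5, 2, 4, 3, 5, 1]
model_scores = [4, 3, 4, 2, 5, 3, 5, 2]
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```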

Future Directions

This paper's contributions and evaluations point toward potential enhancements in AES applications, notably in LLM-driven educational platforms. Incorporating human-in-the-loop strategies could further increase the customization and applicability of feedback, aligning it more closely with individual learning trajectories. Refining the explainability and transparency of AI models in education could likewise bolster their acceptance and trustworthiness among educators and learners. Expanding this research into multilingual contexts and other educational domains also remains promising, inviting further interdisciplinary collaboration.

In summary, FABRIC presents an integrated approach to AES that not only refines scoring accuracy but also introduces a novel framework for generating insightful feedback. This work enhances the practical utility of AES models in instructional settings and lays the groundwork for subsequent research in AI-facilitated learning environments.
