
Assessing the efficacy of large language models in generating accurate teacher responses (2307.04274v1)

Published 9 Jul 2023 in cs.CL and cs.LG

Abstract: (Tack et al., 2023) organized the shared task hosted by the 18th Workshop on Innovative Use of NLP for Building Educational Applications on generation of teacher language in educational dialogues. Following the structure of the shared task, in this study, we attempt to assess the generative abilities of LLMs in providing informative and helpful insights to students, thereby simulating the role of a knowledgeable teacher. To this end, we present an extensive evaluation of several benchmarking generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT. Additionally, to optimize for pedagogical quality, we fine-tuned the Flan-T5 model using reinforcement learning. Our experimental findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT. We hypothesize that several dataset characteristics, including sampling, representativeness, and dialog completeness, pose significant challenges to fine-tuning, thus contributing to the poor generalizability of the fine-tuned models. Finally, we note the need for these generative models to be evaluated with a metric that relies not only on dialog coherence and matched language modeling distribution but also on the model's ability to showcase pedagogical skills.
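One of the headline metrics above, BERTScore, compares a generated teacher response against a reference by greedily matching tokens via cosine similarity over contextual embeddings. A minimal NumPy sketch of that scoring idea follows; random vectors stand in for actual BERT embeddings, and the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Simplified BERTScore: greedy cosine-similarity matching
    between candidate and reference token embedding matrices."""
    # Normalize rows so dot products become cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (n_cand, n_ref) similarity matrix
    precision = sim.max(axis=1).mean()  # best match for each candidate token
    recall = sim.max(axis=0).mean()     # best match for each reference token
    return 2 * precision * recall / (precision + recall)

# Toy embeddings standing in for contextual BERT vectors.
rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 8))
print(round(bertscore_f1(cand, cand), 4))  # identical inputs score 1.0
```

The real metric additionally applies inverse-document-frequency weighting and rescaling against a baseline; this sketch keeps only the greedy-matching core that makes BERTScore less brittle than exact n-gram overlap.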

References (33)
  1. Dialogue systems for language learning: Chatbots and beyond. In The Routledge handbook of second language acquisition and technology, pages 121–135. Routledge.
  2. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  4. The teacher-student chatroom corpus version 2: more lessons, new annotation, automatic detection of sequence shifts. In Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning, pages 23–35.
  5. The teacher-student chatroom corpus. arXiv preprint arXiv:2011.07109.
  6. ConvoKit: A toolkit for the analysis of conversations. arXiv preprint arXiv:2005.04246.
  7. Predictors of student satisfaction: A large-scale study of human-human online tutorial dialogues. International Educational Data Mining Society.
  8. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  9. Teacher coaching in a simulated environment. Educational evaluation and policy analysis, 42(2):208–231.
  10. James Collins. 1982. Discourse style, classroom interaction and differential treatment. Journal of reading behavior, 14(4):429–437.
  11. Measuring conversational uptake: A case study on student-teacher interactions. arXiv preprint arXiv:2106.03873.
  12. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
  13. Daniel Jarratt. 2023. ChatGPT: The double-edged sword of AI in education.
  14. National center for teacher effectiveness main study.
  15. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.
  16. Reward augmented maximum likelihood for neural structured prediction. Advances In Neural Information Processing Systems, 29.
  17. Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. arXiv preprint arXiv:1804.06451.
  18. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  19. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  20. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241.
  21. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
  22. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
  23. BLOOM: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  24. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  25. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349.
  26. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.
  27. The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, page to appear, Toronto, Canada. Association for Computational Linguistics.
  28. Anaïs Tack and Chris Piech. 2022. The AI teacher test: Measuring the pedagogical ability of Blender and GPT-3 in educational dialogues. arXiv preprint arXiv:2205.07540.
  29. OpenAI Team. 2022. ChatGPT: Optimizing language models for dialogue.
  30. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  31. Are we there yet?-a systematic literature review on chatbots in education. Frontiers in artificial intelligence, 4:654924.
  32. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  33. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
Authors (4)
  1. Yann Hicke
  2. Abhishek Masand
  3. Wentao Guo
  4. Tushaar Gangavarapu
Citations (9)