The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues (2306.06941v1)

Published 12 Jun 2023 in cs.CL

Abstract: This paper describes the results of the first shared task on the generation of teacher responses in educational dialogues. The goal of the task was to benchmark the ability of generative LLMs to act as AI teachers, replying to a student in a teacher-student dialogue. Eight teams participated in the competition hosted on CodaLab. They experimented with a wide variety of state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2, Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions were automatically scored using BERTScore and DialogRPT metrics, and the top three among them were further manually evaluated in terms of pedagogical ability based on Tack and Piech (2022). The NAISTeacher system, which ranked first in both automated and human evaluation, generated responses with GPT-3.5 using an ensemble of prompts and a DialogRPT-based ranking of responses for given dialogue contexts. Despite the promising achievements of the participating teams, the results also highlight the need for evaluation metrics better suited to educational contexts.
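The winning generate-then-rerank recipe described above can be sketched as follows. This is a minimal illustration, not the NAISTeacher implementation: `generate` and `score` are hypothetical stand-ins for calls to GPT-3.5 and a DialogRPT ranker, and the toy stubs below exist only so the skeleton runs end to end.

```python
from typing import Callable, List

def prompt_ensemble_rerank(
    dialogue: str,
    prompt_templates: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> str:
    """Generate one candidate teacher reply per prompt template,
    then return the candidate the ranker scores highest for this
    dialogue context (the ensemble-and-rerank idea from the paper)."""
    candidates = [generate(t.format(dialogue=dialogue)) for t in prompt_templates]
    return max(candidates, key=lambda reply: score(dialogue, reply))

# Toy stand-ins for illustration only (not the actual models):
templates = [
    "You are a patient teacher. Continue the dialogue:\n{dialogue}\nTeacher:",
    "Reply with a guiding question:\n{dialogue}\nTeacher:",
]
fake_generate = lambda prompt: prompt.splitlines()[0]  # echoes the instruction line
fake_score = lambda ctx, reply: float(len(reply))      # pretend longer is better

best = prompt_ensemble_rerank(
    "Student: I don't get fractions.", templates, fake_generate, fake_score
)
print(best)
```

In the real system the ranker plays the role `fake_score` plays here: DialogRPT assigns each (context, reply) pair a human-feedback-trained score, and only the top-ranked reply is submitted.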

References (29)
  1. Adaeze Adigwe and Zheng Yuan. 2023. The ADAIO system at the BEA-2023 Shared Task: Generating AI teacher responses in educational dialogues. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, to appear, Toronto, Canada. Association for Computational Linguistics.
  2. RETUYT-InCo at BEA 2023 Shared Task: Tuning open-source LLMs for generating teacher responses. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, to appear, Toronto, Canada. Association for Computational Linguistics.
  3. Dialogue systems for language learning: A meta-analysis. Language Learning & Technology, 26(1):1–24.
  4. On the opportunities and risks of foundation models. Technical report, Stanford University, Center for Research on Foundation Models (CRFM).
  5. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates.
  6. The Teacher-Student Chatroom Corpus version 2: More lessons, new annotation, automatic detection of sequence shifts. In Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning, pages 23–35, Louvain-la-Neuve, Belgium. LiU Electronic Press.
  7. The teacher-student chatroom corpus. In Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning, pages 10–20, Gothenburg, Sweden. LiU Electronic Press.
  8. Scaling instruction-finetuned language models. arXiv:2210.11416.
  9. Dialogue response ranking training with large-scale human feedback data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 386–395, Online. Association for Computational Linguistics.
  10. Assessing the efficacy of large language models in generating accurate teacher responses. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, to appear, Toronto, Canada. Association for Computational Linguistics.
  11. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
  12. Enhancing educational dialogues: A reinforcement learning approach for generating AI teacher responses. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, to appear, Toronto, Canada. Association for Computational Linguistics.
  13. H. W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.
  14. BLOOM: A 176B-parameter open-access multilingual language model. arXiv:2211.05100.
  15. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv:1909.03087.
  16. Amin Omidvar and Aijun An. 2023. Empowering conversational agents using semantic in-context learning. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, to appear, Toronto, Canada. Association for Computational Linguistics.
  17. CodaLab Competitions: An open source platform to organize scientific challenges. Technical report.
  18. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems 34 Pre-Proceedings (NeurIPS 2021), pages 1–35.
  19. Language models are unsupervised multitask learners. OpenAI blog.
  20. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  21. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. arXiv:2210.01241.
  22. Anaïs Tack and Chris Piech. 2022. The AI Teacher Test: Measuring the pedagogical ability of Blender and GPT-3 in educational dialogues. In Proceedings of the 15th International Conference on Educational Data Mining, volume 15, pages 522–529, Durham, United Kingdom. International Educational Data Mining Society.
  23. Stanford Alpaca: An instruction-following LLaMA model. GitHub.
  24. LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
  25. NAISTeacher: A prompt and rerank approach to generating teacher utterances in educational dialogues. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, to appear, Toronto, Canada. Association for Computational Linguistics.
  26. Are we there yet? - a systematic literature review on chatbots in education. Frontiers in Artificial Intelligence, 4:654924.
  27. A Comprehensive Assessment of Dialog Evaluation Metrics. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, pages 15–33, Online. Association for Computational Linguistics.
  28. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, Online.
  29. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278. Association for Computational Linguistics.
Authors (5)
  1. Anaïs Tack (3 papers)
  2. Ekaterina Kochmar (33 papers)
  3. Zheng Yuan (117 papers)
  4. Serge Bibauw (2 papers)
  5. Chris Piech (33 papers)
Citations (17)