Comparison of Large Language Models for Generating Contextually Relevant Questions (2407.20578v2)
Abstract: This study explores the effectiveness of Large Language Models (LLMs) for Automatic Question Generation in educational settings. Three LLMs are compared in their ability to create questions from university slide text without fine-tuning. Questions were obtained in a two-step pipeline: first, answer phrases were extracted from slides using Llama 2-Chat 13B; then, each of the three models generated a question for each answer. To analyze whether the questions would be suitable in educational applications for students, a survey was conducted with 46 students who evaluated a total of 246 questions across five metrics: clarity, relevance, difficulty, slide relation, and question-answer alignment. Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform Flan-T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment, and that GPT-3.5 excels at tailoring questions to match the input answers. The contribution of this research is an analysis of the capacity of LLMs for Automatic Question Generation in education.
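The abstract describes a two-step pipeline (answer-phrase extraction, then question generation per answer). The following is a minimal sketch of that structure; the prompts, function names, and the `CompletionFn` interface are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a two-step AQG pipeline: extract answer phrases from slide text,
# then generate one question per answer with each candidate model.
# Prompt wording and the completion interface are hypothetical.

from typing import Callable, Dict, List

# A text-completion callable: takes a prompt, returns the model's text output.
CompletionFn = Callable[[str], str]


def extract_answer_phrases(slide_text: str, extractor: CompletionFn) -> List[str]:
    """Step 1: ask the extractor model (Llama 2-Chat 13B in the paper)
    for short answer phrases found in the slide text."""
    prompt = (
        "Extract short answer phrases suitable as quiz answers from the "
        f"following lecture slide text, one per line:\n\n{slide_text}"
    )
    raw = extractor(prompt)
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]


def generate_question(slide_text: str, answer: str, generator: CompletionFn) -> str:
    """Step 2: ask a generator model (GPT-3.5, Llama 2-Chat 13B, or
    Flan-T5 XXL in the paper) for a question whose answer is `answer`."""
    prompt = (
        f"Slide text:\n{slide_text}\n\n"
        f"Write one clear question whose answer is: {answer}"
    )
    return generator(prompt).strip()


def run_pipeline(
    slide_text: str,
    extractor: CompletionFn,
    generators: Dict[str, CompletionFn],
) -> Dict[str, List[Dict[str, str]]]:
    """Run extraction once, then question generation with every model."""
    answers = extract_answer_phrases(slide_text, extractor)
    return {
        name: [
            {"answer": a, "question": generate_question(slide_text, a, gen)}
            for a in answers
        ]
        for name, gen in generators.items()
    }


if __name__ == "__main__":
    # Stub completion functions so the sketch runs without any model access.
    stub_extractor: CompletionFn = lambda p: "- spaced repetition\n- retrieval practice"
    stub_generator: CompletionFn = lambda p: "Which study technique is described here?"
    result = run_pipeline(
        "Slide: spaced repetition improves long-term retention.",
        stub_extractor,
        {"gpt-3.5": stub_generator, "flan-t5-xxl": stub_generator},
    )
    print(result)
```

In practice the stub callables would be replaced by API or local-inference wrappers around the respective models; the per-model output can then be rated on the survey metrics listed above.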
- Ivo Lodovico Molina
- Valdemar Švábenský
- Tsubasa Minematsu
- Li Chen
- Fumiya Okubo
- Atsushi Shimada