Expert Evaluation of AI-Generated Multiple-Choice Questions for Portuguese Reading Comprehension
The paper "From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns" explores the potential of generative models in automating the creation of Multiple-Choice Questions (MCQs) in Portuguese. This research fills a critical gap, as most prior studies have emphasized English, leaving other languages relatively unexplored.
Overview
The traditional process of crafting high-quality MCQs is often laborious and resource-intensive, particularly when aiming to achieve varying levels of difficulty and incorporate narrative elements. This paper investigates whether state-of-the-art generative models can effectively automate this process for the Portuguese language, focusing on text comprehension for elementary school students.
Methodology
Two methods for generating MCQs were employed:
- One-Step Generation: This approach used LLMs such as GPT-4 to generate MCQs directly from a single prompt containing the narrative text, a task description, and the desired output format.
- Two-Step Generation: This method first used a smaller, bilingual model to generate a "wh-question" from the text, then prompted an LLM to complete the MCQ with the correct answer and distractors (a minimal sketch of both approaches follows this list).
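To make the two pipelines concrete, here is a minimal sketch of how they could be wired up in Python. The prompt wording, the `query_llm` and `query_small_bilingual_model` helpers, and the output format are illustrative assumptions, not the paper's actual prompts or models.

```python
# Minimal sketch of the two prompting strategies described above.
# `query_llm` and `query_small_bilingual_model` are hypothetical placeholders
# for whichever models/APIs are actually used; the prompt text is illustrative.

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a large generative model (e.g. GPT-4)."""
    raise NotImplementedError

def query_small_bilingual_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a smaller bilingual question-generation model."""
    raise NotImplementedError

def one_step_mcq(narrative_text: str) -> str:
    # Single prompt: narrative text + task description + desired output format.
    prompt = (
        "Read the following Portuguese narrative and write one multiple-choice "
        "reading-comprehension question for elementary school students.\n"
        "Return the question, four options (A-D), and mark the correct answer.\n\n"
        f"Text:\n{narrative_text}"
    )
    return query_llm(prompt)

def two_step_mcq(narrative_text: str) -> str:
    # Step 1: a smaller bilingual model drafts a wh-question from the text.
    wh_question = query_small_bilingual_model(
        f"Generate one wh-question about this text:\n{narrative_text}"
    )
    # Step 2: the larger model completes the MCQ (correct answer + distractors).
    prompt = (
        "Given the text and the question below, provide the correct answer and "
        "three plausible distractors, formatted as options A-D.\n\n"
        f"Text:\n{narrative_text}\n\nQuestion:\n{wh_question}"
    )
    return query_llm(prompt)
```

The structural difference between the two pipelines is the intermediate wh-question: in the two-step variant it fixes what the larger model must answer before the options are written.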
For evaluation, the generated MCQs were compared against human-authored questions using both expert reviews and psychometric analysis derived from student responses.
Key Findings
Expert Review:
- Human-authored MCQs were generally rated slightly higher for clarity and contextual alignment than model-generated ones.
- Semantic errors occasionally appeared in both human-authored and AI-generated options.
- Despite these minor discrepancies, the expert reviewers judged generated MCQs to be often comparable to human-crafted questions, particularly in clarity and difficulty.
Psychometric Analysis:
- Difficulty and discrimination indices derived from student responses showed that AI-generated questions performed comparably to human-authored ones (a sketch of these indices follows this list).
- Distractors in LLM-generated MCQs attracted student responses about as effectively as those written by humans.
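The difficulty and discrimination indices mentioned above are standard classical test theory statistics: difficulty as the proportion of students answering an item correctly, and discrimination as the point-biserial correlation between the item score and the total test score. The sketch below shows one common way to compute them; the toy data and function names are illustrative, and the paper's exact psychometric procedure may differ.

```python
import numpy as np

def item_difficulty(item_scores: np.ndarray) -> float:
    """Classical difficulty index: proportion of students answering the item correctly.

    `item_scores` is a 0/1 vector with one entry per student."""
    return float(np.mean(item_scores))

def item_discrimination(item_scores: np.ndarray, total_scores: np.ndarray) -> float:
    """Point-biserial discrimination: correlation between the item score (0/1)
    and each student's total test score."""
    return float(np.corrcoef(item_scores, total_scores)[0, 1])

# Toy example: rows = students, columns = items (1 = correct answer).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
totals = responses.sum(axis=1)
for j in range(responses.shape[1]):
    print(f"item {j}: difficulty={item_difficulty(responses[:, j]):.2f}, "
          f"discrimination={item_discrimination(responses[:, j], totals):.2f}")
```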
Model-Annotated Difficulty:
- The difficulty ratings assigned by models correlated meaningfully with expert reviews and student performance, particularly when the rating was assigned after generation rather than during it.
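One way such agreement could be checked is a rank correlation between the model-assigned difficulty and the empirical difficulty observed in student responses. The sketch below uses Spearman's rho with made-up numbers purely for illustration; the paper's actual analysis may use different statistics.

```python
from scipy.stats import spearmanr

# Hypothetical example: model-assigned difficulty ratings (1 = easy ... 3 = hard)
# and the empirical difficulty observed from student responses (proportion correct).
model_difficulty = [1, 2, 3, 2, 3, 1]
proportion_correct = [0.85, 0.60, 0.35, 0.55, 0.40, 0.90]

# Harder items should have a lower proportion correct, so a strong
# negative rank correlation indicates agreement.
rho, p_value = spearmanr(model_difficulty, proportion_correct)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```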
Implications and Future Directions
This research suggests that LLMs like GPT-4 can be used effectively to generate educational assessment tools such as MCQs in underrepresented languages like Portuguese. However, the paper underscores that:
- Additional research is needed to refine model performance, such as reducing semantic errors.
- Further exploration of difficulty assignment during the generation process is warranted.
- The potential for employing these models in adaptive learning platforms, offering personalized education based on student needs, could be significant.
In conclusion, generative models show clear potential for creating high-quality, multilingual educational resources, offering considerable benefits for educators by reducing workload and improving access to materials in diverse linguistic settings.