From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns (2506.15598v1)

Published 18 Jun 2025 in cs.CL and cs.AI

Abstract: While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention -- particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.

Summary

Expert Evaluation of AI-Generated Multiple-Choice Questions for Portuguese Reading Comprehension

The paper "From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns" explores the potential of generative models in automating the creation of Multiple-Choice Questions (MCQs) in Portuguese. This research fills a critical gap, as most prior studies have emphasized English, leaving other languages relatively unexplored.

Overview

The traditional process of crafting high-quality MCQs is often laborious and resource-intensive, particularly when aiming to achieve varying levels of difficulty and incorporate narrative elements. This paper investigates whether state-of-the-art generative models can effectively automate this process for the Portuguese language, focusing on text comprehension for elementary school students.

Methodology

Two methods for generating MCQs were employed:

  1. One-Step Generation: LLMs such as GPT-4 generated MCQs directly from a single prompt containing the narrative text, a task description, and the desired output format (a minimal prompt sketch follows this list).
  2. Two-Step Generation: A smaller bilingual model first produced a "wh-question"; an LLM then expanded it into a complete MCQ with the correct answer and distractors.
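
The paper does not publish its prompts or decoding settings; the sketch below only illustrates the one-step approach, assuming the OpenAI Python SDK. The prompt wording, function name, parameters, and JSON output format are hypothetical, not the authors' actual configuration.

```python
# Minimal sketch of one-step MCQ generation (illustrative only).
# Assumes the OpenAI Python SDK; the prompt wording, JSON schema,
# model name, and parameters are hypothetical, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_mcq(narrative: str, element: str, difficulty: str) -> str:
    """Request one MCQ targeting a narrative element at a given difficulty."""
    prompt = (
        "You are writing reading-comprehension questions in Portuguese "
        "for elementary school students.\n\n"
        f"Narrative text:\n{narrative}\n\n"
        f"Write ONE multiple-choice question about the story's {element} "
        f"at {difficulty} difficulty, with four options (A-D): one correct "
        "answer and three plausible distractors.\n"
        "Return JSON with keys: question, options, answer, difficulty."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(generate_mcq(story_text, element="setting", difficulty="easy"))
```

In the two-step variant, the single prompt would be replaced by a wh-question generation pass with the smaller bilingual model, followed by a second call that expands that question into a full MCQ with answer and distractors.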

For evaluation, the generated MCQs were compared against human-authored questions using both expert reviews and psychometric analysis derived from student responses.

Key Findings

Expert Review:

  • Human-authored MCQs were generally somewhat stronger than model-generated ones in clarity and contextual alignment.
  • Semantic errors occasionally appeared in both human-authored and AI-generated options.
  • Despite these minor discrepancies, experts judged the generated MCQs to be broadly comparable to human-crafted questions in quality and difficulty.

Psychometric Analysis:

  • Difficulty and discrimination indices derived from student responses showed that AI-generated questions performed comparably to human-authored ones (standard definitions of these indices are sketched below).
  • Distractors in LLM-generated MCQs engaged students roughly as effectively as human-written ones, although, as the abstract notes, consistently meeting established criteria for high-quality option design remains a challenge.
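
The paper does not spell out its psychometric formulas; under classical test theory (a standard assumption for difficulty and discrimination indices), item difficulty is the proportion of correct responses and discrimination is the corrected item-total correlation. A minimal sketch:

```python
# Classical test theory indices from a binary response matrix
# (rows = students, columns = items). These are the standard
# definitions, assumed here rather than taken from the paper.
import numpy as np

def item_statistics(responses: np.ndarray):
    """Return per-item difficulty and discrimination indices."""
    # Difficulty: proportion of students answering each item correctly
    # (higher = easier).
    difficulty = responses.mean(axis=0)

    # Discrimination: corrected item-total correlation, i.e. the
    # correlation between an item and the total score excluding it.
    totals = responses.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
        for j in range(responses.shape[1])
    ])
    return difficulty, discrimination

# Toy example: 5 students x 3 items (1 = correct, 0 = incorrect).
resp = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
diff, disc = item_statistics(resp)
print("difficulty:", diff)        # [0.6 0.4 0.8]
print("discrimination:", disc)
```

Items with difficulty near 0 or 1, or with discrimination near zero, are the usual candidates for revision.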

Model-Annotated Difficulty:

  • Difficulty ratings assigned by the models correlated meaningfully with expert reviews and with observed student performance, particularly when the rating was assigned after generation rather than during it (a rank-correlation sketch follows).
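
The paper's exact agreement statistic is not stated; one common choice (an assumption here, with made-up numbers) is a rank correlation between model-assigned difficulty labels and the empirical difficulty index:

```python
# Rank agreement between model-assigned difficulty labels and the
# empirical difficulty index; spearmanr is SciPy, the data are made up.
from scipy.stats import spearmanr

# Hypothetical per-item values: model labels as ordinal ranks
# (easy=1, medium=2, hard=3) and proportion of students answering
# correctly (lower proportion = empirically harder item).
model_labels = [1, 2, 3, 2, 1, 3]
prop_correct = [0.85, 0.60, 0.30, 0.55, 0.80, 0.40]

# Negate proportion correct so both scales run from easy to hard.
rho, p_value = spearmanr(model_labels, [-p for p in prop_correct])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```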

Implications and Future Directions

This research suggests that LLMs such as GPT-4 can be used effectively to generate educational assessment tools like MCQs in underrepresented languages such as Portuguese. However, the paper underscores that:

  • Additional research is needed to refine model performance, such as reducing semantic errors.
  • Further exploration of difficulty assignment during the generation process is warranted.
  • These models could play a significant role in adaptive learning platforms that personalize assessment to individual student needs.

In conclusion, the potential of generative models in creating high-quality, multilingual educational resources is evident, offering considerable benefits for educators by reducing workload and enhancing resource accessibility in diverse linguistic settings.
