- The paper proposes an automated method that generates inference-making reading comprehension (RC) questions with GPT-4o, using a tailored inference taxonomy and few-shot chain-of-thought prompts.
- Generated items reach a high general quality acceptance rate of 93.8%, but align with the targeted inference type only 42.6% of the time.
- The approach offers scalable educational assessment support that can reduce teacher workload and guide improvements in AI prompting strategies.
Automatic Generation of Inference Making Questions for Reading Comprehension Assessments
This paper, authored by Wanjing Anya Ma, Michael Flor, and Zuowei Wang, investigates the automatic generation of reading comprehension (RC) questions targeting inference-making skills. The paper emphasizes the central role of inference making in RC, which involves resolving pronominal references, making text-connecting inferences, and filling gaps using prior knowledge. The authors propose that diagnostic questions targeting these inference types can help educators provide targeted interventions to improve students' comprehension.
The authors introduce a taxonomy of inference types tailored for RC assessments and employ GPT-4o to automate the generation of bridging-inference RC items. The paper utilizes few-shot prompting and examines the efficacy of chain-of-thought (CoT) prompts in generating high-quality questions. Each generated item is evaluated based on three criteria: general item quality, the appropriateness of the inference type, and the reasoning provided by the LLM.
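The paper does not reproduce its exact prompts, but the general few-shot CoT setup it describes can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example assuming the OpenAI Python SDK; the prompt wording, worked example, and taxonomy labels are illustrative stand-ins, not the authors' actual materials.

```python
# Illustrative sketch only: the prompt text, few-shot example, and taxonomy
# labels below are hypothetical stand-ins, not the authors' actual materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an assessment developer writing reading comprehension questions "
    "for grades 3-12. Generate one multiple-choice question that requires the "
    "targeted inference type. Think step by step: first explain the inference "
    "a reader must make, then write the question, options, and answer key."
)

# One worked (few-shot) example pairing a passage and target inference type
# with chain-of-thought reasoning and a finished item.
FEW_SHOT_EXAMPLE = """Passage: Maya left her umbrella at home. When she got to school, her coat was soaked.
Target inference type: gap-filling (prior knowledge)
Reasoning: The reader must use prior knowledge to infer that it rained, since the text never states it.
Question: Why was Maya's coat soaked when she arrived at school?
A) She spilled water on it  B) It was raining  C) She washed it  D) She fell in a puddle
Answer: B"""

def generate_item(passage: str, inference_type: str) -> str:
    """Ask GPT-4o for a CoT-reasoned RC item targeting one inference type."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": FEW_SHOT_EXAMPLE},
            {
                "role": "user",
                "content": (
                    f"Passage: {passage}\nTarget inference type: {inference_type}\n"
                    "Produce Reasoning, Question, options, and Answer in the same format."
                ),
            },
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_item(
        "Tom handed the keys to his sister before boarding the train. "
        "She waved until it disappeared.",
        "pronominal reference",
    ))
```

Keeping the reasoning in the model's output mirrors the paper's third evaluation criterion, which judges the LLM-provided reasoning alongside the item itself.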
Key findings reveal that GPT-4o produced questions suitable for operational use in grade 3-12 educational contexts, achieving a general item quality acceptance rate of 93.8%. However, only 42.6% of questions aligned with the targeted inference type, pointing to the difficulty LLMs have in generating questions that precisely target a specific inference type. This suggests that while LLMs are adept at producing high-quality items in general, their ability to discern distinct inference categories and generate questions accordingly remains a limitation.
The implications of this research are multifaceted. Practically, the ability to generate high-quality RC questions using AI has the potential to reduce the workload on educators, providing scalable solutions for ongoing assessment development. Theoretically, it highlights the nuanced challenges that LLMs face in understanding and categorizing linguistic subtleties, urging further exploration into improving prompting strategies and reasoning capabilities of AI models.
Future work should explore several aspects: enhancing model training with more diverse examples to improve inference-type accuracy, employing different LLM architectures to benchmark performance, and integrating user feedback from real-world deployments to refine the approach. Additionally, investigating the use of LLM-generated reasoning processes could provide insights into improving coherence and contextual understanding within AI-generated educational materials.
This paper contributes to advancing AI-driven educational assessments, underscoring the complexity involved in automatically generating nuanced RC questions and setting a foundation for future inquiry in optimizing LLMs for targeted educational tasks.