Automatic Generation of Inference Making Questions for Reading Comprehension Assessments (2506.08260v1)

Published 9 Jun 2025 in cs.CL and cs.AI

Abstract: Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.

Summary

  • The paper proposes an automated method that generates inference RC questions using a tailored taxonomy and few-shot chain-of-thought prompts.
  • It achieves a high overall item quality of 93.8% while struggling with precise alignment to targeted inference types at only 42.6%.
  • The approach offers scalable educational assessment support that can reduce teacher workload and guide improvements in AI prompting strategies.

Automatic Generation of Inference Making Questions for Reading Comprehension Assessments

This paper, authored by Wanjing Anya Ma, Michael Flor, and Zuowei Wang, investigates the automatic generation of reading comprehension (RC) questions targeting inference-making skills. The paper stresses that inference making is central to RC, encompassing pronominal-reference resolution, text-connecting inferences, and gap-filling inferences that draw on prior knowledge. The authors argue that diagnostic questions targeting these inference types can help educators deliver targeted interventions to improve students' comprehension.

The authors introduce a taxonomy of inference types tailored for RC assessments and employ GPT-4o to automate the generation of bridging-inference RC items. Items are generated via few-shot prompting, comparing conditions with and without chain-of-thought (CoT) prompts. Each generated item is evaluated on three criteria: overall item quality, the appropriateness of the inference type, and the reasoning provided by the LLM.
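The paper does not reproduce its prompts, but the described setup (few-shot prompting of GPT-4o with an optional chain-of-thought instruction, returning an item plus the model's reasoning) can be illustrated with a minimal sketch. The prompt wording, exemplar passage, and function below are hypothetical placeholders, not the authors' materials.

```python
# Minimal sketch of few-shot item generation with an optional chain-of-thought
# instruction. The prompt text and exemplars are illustrative placeholders,
# not the prompts used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = """\
Passage: Maria left her umbrella at home. By noon, her shoes were soaked.
Question (bridging inference): Why were Maria's shoes soaked?
"""

def generate_item(passage: str, use_cot: bool = True) -> str:
    instructions = (
        "You write reading-comprehension questions that require a bridging "
        "inference, i.e., connecting information across sentences."
    )
    if use_cot:
        instructions += (
            " First explain, step by step, which sentences must be connected "
            "and what inference links them; then write the question."
        )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": FEW_SHOT_EXAMPLES + "\nPassage: " + passage},
        ],
    )
    return response.choices[0].message.content
```

In the CoT condition, the model's step-by-step explanation is the "LLM reasoning" that the human raters evaluated alongside the question itself.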

Key findings show that GPT-4o produced questions suitable for operational use in grade 3-12 educational contexts, with 93.8% of items rated acceptable on overall quality. However, only 42.6% of the questions matched the targeted inference type, pointing to the difficulty LLMs have in generating questions that precisely target a specified inference category. While GPT-4o is adept at producing generally high-quality items, its ability to distinguish between inference categories and generate questions accordingly remains a limitation.
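For reference, agreement and acceptance figures of this kind are typically computed from paired human ratings, e.g., Cohen's kappa or percent agreement over binary judgments. A minimal sketch, assuming two raters' binary quality judgments and a per-item flag for inference-type match (all data and variable names are illustrative, not from the paper's item bank):

```python
# Sketch of how inter-rater agreement and acceptance/alignment rates of this
# kind are commonly computed; the ratings below are made up for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 1, 1]  # 1 = item judged good quality by rater A
rater_b = [1, 1, 0, 1, 0, 1]  # 1 = item judged good quality by rater B

kappa = cohen_kappa_score(rater_a, rater_b)      # inter-rater agreement
quality_rate = sum(rater_a) / len(rater_a)       # share of acceptable items

type_match = [1, 0, 1, 0, 0, 1]  # 1 = item matches the targeted inference type
alignment_rate = sum(type_match) / len(type_match)

print(f"kappa={kappa:.2f}, quality={quality_rate:.1%}, alignment={alignment_rate:.1%}")
```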

The implications of this research are twofold. Practically, AI-generated RC questions of this quality could reduce the item-writing workload on educators and assessment developers, offering a scalable path for ongoing assessment development. Theoretically, the results highlight the difficulty LLMs still have in recognizing and categorizing fine-grained linguistic distinctions, motivating further work on prompting strategies and on the reasoning capabilities of AI models.

Future work should explore several aspects: enhancing model training with more diverse examples to improve inference-type accuracy, employing different LLM architectures to benchmark performance, and integrating user feedback from real-world deployments to refine the approach. Additionally, investigating the use of LLM-generated reasoning processes could provide insights into improving coherence and contextual understanding within AI-generated educational materials.

This paper contributes to advancing AI-driven educational assessments, underscoring the complexity involved in automatically generating nuanced RC questions and setting a foundation for future inquiry in optimizing LLMs for targeted educational tasks.
