Reading Comprehension Exercise Generation
- RCEG is defined as the automated creation of reading comprehension tasks, such as questions, answers, and distractors, in support of literacy assessment and instruction.
- It leverages transformer-based models and modular pipelines to control skill, difficulty, and content coverage across multiple exercise types.
- Current trends address personalization, multilingual support, and rigorous evaluation using automated metrics and human expert review.
Reading Comprehension Exercise Generation (RCEG) comprises the automated creation of reading comprehension tasks—including questions, answers, and distractors—given input passages. These tasks target assessment, instruction, and research in literacy and language learning. Over the past decade, RCEG has advanced from pattern-based pipelines to transformer-based LLMs with controllability for skill, difficulty, and content coverage, supporting open-ended, fill-in-the-blank, and multiple-choice formats across diverse languages and reading levels.
1. Problem Formulation and Scope
RCEG is formally defined as the process of mapping an input document or passage $D$ to a set of reading comprehension exercises $E = \{e_1, \ldots, e_n\}$, where each $e_i$ may be a question-answer pair, a multiple-choice question (MCQ), or another test item type, together with supporting distractors when required. The system is required to jointly maximize several objectives: coverage of key content elements, diversity of question types, appropriateness of difficulty or skill calibration, and syntactic, semantic, and pedagogical quality (Yang et al., 30 Jul 2025, Huang et al., 24 Nov 2025).
Recent frameworks express this as a modular, skill- and difficulty-conditioned sequence-to-sequence problem, often parameterized as $p_\theta(q \mid c, s, a, d)$, where $c$ is the context, $s$ a comprehension skill, $a$ an answer (if specified), and $d$ a difficulty level (Wang et al., 2023, Kumar et al., 2023, Wang et al., 2023).
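One way to make this concrete is the following sketch, with notation chosen here for illustration; the cited systems differ in their exact parameterizations, and the weights $\lambda_i$ are placeholders rather than quantities defined in any of the referenced papers:

```latex
% Illustrative formalization: autoregressive conditioned generation plus a
% weighted selection objective over candidate exercise sets (weights are placeholders).
\begin{aligned}
  p_\theta(q \mid c, s, a, d) &= \prod_{t=1}^{|q|} p_\theta\bigl(q_t \mid q_{<t},\, c,\, s,\, a,\, d\bigr), \\
  E^{*} &= \arg\max_{E \subseteq \mathcal{E}}\;
     \lambda_1\,\mathrm{Cov}(E, D) \;+\; \lambda_2\,\mathrm{Div}(E)
     \;+\; \lambda_3 \sum_{e \in E} \mathrm{Qual}(e).
\end{aligned}
```

Here $\mathrm{Cov}(E, D)$ stands for coverage of the key content of $D$, $\mathrm{Div}(E)$ for diversity of question types, and $\mathrm{Qual}(e)$ for per-item syntactic, semantic, and pedagogical quality, matching the objectives listed above.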
2. Model Architectures and Generation Pipelines
RCEG systems follow a variety of architectures depending on subtask specialization, as tabulated below:
| Subsystem | Model Examples | Key Mechanisms |
|---|---|---|
| Question Generation | T5, Flan-T5, BART, Llama | Seq2Seq, transformer, answer conditioning, skill/difficulty prompts (Kumar et al., 2023, Yang et al., 30 Jul 2025, Wang et al., 2023) |
| Answer Generation | Extractive or generative QA | Pointer networks, span prediction (Kumar et al., 2018) |
| Distractor Generation | Hierarchical encoder-decoder, GPT, PLMs | Static/dynamic attention, mask-based decoding, knowledge-based ranking (Lin et al., 29 May 2024, Gao et al., 2018, Zhang, 2023) |
| Exercise Selection | Discriminator, overgenerate-and-rank | Perplexity/DM rankers, reward models (Huang et al., 24 Nov 2025, Kumar et al., 2023) |
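The table above suggests a natural decomposition into interchangeable modules. A minimal sketch of such interfaces follows; the names and signatures are illustrative and are not drawn from any particular cited system:

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol

@dataclass
class Exercise:
    """A single generated reading comprehension item."""
    question: str
    answer: str
    distractors: list[str] = field(default_factory=list)
    skill: Optional[str] = None       # e.g., a Bloom's-taxonomy label
    difficulty: Optional[str] = None  # e.g., a grade level

class QuestionGenerator(Protocol):
    def generate(self, context: str, skill: str, difficulty: str,
                 answer: Optional[str] = None) -> list[str]: ...

class AnswerExtractor(Protocol):
    def extract(self, context: str, question: str) -> str: ...

class DistractorGenerator(Protocol):
    def generate(self, context: str, question: str, answer: str) -> list[str]: ...

class Ranker(Protocol):
    def score(self, context: str, exercise: Exercise) -> float: ...
```

Because each subsystem only depends on these narrow interfaces, a seq2seq question generator, an extractive answer module, or a reward-model ranker can be swapped without touching the rest of the pipeline.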
A prototypical pipeline consists of:
- Preprocessing and Content Selection: Tokenization, semantic/syntactic tagging, and content segmentation for summarization and coverage (Yang et al., 30 Jul 2025, Zhang, 2023).
- Candidate Generation: Fine-tuned transformer models generate questions, answers, and distractors under controlled prompts for skill, difficulty, and type (Kumar et al., 2023, Wang et al., 2023, Lin et al., 29 May 2024).
- Filtering and Selection: Overgenerate-and-rank frameworks sample multiple candidates and apply scoring models to select high-quality, pedagogically aligned items (Kumar et al., 2023, Huang et al., 24 Nov 2025); a sketch of this stage follows the list.
- Post-hoc Control and Filtering: Dynamic attribute graph (DATG) reweighting, GeDi-based toxicity filtering, and heuristics for answer-in-question, length, and answerability (Huang et al., 24 Nov 2025, Zhang, 2023).
- Output Integration: Assembling validated (question, correct answer, distractors) sets for end-use (Zhang, 2023, Lin et al., 29 May 2024).
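A compressed sketch of the candidate-generation and overgenerate-and-rank stages, wired together through the interfaces sketched earlier; the default skill and difficulty values, the answer-leakage heuristic, and the number of items kept are placeholders rather than settings reported in the cited work:

```python
def overgenerate_and_rank(context: str, qg: QuestionGenerator, ae: AnswerExtractor,
                          dg: DistractorGenerator, ranker: Ranker,
                          skill: str = "inferential", difficulty: str = "grade-4",
                          keep: int = 5) -> list[Exercise]:
    candidates: list[tuple[float, Exercise]] = []
    for question in qg.generate(context, skill=skill, difficulty=difficulty):
        answer = ae.extract(context, question)
        # Cheap heuristics before scoring: answerability and answer-in-question leakage.
        if not answer or answer.lower() in question.lower():
            continue
        exercise = Exercise(question, answer, dg.generate(context, question, answer),
                            skill=skill, difficulty=difficulty)
        candidates.append((ranker.score(context, exercise), exercise))
    # Keep the highest-scoring items for downstream filtering and assembly.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [exercise for _, exercise in candidates[:keep]]
```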
3. Exercise Types, Skill and Difficulty Control
RCEG covers a spectrum of exercise types:
- Literal, Inferential, and Bridging-Inference Question Generation: Classification by the type of cognitive operation required (e.g., retrieval, gap-filling, reference resolution) (Ma et al., 9 Jun 2025, Ghanem et al., 2022).
- Skill-Conditioned Generation: Systems such as SkillQG (Wang et al., 2023) and HTA-WTA (Ghanem et al., 2022) enforce targeting of Bloom’s taxonomy-derived skills or story-based categories by including explicit skill tokens and stepwise prompting for question focus and background knowledge.
- Difficulty Controllability: Fine-grained control of difficulty is achieved by tailored prompts, question templates, or supervised learning with difficulty labels, especially in multi-level educational contexts (Gao et al., 2018, Yang et al., 30 Jul 2025).
Difficulty and skill conditioning is operationalized via:
- Augmented inputs: skill, difficulty, and answer attributes concatenated with the context $c$ before encoding, as in the parameterization $p_\theta(q \mid c, s, a, d)$ above (Wang et al., 2023); see the sketch after this list.
- Explicit prompting: “Generate a Grade 1 factual question…” (Yang et al., 30 Jul 2025, Wang et al., 2023).
- Learning from annotated corpora: Questions labeled with difficulty/skill metadata enable models to match target distributions (Ghanem et al., 2022, Ma et al., 9 Jun 2025).
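A minimal sketch combining the augmented-input and explicit-prompting strategies with an off-the-shelf instruction-tuned model through the Hugging Face transformers API; the control-attribute wording, prompt template, and generation settings are our own choices, not those of SkillQG or the other cited systems:

```python
from typing import Optional

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def conditioned_question(context: str, skill: str, difficulty: str,
                         answer: Optional[str] = None) -> str:
    # Control attributes are prepended to the context as plain text; fine-tuned
    # systems typically reserve dedicated special tokens for the same purpose.
    controls = [f"skill: {skill}", f"difficulty: {difficulty}"]
    if answer is not None:
        controls.append(f"answer: {answer}")
    prompt = (f"Generate a {difficulty} {skill} reading comprehension question. "
              + " | ".join(controls) + f" | context: {context}")
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```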
4. Distractor Generation and MCQ Expansion
Distractor generation for MCQs is addressed via:
- Hierarchical Encoder–Decoder Networks: Systems model both sentence- and word-level dependencies to generate semantically plausible distractor options, leveraging static and dynamic attention to avoid answer overlap and promote contextual relevance (Gao et al., 2018).
- Mask-based and Multi-task Learning (DGRC): Hard chain-of-thought reasoning, sequential and end-to-end mask decoding, and multi-task fine-tuning yield significant performance improvement, especially for context-sensitive, exam-style distractors (Lin et al., 29 May 2024).
- Hybrid NLP and Knowledge Approaches: Lexical, semantic, and named-entity-based candidate gathering and scoring, including knowledge base lookups, embedding similarities, and edit distance heuristics (Zhang, 2023).
- Filtering and Diversity Enforcement: Jaccard distance, distractor order shuffling, and ranking mechanisms to maximize diversity and plausibility (Lin et al., 29 May 2024, Gao et al., 2018).
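A small sketch of the filtering step just described, using token-level Jaccard distance to discard distractor candidates that overlap too heavily with the answer or with each other; the 0.5 threshold and the cap of three distractors are arbitrary placeholders, not values from the cited works:

```python
import random

def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard similarity of the two token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def filter_distractors(answer: str, candidates: list[str],
                       min_dist: float = 0.5, k: int = 3) -> list[str]:
    kept: list[str] = []
    for cand in candidates:
        # Reject candidates too close to the answer or to an already-kept distractor.
        if jaccard_distance(cand, answer) < min_dist:
            continue
        if any(jaccard_distance(cand, prev) < min_dist for prev in kept):
            continue
        kept.append(cand)
        if len(kept) == k:
            break
    random.shuffle(kept)  # distractor-order shuffling before assembly into the MCQ
    return kept
```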
5. Evaluation Metrics and Experimental Protocols
Quality assessment in RCEG integrates both automatic and human evaluation protocols:
- Automated Metrics: BLEU-n, ROUGE-n/L, METEOR, BERTScore, MAP@N, Q-BLEU-4, contextual factuality (CTC), FreeEval, and text informativity (TI = answerability – guessability; sketched after this list) (Yang et al., 30 Jul 2025, Kumar et al., 2023, Huang et al., 24 Nov 2025, Säuberli et al., 11 Apr 2024).
- Human Expert Review: Annotators rate item answerability, fluency, grammaticality, developmental appropriateness, and skill/type alignment. Inter-annotator agreement is measured via Cohen’s κ or Fleiss’ κ (Ma et al., 9 Jun 2025, Säuberli et al., 11 Apr 2024).
- Ranking and Selection: Overgenerate-and-rank pipelines leverage distribution matching (DM) models and discriminators for candidate quality (Kumar et al., 2023, Huang et al., 24 Nov 2025).
- Pedagogical Alignment: Downstream reader performance, preference studies, coverage of semantic elements, and compliance with target skill or difficulty labels are evaluated (Kumar et al., 2023, Wang et al., 2023).
- Zero-Shot and Multilingual Settings: Item generation and evaluation in low-resource languages (e.g., German) use instruction-tuned LLMs and the TI protocol to quantify guessability and answerability (Säuberli et al., 11 Apr 2024).
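A small illustration of the text-informativity metric listed above: answerability and guessability are the rates at which an item is answered correctly with and without the passage shown. The per-item rates are assumed here to come from human test-takers or from a QA model, and the aggregation over items is a plain mean; both choices are ours for illustration:

```python
def text_informativity(answer_rate_with_passage: dict[str, float],
                       answer_rate_without_passage: dict[str, float]) -> float:
    """TI = answerability - guessability, averaged over item ids."""
    scores = [answer_rate_with_passage[item] - answer_rate_without_passage[item]
              for item in answer_rate_with_passage]
    return sum(scores) / len(scores)
```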
Representative results:
- Full-model BLEU-4 up to 27.23 (RCEG-SP Qwen2.5-3B) (Huang et al., 24 Nov 2025), ~2.5× improvement in distractor BLEU-4 (DGRC) (Lin et al., 29 May 2024), MAP@10 (ROUGE-L/BERTScore) above 0.57 (Yang et al., 30 Jul 2025).
- Human-rated item quality exceeds 93% in operational settings, with skill-controllability accuracy around 75–80% for leading models (Ma et al., 9 Jun 2025, Wang et al., 2023).
6. Current Trends, Limitations, and Future Directions
Contemporary RCEG research emphasizes:
- Personalization and Adaptation: Dynamic calibration to learner proficiency, leveraging interaction history and adaptive difficulty (Huang et al., 24 Nov 2025, Yang et al., 30 Jul 2025).
- Skill and Inference Taxonomy Coverage: Expansion from literal/factoid questions to full Bloom’s taxonomy, bridging, and diagnostic inference categories (Wang et al., 2023, Ma et al., 9 Jun 2025, Ghanem et al., 2022).
- Pipeline Integrability: Modular designs supporting joint QG-AG-DG finetuning, multi-task objectives, and plug-and-play subcomponent replacement (Lin et al., 29 May 2024, Zhang, 2023).
- Low-Resource and Multilingual Support: Zero-shot, prompt-based LLM generation across languages, with automatic evaluation frameworks decoupled from reference translations (Säuberli et al., 11 Apr 2024).
- Human-in-the-Loop Validation: Operational deployment mandates robust human review, iterative prompt refinement, and distribution monitoring for inference-type balance and item security (Ma et al., 9 Jun 2025, Säuberli et al., 11 Apr 2024).
Notable limitations include incomplete skill/inference type alignment (e.g., only 42.6% inference-type match in automatic bridging-inference QG (Ma et al., 9 Jun 2025)), dependence on strong pretrained LLMs, and sensitivity to prompt engineering or data domain shift. Methods to directly optimize text informativity and reduce guessability (e.g., integrating TI as a reinforcement learning reward) are proposed as future enhancements (Säuberli et al., 11 Apr 2024).
Advances in dynamic coverage optimization, student modeling, chain-of-thought prompting, and domain adaptation will further refine automated RCEG, facilitating scalable, effective literacy assessment in multi-modal, multi-lingual, and adaptive learning environments.