Personalized English Reading Comprehension Tests
- Personalized English reading comprehension tests are adaptive systems that align test item difficulty with individual proficiency using IRT and error-driven updates.
- They leverage modular pipelines for proficiency prediction, adaptive item selection, and automated evaluation to provide targeted learning feedback.
- Emerging frameworks integrate large language models for transcreation of passages and questions, enhancing engagement, comprehension gains, and diagnostic accuracy.
Personalized English reading comprehension tests are adaptive assessment systems that dynamically generate or select reading materials and associated questions, matching both the content and difficulty of each item to an individual test-taker’s abilities, history, or interests. These frameworks operationalize concepts from educational psychology, item response theory (IRT), natural language processing, and LLM–based question generation to maximize diagnostic value, engagement, and learning gains for diverse populations including English as a Foreign Language (EFL) and K–12 learners (Wang et al., 2023, Huang et al., 2018, Han et al., 12 Nov 2025).
1. Theoretical Foundations and Motivation
Personalization in reading comprehension assessment is motivated by substantial individual variability in vocabulary knowledge, schematic background, and inference skill. One-size-fits-all tests can be demotivating: advanced readers may be under-challenged, while beginners may face frustration and disengagement. The prevailing theoretical underpinning is Vygotsky’s Zone of Proximal Development (ZPD), which stipulates that optimal learning occurs when a learner is challenged just beyond their current level with appropriate scaffolding. This requires accurate estimation of individual proficiency and continuously aligning assessment item difficulty to current ability estimates: specifically, targeting items whose calibrated difficulty parameter $b$ satisfies $\theta < b \le \theta + \epsilon$ for a small positive $\epsilon$ (the “challenge margin”) (Wang et al., 2023).
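As a minimal illustration of this selection criterion (a sketch under assumptions, not an implementation from the cited work; the item names and default margin are hypothetical), a calibrated item bank can be filtered to the challenge margin as follows:

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    difficulty: float  # IRT difficulty parameter b, on the same scale as theta

def items_in_challenge_margin(bank, theta, epsilon=0.5):
    """Return items just beyond the learner's current ability estimate,
    i.e. those with theta < b <= theta + epsilon (the "challenge margin")."""
    return [it for it in bank if theta < it.difficulty <= theta + epsilon]

# Example: with theta = 0.2 and epsilon = 0.5, only items with b in (0.2, 0.7] qualify.
bank = [Item("q1", -0.3), Item("q2", 0.4), Item("q3", 1.5)]
print([it.item_id for it in items_in_challenge_margin(bank, theta=0.2)])  # ['q2']
```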
2. System Architectures: Components and Pipelines
Most personalized reading comprehension systems are modular, comprising key components for candidate and learner modeling, adaptive item selection or generation, response evaluation, and continuous updating of proficiency estimates. Pipeline data flow is often as follows:
- Proficiency Prediction: Consumes historical response logs to estimate a latent ability parameter, typically denoted $\theta$.
- Item Selection and/or Generation: Uses $\theta$ to select or create new items with target difficulty $b \approx \theta + \epsilon$.
- Automated Evaluation: Scoring module provides correctness/partial credit and diagnostic feedback.
- Update Loop: Incorporates latest response/outcome into the proficiency estimator, and iterates (Wang et al., 2023, Huang et al., 2018, Yang et al., 30 Jul 2025).
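The loop below is a minimal sketch of this data flow; the callables passed in stand for the cited systems’ concrete modules, which are not specified here.

```python
from typing import Any, Callable, Dict, List

def adaptive_session(
    learner_log: List[Dict[str, Any]],
    item_bank: List[Any],
    predict_proficiency: Callable[[List[Dict[str, Any]]], float],
    select_item: Callable[[List[Any], float], Any],
    administer: Callable[[Any], str],
    grade_response: Callable[[Any, str], Dict[str, Any]],
    n_items: int = 10,
) -> List[Dict[str, Any]]:
    """Run one adaptive assessment session: estimate ability from the log,
    pick an item near the target difficulty, grade the response, and fold
    the outcome back into the log before the next iteration."""
    for _ in range(n_items):
        theta = predict_proficiency(learner_log)       # Proficiency Prediction
        item = select_item(item_bank, theta)           # Item Selection / Generation
        answer = administer(item)                      # deliver item, collect response
        outcome = grade_response(item, answer)         # Automated Evaluation
        learner_log.append({"item": item, "theta": theta, "outcome": outcome})  # Update Loop
    return learner_log
```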
A distinctive recent trend is the integration of generative large language models (e.g., GPT-4o, ChatGPT, T5, Llama) for both crafting reading passages (including interest-aligned “transcreation”) and generating multiple-choice or open-ended questions at parameterizable difficulty or cognitive demand (Wang et al., 2023, Han et al., 12 Nov 2025, Yang et al., 30 Jul 2025).
3. Algorithms for Proficiency and Difficulty Calibration
Proficiency modeling and difficulty ranking are central technical challenges.
a. Proficiency Estimation
Systems typically use an IRT model, most commonly the two-parameter logistic (2PL) model with recency/adaptivity weighting:

$$P(\text{correct} \mid \theta, a_i, b_i) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},$$

where $\theta$ is the test-taker’s ability, $b_i$ the item’s difficulty, and $a_i$ its discrimination. Proficiency updates are computed as

$$\theta \leftarrow \theta + \eta\,(y - P),$$

where $y \in \{0, 1\}$ records correctness and $\eta$ is a learning/recency rate (Wang et al., 2023, Huang et al., 2018).
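A compact numerical sketch consistent with these formulas (illustrative only; the cited systems’ exact weighting schemes may differ):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response: P = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def update_theta(theta: float, y: int, a: float, b: float, eta: float = 0.1) -> float:
    """Online ability update: move theta toward the observed outcome y (1 = correct,
    0 = incorrect) in proportion to the surprise (y - P) and the learning/recency rate eta."""
    return theta + eta * (y - p_correct(theta, a, b))

# Worked example: a learner at theta = 0.0 answers an item (a = 1.2, b = 0.5) correctly.
theta = 0.0
print(round(p_correct(theta, 1.2, 0.5), 3))   # ~0.354
theta = update_theta(theta, y=1, a=1.2, b=0.5)
print(round(theta, 3))                        # theta rises to ~0.065
```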
b. Item Difficulty Estimation and Ranking
Difficulty ($b$) can be derived from historical human performance (Rasch scoring), but automated approaches use:
- Level Classifiers: Transformer-based classifiers (e.g., ELECTRA) produce a probability distribution over difficulty levels, which is converted to a scalar difficulty score (e.g., an expected level).
- Zero-shot LLM Scoring: LLMs compare items in pairs or provide scalar ratings; comparative scoring achieves the highest alignment with gold-standard difficulty (by Spearman correlation), and combining it with classification yields further improvement on MCQRD (Raina et al., 16 Apr 2024).
- LLM Difficulty Control in Generation: Prompts specify cognitive demand and explicit difficulty bands; outputs are filtered by predicted alignment with learner profiles (Wang et al., 2023, Yang et al., 30 Jul 2025).
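The sketch below illustrates two of these strategies under assumptions: collapsing a level classifier’s probability distribution into an expected-level scalar, and ranking items by pairwise comparison wins, where the `harder` oracle stands in for an actual zero-shot LLM prompt.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

def scalar_difficulty(level_probs: Sequence[float]) -> float:
    """Collapse a probability distribution over ordered difficulty levels
    (level 1 = easiest) into a single expected-level score."""
    return sum((i + 1) * p for i, p in enumerate(level_probs))

def rank_by_pairwise_comparison(
    items: List[str],
    harder: Callable[[str, str], bool],
) -> List[str]:
    """Rank items by how often a comparison oracle (e.g., a zero-shot LLM prompt
    asking 'which question is harder?') judges them harder than the alternatives."""
    wins: Dict[str, int] = {it: 0 for it in items}
    for a, b in combinations(items, 2):
        winner = a if harder(a, b) else b
        wins[winner] += 1
    return sorted(items, key=lambda it: wins[it])  # fewest "harder" wins first, i.e. easiest first

# Example: expected level from a classifier distribution over 5 difficulty bands.
print(scalar_difficulty([0.1, 0.2, 0.4, 0.2, 0.1]))  # 3.0
```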
4. Methods of Personalization: From Skill to Interest
Personalization operates at several layers:
- Skill/Proficiency Matching: After initial calibration with 5–10 items spanning the difficulty spectrum, subsequent items are algorithmically selected or generated to match the current ability estimate $\theta$.
- Error Log–Driven Intervention: Error-driven systems log failed concepts (e.g., specific grammar constructions, vocabulary bands, coreference types) and prioritize their repetition in future quizzes to promote rectification (Huang et al., 2018).
- Interest Alignment: Recent work “transcreates” passages by embedding both source topic labels and learner interest profiles into a shared semantic space and maximizing an alignment function between the learner’s 33-dimensional topic-preference vector $\mathbf{u}$ and the candidate topic embeddings $\mathbf{v}_k$ (Han et al., 12 Nov 2025); a sketch of this matching step follows the table below. Multiple-choice questions are transcreated to preserve cognitive demand (Bloom’s level) and key linguistic features. Controlled trials show significant gains in comprehension and motivation for interest-aligned tests.
| Personalization Dimension | Operationalization in Recent Systems | Representative Work |
|---|---|---|
| Ability/Proficiency | IRT-based modeling, online update | (Wang et al., 2023, Huang et al., 2018) |
| Error-driven remediation | Error logs, algorithmic review scheduling | (Huang et al., 2018) |
| Interest alignment | Topic profile–content matching, transcreation | (Han et al., 12 Nov 2025) |
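A minimal sketch of the interest-alignment step referenced above, assuming cosine similarity as the alignment function (the cited work’s exact function may differ) and toy vectors in place of the real topic space:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_aligned_topic(u: np.ndarray, topic_embeddings: dict) -> str:
    """Pick the candidate topic whose embedding best aligns with the learner's
    topic-preference vector u (e.g., a 33-dimensional interest profile)."""
    return max(topic_embeddings, key=lambda name: cosine(u, topic_embeddings[name]))

# Toy example with 3-dimensional vectors standing in for the real topic space.
u = np.array([0.9, 0.1, 0.0])                        # learner strongly prefers the first topic axis
topics = {"sports": np.array([1.0, 0.0, 0.1]),
          "history": np.array([0.0, 1.0, 0.2])}
print(best_aligned_topic(u, topics))                 # "sports"
```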
5. Automated Generation and Evaluation of Items
Automated reading comprehension test generation encompasses question and passage generation, vocabulary assessment, and automated scoring:
- Question Generation: Prompt engineering guides LLMs to create MCQs (literal, vocabulary-in-context, inference, critical reasoning) or open-ended questions with explicit difficulty/type parameters.
Example prompt (Wang et al., 2023):

```text
You are an expert English teacher.
Passage: "{full_passage_text}"
Task: Generate one inference question that tests reading between the lines.
Difficulty: Level L3 (moderate inference)
Format:
1. Question stem.
2. Four options (A–D).
3. Provide correct answer and rationale.
```

- Vocabulary Assessment: Systems like K-tool (Flor et al., 24 May 2025) cluster nominal multi-word expressions in a passage using embedding-based affinity propagation, select related and unrelated terms (by vector similarity and PNPMI), and generate forms that measure background topical knowledge, a proxy for likely comprehension success.
- Automated Evaluation: OpenAI models are prompted as “strict but fair graders,” returning JSON with a score and feedback; scoring for open-ended items may apply a rubric-based scoring function (Wang et al., 2023). Automatic MCQ scoring relies on direct pattern matching or LLM-based evaluation (Yang et al., 30 Jul 2025); a small scoring sketch follows this list.
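For the pattern-matching and rubric components of automated scoring, a hedged sketch (the regex, field names, and weights are illustrative, not taken from the cited systems):

```python
import re
from typing import Dict, Optional

def extract_choice(response: str) -> Optional[str]:
    """Pull the first standalone option letter (A-D) out of a free-form answer,
    e.g. 'The answer is (B) because ...' -> 'B'."""
    match = re.search(r"\b([A-D])\b", response.upper())
    return match.group(1) if match else None

def score_mcq(response: str, answer_key: str) -> int:
    """1 if the extracted choice matches the key, else 0."""
    return int(extract_choice(response) == answer_key.upper())

def rubric_score(criteria_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted rubric aggregation for open-ended items: each criterion is scored
    in [0, 1] and combined by its weight (weights assumed to sum to 1)."""
    return sum(weights[c] * criteria_scores[c] for c in weights)

print(score_mcq("I think the answer is (b), since the author implies ...", "B"))  # 1
print(rubric_score({"accuracy": 1.0, "evidence": 0.5, "language": 0.8},
                   {"accuracy": 0.5, "evidence": 0.3, "language": 0.2}))           # 0.81
```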
6. Empirical Validation, Results, and Effectiveness
Empirical evaluation combines learning gain measurement, item analysis, and user study methodologies.
- Comprehension and Motivation: Controlled experiments with interest-aligned transcreated passages found that personalized groups exhibited greater gains in comprehension (including total score) and higher retention of motivation (no significant drop in IMMS scores) relative to random-topic controls (Han et al., 12 Nov 2025).
- Diagnostic and Remedial Gains: Personalized, error-driven systems yielded higher rectification rates for repeated misconceptions ($0.54$ vs. $0.10$) and stronger pre–post gains in experimental groups (Wilcoxon test) (Huang et al., 2018).
- Difficulty Calibration: LLM-assigned difficulty labels show agreement with human annotation as measured by Cohen’s $\kappa$ (Wang et al., 2023); comparative zero-shot LLM ranking aligns with gold-standard difficulty as measured by Spearman correlation (Raina et al., 16 Apr 2024).
7. Limitations and Future Directions
Current personalized systems face several open challenges:
- Difficulty Calibration: Accurate alignment of model-predicted item difficulty with actual human performance requires large, well-annotated calibration datasets.
- Quality of LLM Outputs: LLM-generated distractors sometimes hallucinate or introduce ambiguity; adversarial filtering and human-in-the-loop validation remain necessary (Wang et al., 2023, Han et al., 12 Nov 2025).
- Interest Modeling: Alignment by broad topic may miss fine-grained interest distinctions (“tennis skill” vs. “tennis culture”); future systems may require richer interest elicitation (Han et al., 12 Nov 2025).
- Adaptive Delivery: Real-time adaptation to detected misconception, balancing motivation and challenge, and minimizing engagement drop-off remain active areas (Huang et al., 2018).
- Multimodal and Multilingual Extensions: Future systems may incorporate audio/text modalities and scale to non-English or discipline-specific domains (Wang et al., 2023, Han et al., 12 Nov 2025).
This summary covers best practices and architectures for the design and evaluation of personalized English reading comprehension assessments, integrating insights from contemporary work on adaptive proficiency modeling, difficulty calibration, content-interest alignment, automated generation, and feedback loops in LLM-driven pipelines (Wang et al., 2023, Raina et al., 16 Apr 2024, Huang et al., 2018, Flor et al., 24 May 2025, Han et al., 12 Nov 2025, Yang et al., 30 Jul 2025).