
Automatic Question Generation (AQG)

Updated 13 November 2025
  • AQG is a field that develops algorithms to automatically create questions and answers from structured or unstructured data, enhancing educational and assessment practices.
  • Methodologies include neural sequence-to-sequence learning, pointer networks, and template-based synthesis to ensure grammaticality, relevance, and cognitive complexity.
  • Evaluations leverage automatic metrics like BLEU and ROUGE alongside human assessments to measure linguistic quality, answerability, and pedagogical effectiveness.

Automatic Question Generation (AQG) is a field at the intersection of natural language processing, computational pedagogy, and artificial intelligence, concerned with the development of algorithms, models, and systems for the automatic production of questions—often alongside their answers—from unstructured or structured input data. AQG systems are critical for applications in educational technology, reading comprehension assessment, intelligent tutoring, and knowledge base construction, and are evaluated both for the linguistic quality of their outputs and their pedagogical utility in fostering comprehension or testing knowledge.

1. Task Formulations and Core Methodologies

Automatic question generation encompasses a spectrum of formulation approaches, characterized by the type of input (textual, tabular, logical, ontological), the scope and structure of the question (factoid, analytical, examination-style, MCQ), and the desired output properties (answerability, clarity, cognitive level).

AQG Task Definitions

  • Input Types: Ranges from sentences, paragraphs, and documents (e.g., SQuAD, RACE corpora) to knowledge graphs, ontologies (Alkhuzaey et al., 8 Apr 2025), logical formulas (Yang et al., 9 May 2024, Wang et al., 13 Oct 2025), and multi-modal data.
  • Output Structure: Questions may be unbounded text (open-ended QG), multiple-choice, fill-in-the-blank, or logical equivalence statements. Several pipelines produce question–answer (QA) pairs or triplets.
  • Task Objectives: Core desiderata include grammatical well-formedness, answerability (from supplied context), contextual relevance, cognitive appropriateness, and—especially in pedagogical contexts—explicit alignment with learning outcomes or taxonomic levels (e.g., Bloom's taxonomy (Karbasi et al., 6 Nov 2025)).

Methodological Paradigms

  • Sequence-to-Sequence Learning: Dominant in neural AQG (Kumar et al., 2018, Muis et al., 2020), using RNN (GRU/LSTM) or Transformer encoder–decoder models (e.g., T5, BART, mT5 (Ushio et al., 2023)), often augmented with attention, pointer-generator or copy mechanisms, coverage loss, and linguistic feature channels (POS, NER, dependency features); a minimal sketch follows this list.
  • Pointer Networks and Answer Selection: Many pipelines decompose QG into answer selection (e.g., pointer networks to identify salient spans (Kumar et al., 2018)) followed by question realization conditioned on the answer.
  • Template-Based and Syntactic Approaches: Particularly in domain-specific or logic-based settings, formal grammars (Yang et al., 9 May 2024, Wang et al., 13 Oct 2025), syntactic transformation rules, and semantic enrichment (e.g., via hypernym resolution or NER (Danon et al., 2017)) are used for tractable, controlled QA generation.
  • Retrieval-Augmented and In-Context Learning: Recent advances include RAG (retrieval-augmented generation via dense retrievers and encoder–decoder LMs) and in-context learning with LLMs (ICL), sometimes hybridized for improved contextual relevance and pedagogical value (Maity et al., 29 Jan 2025).
  • Ontology-Based Generation: For structured knowledge bases, AQG effectiveness is a function of the ontology's pattern richness, class hierarchy, instantiation, and relation fan-out, operationalized by domain-specific metrics (Alkhuzaey et al., 8 Apr 2025).
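
To make the sequence-to-sequence paradigm concrete, the sketch below performs answer-conditioned question generation with a T5-style encoder–decoder via the Hugging Face transformers library. The checkpoint name and the "generate question: answer: ... context: ..." input format are illustrative assumptions; in practice a checkpoint fine-tuned on a QG corpus such as SQuAD would replace the placeholder.

```python
# Minimal sketch of answer-conditioned seq2seq question generation.
# The checkpoint and the input format are illustrative assumptions;
# a real system would load a QG-fine-tuned model instead.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"  # placeholder; swap in a QG-fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = ("Ada Lovelace wrote the first published algorithm intended "
           "for execution on Charles Babbage's Analytical Engine.")
answer = "Ada Lovelace"

# Condition the encoder on both the target answer and the source context.
source = f"generate question: answer: {answer} context: {context}"
inputs = tokenizer(source, return_tensors="pt", truncation=True)

# Beam search is a common decoding choice for grammatical, focused questions.
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=48)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```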

2. Architectures, Pipelines, and Representative Systems

AQG systems are typically constructed as multi-stage pipelines, with architectures adapted to the nature of the data and the linguistic or pedagogical targets.

Neural Sequence Models

  • Two-Stage Pipelines: Joint answer selection and question generation, with pointer-network/LSTM or transformer-based answer selectors, followed by seq2seq question realization with global and/or local attention (Kumar et al., 2018). Rich features (POS, NER, dependency, BIO tags) and embedding strategies are commonly integrated (Muis et al., 2020); a schematic two-stage sketch follows this list.
  • Refinement Networks: Cascaded decoders (draft + refine) use dual attention—over context and prior drafts—to improve grammaticality, specificity, and answerability, with optional reinforcement tuning for fluency or answerability (Nema et al., 2019).
  • Multi-Agent and Iterative Frameworks: Agentic workflows with specialist Teacher, Critic, and Bloom (cognitive-demand) agents enable inference-time complexity control and collaborative incremental refinement for math or science domains (Karbasi et al., 6 Nov 2025).
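
The two-stage decomposition can be sketched as follows, with spaCy entity spans standing in for a learned pointer-network answer selector and realize_question acting as a placeholder for an answer-conditioned seq2seq generator; the selection heuristic and the helper name are assumptions made for illustration only.

```python
# Two-stage AQG sketch: (1) select candidate answer spans, (2) realize a
# question conditioned on each span. Entity spans stand in for a learned
# pointer-network selector; realize_question is a placeholder for an
# answer-conditioned seq2seq model.
import spacy

nlp = spacy.load("en_core_web_sm")

def select_answer_spans(passage: str, max_candidates: int = 3):
    """Return salient spans (here, named entities) as answer candidates."""
    doc = nlp(passage)
    # Prefer longer entities on the assumption that they are more content-bearing.
    spans = sorted(doc.ents, key=lambda ent: len(ent.text), reverse=True)
    return [(ent.text, ent.start_char, ent.end_char) for ent in spans[:max_candidates]]

def realize_question(passage: str, answer: str) -> str:
    """Placeholder: call an answer-conditioned generator here."""
    return f"[question conditioned on answer '{answer}']"

passage = ("Marie Curie won the Nobel Prize in Physics in 1903 and the "
           "Nobel Prize in Chemistry in 1911.")
for answer, start, end in select_answer_spans(passage):
    print(f"answer span [{start}:{end}] '{answer}' ->", realize_question(passage, answer))
```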

Grammar-Driven and Formal-Language AQG

  • Grammar and Attribute-Based Synthesis: For logic education, BNF grammars and algebraic equivalence laws are encoded as rule sets; syntax trees are sampled with controlled transformations and semantic attributes to ensure correctness and uniform difficulty (Yang et al., 9 May 2024, Wang et al., 13 Oct 2025); a toy sketch follows this list.
  • Syntactic and Semantic Enrichment Rules: Linguistic transformation operations (e.g., subject–auxiliary inversion, WH-mapping, hypernym insertion) are applied to create factoid, entity-, or concept-based QA items in closed domains (Danon et al., 2017).
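
A toy version of grammar-driven synthesis for logic exercises is sketched below: a propositional formula is sampled from a small BNF-style grammar and then rewritten with a truth-preserving law (double-negation introduction), so the original and rewritten forms become an equivalence exercise. The grammar, the single rewrite law, and the step-count control are deliberate simplifications, not the cited systems' actual rule sets.

```python
# Toy grammar-driven generator for logical-equivalence exercises.
# A formula is sampled from a tiny grammar, then rewritten with a
# truth-preserving law; the (original, rewritten) pair becomes a
# "show these are equivalent" question. Heavily simplified on purpose.
import random

VARIABLES = ["p", "q", "r"]

def sample_formula(depth: int = 2) -> str:
    """Sample a propositional formula from a small BNF-style grammar."""
    if depth == 0:
        return random.choice(VARIABLES)
    op = random.choice(["and", "or", "not"])
    if op == "not":
        return f"~({sample_formula(depth - 1)})"
    symbol = "&" if op == "and" else "|"
    return f"({sample_formula(depth - 1)} {symbol} {sample_formula(depth - 1)})"

def rewrite(formula: str, steps: int = 2) -> str:
    """Apply double-negation introduction repeatedly to control step count."""
    for _ in range(steps):
        formula = f"~(~({formula}))"
    return formula

lhs = sample_formula()
rhs = rewrite(lhs)
print(f"Question: show that {lhs} is logically equivalent to {rhs}")
```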

Ontology and Retrieval-Based Systems

  • Ontology Fitness and Question Diversity: The capacity of ontology-driven AQG to yield high-calibre, cognitively varied questions is tightly tied to population, relationship richness, and hierarchical structure, which can be quantified via a ROMEO-derived nine-metric suite (Alkhuzaey et al., 8 Apr 2025); a schematic MCQ sketch follows this list.
  • Large-Context and Retrieval Augmentation: With LLMs such as Gemini and GPT-4, full-book ingestion (million-token context) or RAG methods underpin scalable long-span AQG, entity-based QA generation, and side-by-side model evaluation pipelines (Bohnet et al., 31 May 2024, Maity et al., 29 Jan 2025).
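
The ontology-driven route can be illustrated with the schematic below, in which the correct option of a multiple-choice item is an instance's class and the distractors are drawn from sibling classes; the tiny hierarchy, question template, and selection policy are hypothetical, but the sketch shows why sibling fan-out and instantiation metrics bear directly on MCQ quality.

```python
# Schematic ontology-driven MCQ generation: the key (correct option) is an
# instance's class and distractors come from sibling classes, so low sibling
# fan-out or missing instances make the item impossible to build.
# The hierarchy and template are hypothetical, for illustration only.
import random

SUBCLASSES = {"Mammal": ["Dog", "Cat", "Whale"], "Bird": ["Eagle", "Penguin"]}
INSTANCES = {"Dog": ["Rex"], "Cat": ["Whiskers"], "Whale": ["Moby"],
             "Eagle": ["Sam"], "Penguin": ["Pingu"]}

def generate_mcq(parent: str, cls: str, n_distractors: int = 2):
    """Build one MCQ whose distractors are sibling classes of `cls`."""
    siblings = [c for c in SUBCLASSES[parent] if c != cls]
    if len(siblings) < n_distractors or not INSTANCES.get(cls):
        return None  # insufficient sibling fan-out or no instances
    instance = random.choice(INSTANCES[cls])
    options = random.sample(siblings, n_distractors) + [cls]
    random.shuffle(options)
    return {"stem": f"To which class does the individual '{instance}' belong?",
            "options": options, "answer": cls}

print(generate_mcq("Mammal", "Dog"))
```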

3. Evaluation Protocols and Metrics

Evaluation in AQG demands both linguistic and task-specific metrics, often accompanied by human judgments.

Automatic Metrics

  • Surface Overlap: BLEU-n, ROUGE-L, METEOR, and NIST remain widely used but have been shown to correlate poorly with answerability and pedagogical value in AQG (Nema et al., 2018, Wang et al., 2023).
  • Answerability and Semantic Correctness: Hybrid metrics (e.g., Q-BLEU/Q-METEOR, defined as δ·Answerability + (1–δ)·BLEU/METEOR (Nema et al., 2018)), as well as prompting-based LLM answerability assessment (PMAN; binary "YES/NO" answerability via chain-of-thought prompting (Wang et al., 2023)), yield substantially better alignment with human annotation and should complement surface metrics; a minimal sketch of the hybrid metric follows this list.
  • Pedagogical/Qualitative Metrics: MIRROR employs an iterative LLM-based review loop to score questions on grammaticality, appropriateness, relevance, novelty, and cognitive complexity; iterative feedback notably raises correlation with human educators (Pearson’s r improved by 0.15–0.25 absolute compared to direct LLM scoring) (Deroy et al., 16 Oct 2024).
  • Fitness for Template Diversity: Ontology-based AQG is evaluated via pattern coverage, class richness, relationship diversity/coverage, average depth, and sibling fan-out, capturing the diversity, depth, and MCQ distractor capacity of the question set (Alkhuzaey et al., 8 Apr 2025).
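
Because the hybrid answerability metric above is a convex combination, it can be sketched in a few lines. In the sketch below, the answerability score is assumed to come from an external judge (a trained scorer or an LLM prompt, as in PMAN), sentence-level BLEU from NLTK serves as the overlap term, and the default weight δ = 0.7 is an arbitrary illustrative choice rather than the value used in the cited work.

```python
# Minimal sketch of a hybrid Q-Metric: delta * Answerability + (1 - delta) * Overlap.
# The answerability value is assumed to be supplied by an external scorer or
# LLM judge; sentence-level BLEU is used as the surface-overlap component, and
# the default delta is an illustrative choice.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def q_metric(reference: str, candidate: str, answerability: float,
             delta: float = 0.7) -> float:
    """Convex combination of an answerability score and BLEU overlap."""
    overlap = sentence_bleu(
        [reference.split()], candidate.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    return delta * answerability + (1 - delta) * overlap

reference = "who painted the mona lisa ?"
candidate = "which artist created the mona lisa ?"
print(q_metric(reference, candidate, answerability=0.9))
```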

Human and Extrinsic Evaluation

  • Expert Annotation: Linguistic quality (fluency, syntactic/semantic correctness), contextual relevance, pedagogical appropriateness, and answerability are rated by multiple annotators, with inter-rater agreement (e.g., Cohen's κ) typically reported; a short agreement-computation example follows this list.
  • Domain-Specific Evaluation: For logical equivalence AQG, solution step complexity and alignment with historical exam difficulties are benchmarked; for math/science, meta-evaluation axes include relevance, clarity, difficulty matching, and cognitive demand (Karbasi et al., 6 Nov 2025, Yang et al., 9 May 2024).
  • Error and Ablation Analyses: Performance is commonly dissected by answer type (entities vs. concepts), span distance, translation artifacts, and model ablations (loss of pointer/copy/coverage mechanisms, or removal of enrichment steps) (Muis et al., 2020, Jia et al., 2020).
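
As a small illustration of the agreement reporting mentioned above, the snippet below computes Cohen's κ for two annotators' binary answerability judgments with scikit-learn; the label vectors are toy values used purely for demonstration.

```python
# Inter-rater agreement (Cohen's kappa) for two annotators rating the
# answerability of generated questions (1 = answerable, 0 = not).
# The label vectors are toy values for illustration only.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")
```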

4. Key Findings, Limitations, and Impact in Practice

Research across AQG subfields has yielded several robust findings:

  • Neural Advances and Remaining Challenges: Transformer models (BART, T5, mT5, Gemini 1.5 Pro) exhibit strong performance, but struggle with answer selection, question focus, and reasoning over long or contextually diffuse passages, especially as answer length and conceptual abstraction increase (Mishra et al., 2020, Jia et al., 2020).
  • Hybrid and Iterative Strategies Excel: In educational settings, ICL and hybrid RAG+ICL approaches using GPT-4 or similar LLMs can outperform vanilla seq2seq training on grammaticality, relevance, and appropriateness, provided high-quality, stratified in-context examples and retrieval corpora are available (Maity et al., 29 Jan 2025).
  • Evaluation Limitations and Progress: Relying solely on surface metrics is now considered insufficient; answerability-focused and multi-dimensional LLM-in-the-loop evaluation (PMAN, MIRROR) provide a closer surrogate to human conceptual judgment (Nema et al., 2018, Wang et al., 2023, Deroy et al., 16 Oct 2024).
  • Complexity and Cognitive Control: For domains with explicit difficulty progression (e.g., mathematics, logic), rule-based and multi-agent systems can ensure output matches targeted complexity profiles and cognitive buckets, fostering both practice at scale and defense against academic dishonesty (Karbasi et al., 6 Nov 2025, Yang et al., 9 May 2024, Wang et al., 13 Oct 2025).
  • Ontology-Aware Approaches: Question diversity, MCQ quality, and conceptual depth depend strongly on properties of the underlying ontologies; the most suitable ontologies exhibit pattern completeness, rich class and instance population, high relation and sibling richness, and sufficient taxonomic depth (Alkhuzaey et al., 8 Apr 2025).

5. Open Research Problems and Prospective Directions

Despite significant progress, AQG remains an evolving research area with several persistent challenges:

  • Content Selection and Pedagogical Targeting: Explicitly aligning generated questions with learning objectives, taxonomic levels, and cognitive load remains under-explored; scalable, theory-grounded methods for content selection and adaptive formulation are needed.
  • Cross-Lingual and Multimodal AQG: Adapting to low-resource languages, multimodal documents, and domain-specialized contexts continues to pose significant modeling and evaluation difficulties (Kumar et al., 2019, Muis et al., 2020).
  • Robust Evaluation: Human-in-the-loop and LLM-centric evaluation frameworks are needed for nuanced properties (e.g., fairness, diversity, scaffolding, answerability across paraphrastic variance) but must be made reproducible, cost-effective, and bias-resistant (Wang et al., 2023, Deroy et al., 16 Oct 2024).
  • Personalization and Anti-Plagiarism: Techniques for generating individualized, isomorphic-difficulty questions—especially in online learning and assessment scenarios—must integrate cryptographic randomness, formal grammars, and step-controlled transformations (Yang et al., 9 May 2024, Wang et al., 13 Oct 2025).
  • Scalability and Integration with Downstream Systems: AQG systems should increasingly interface with retrieval pipelines, knowledge bases, and adaptive assessment engines, with research needed on the transferability of synthetic QA data for downstream QA or instructional impact (Bohnet et al., 31 May 2024).

6. Resources and Benchmarks

The AQG field benefits from a range of public datasets, toolkits, and frameworks:

  • SQuAD, RACE: Large-scale reading comprehension/curriculum QA corpora (Jia et al., 2020)
  • HotpotQA, DROP: Multi-hop/analytical QA for reasoning assessment (Nema et al., 2019)
  • QG-Bench, TyDiQA: Multilingual QA resource sets (Ushio et al., 2023)
  • Long-Answer QG: Benchmark for long-context answer QG (Mishra et al., 2020)
  • lmqg, AutoQG: Multilingual AQG training toolkit and web application (Ushio et al., 2023); a usage sketch follows this list
  • PMAN, MIRROR: LLM-based evaluation pipelines for answerability and pedagogical quality (Wang et al., 2023, Deroy et al., 16 Oct 2024)
  • Ontology QA fitness suite: Nine-metric evaluation of ontology suitability for AQG (Alkhuzaey et al., 8 Apr 2025)
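
For the lmqg toolkit listed above, end-to-end QA-pair generation is typically a few lines of code. The sketch below follows the pattern in the toolkit's documentation, but the exact class name, constructor arguments, and default checkpoint should be verified against the installed lmqg release and are treated here as assumptions.

```python
# Sketch of QA-pair generation with the lmqg toolkit (Ushio et al., 2023).
# Class and method names follow the toolkit's documented usage; verify them
# against the installed lmqg version before relying on this snippet.
from lmqg import TransformersQG

model = TransformersQG(language="en")  # loads a default English QG/AE model
context = ("The Eiffel Tower was completed in 1889 and is about "
           "330 metres tall.")

# generate_qa returns question-answer pairs extracted from the context.
for question, answer in model.generate_qa(context):
    print(question, "->", answer)
```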

Best practices entail rigorous benchmarking on linguistically and pedagogically diverse datasets, careful separation of answer selection and question realization evaluation, and regular human- or LLM-mediated auditing for non-superficial quality indicators.


In summary, AQG research has matured from template-based, linguistically deterministic approaches through neural sequence learning to hybrid, multi-agent, and ontology-informed systems, with concomitant advances in both automatic and human-aligned evaluation. Key research frontiers include pedagogically meaningful content selection, cross-lingual and domain adaptation, personalized and complexity-controlled generation, and integration with learning analytics and assessment pipelines.
