Automated Subjective Question Generation
- Automated Subjective Question Generation is the algorithmic process of producing open-ended, higher-order questions that require analysis, evaluation, and creative synthesis.
- It employs diverse methods like template extraction, transformer-based generation, and generate-then-validate pipelines to ensure contextually relevant and scalable assessments.
- The approach underpins adaptive educational systems by aligning generated questions with Bloom’s taxonomy and validating quality through both human and automated evaluations.
Automated subjective question generation refers to the algorithmic creation of open-ended, higher-order questions that target analysis, evaluation, synthesis, or reflection, as contrasted with objective, fact-based prompts. This area has emerged as a critical subfield within educational natural language processing and intelligent tutoring systems, aiming to support scalable, reliable, and contextually valid assessment and dialogic engagement in academic, professional, and informal learning contexts.
1. Formal Definitions and Scope
Automated subjective question generation (ASQG) is defined as the process by which an algorithm receives an information source (e.g., instructional text, passage, or learning objective) and produces one or more open-ended or higher-order prompts. These prompts demand interpretive, analytical, evaluative, or generative responses, mapped to the upper Bloom's taxonomy levels: Analyze, Evaluate, and Create (Scaria et al., 8 Aug 2024, Islam et al., 19 Dec 2025). In formal notation, the task can be cast as a function
$$G : (c, \ell) \mapsto \{q_1, \dots, q_k\},$$
where $c$ is the source content, $\ell$ an optional target cognitive level, and each output question $q_i$ requires an answer that cannot be ascertained by retrieval of a unique fact. Subjective question generation is typically contrasted with the objective case, where questions elicit constrained or factoid responses (Chhabra et al., 2022).
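A minimal, illustrative sketch of this signature in Python (the class and function names are assumptions for exposition, not an API from any of the cited systems):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubjectiveQuestion:
    text: str          # the open-ended prompt shown to the learner
    bloom_level: str   # "Analyze", "Evaluate", or "Create"

def generate_subjective_questions(source: str, target_level: str, k: int = 1) -> List[SubjectiveQuestion]:
    """Toy realization of G(c, l): wrap the source topic in a level-appropriate
    open-ended stem. Real systems replace this with the paradigms of Section 2."""
    stems = {
        "Analyze":  "How would you differentiate the key ideas presented in: {c}?",
        "Evaluate": "Critique the central claim made in: {c}.",
        "Create":   "Propose a novel extension of the approach described in: {c}.",
    }
    return [SubjectiveQuestion(stems[target_level].format(c=source), target_level)
            for _ in range(k)]

print(generate_subjective_questions("attention mechanisms in transformers", "Evaluate"))
```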
This field encompasses template-based linguistic methods, neural sequence generation, in-context prompting, retrieval-augmented systems, hybrid pipelines, and unsupervised rule-mining strategies (Narayanan et al., 2023, Maity et al., 29 Jan 2025, Scaria et al., 8 Aug 2024).
2. System Architectures and Methodological Taxonomy
Approaches to ASQG generally fall into five methodological paradigms:
- Template-driven extraction and paraphrasing: Systems such as GenQ extract contextually rich question templates from human-written data, replacing content words with part-of-speech (PoS) tags to create slot-based templates. Candidate questions are instantiated by matching templates to novel contexts and are then paraphrased via pretrained transformers to increase fluency and naturalness (Narayanan et al., 2023); a toy slot-filling sketch appears after this list.
- Hybrid unsupervised pipelines: Obj2Sub exemplifies a pipeline comprising rule-based transformation of objective to subjective questions, knowledge graph–based retrieval (e.g., “People Also Ask”), neural augmentation (T5 seq2seq), and dense semantic reranking. This leverages the strengths of explicit patterns, real-world question analogues, and neural diversity, while maintaining scalability without labeled pairs (Chhabra et al., 2022).
- Transformer-based sequence generation: Modern systems fine-tune or instruction-tune autoregressive LLMs (e.g., Mistral-7B, LLaMA-2, GPT-3.5/4) on synthetic or human-annotated subjective QA pairs, often incorporating metadata such as Bloom's level, domain, and difficulty (Islam et al., 19 Dec 2025, Thakur et al., 28 Sep 2025, Scaria et al., 8 Aug 2024). Prompt engineering with multi-step “chain of thought,” definition-based, or exemplar-based inputs can significantly affect Bloom-level targeting and output diversity (Scaria et al., 8 Aug 2024).
- Generate-then-validate pipelines: Small LLMs (e.g., Phi-2) are harnessed using an abundance-first approach: diverse open-ended candidates are generated, then validated or filtered using probabilistic confidence, syntactic criteria, and relevance scores that quantify alignment with the intended learning objective (Wei et al., 10 Dec 2025). This architecture has been shown to produce high-quality subjective questions when guided by strong filtering; a minimal filtering sketch appears after this list.
- Pedagogically guided and RL-augmented generation: In domains such as mathematics, subjective (Socratic) subquestion generators condition or plan over problem-specific operators/equations, incorporate reinforcement learning (RL) rewards for fluency, granularity, and answerability, and optimize both automatic and human-aligned metrics (Shridhar et al., 2022).
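A toy illustration of the template-driven paradigm above (not GenQ's actual implementation; the hand-coded PoS lexicon and the example sentences are assumptions chosen for self-containment):

```python
import re

# Toy PoS lexicon standing in for a real tagger (e.g., spaCy or NLTK).
POS = {"lion": "NOUN", "savanna": "NOUN", "lives": "VERB"}

def extract_template(question: str) -> str:
    """Replace content words with PoS slots, keeping function words intact."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", question)
    return " ".join(f"[{POS[t]}]" if t in POS else t for t in tokens)

def instantiate(template: str, fills: dict) -> str:
    """Fill PoS slots left-to-right with content words from a new context."""
    def fill(match):
        tag = match.group(1)
        return fills[tag].pop(0) if fills.get(tag) else match.group(0)
    return re.sub(r"\[([A-Z]+)\]", fill, template)

seed = "Why do you think the lion lives in the savanna?"
template = extract_template(seed)
# -> "Why do you think the [NOUN] [VERB] in the [NOUN] ?"
print(instantiate(template, {"NOUN": ["plant", "desert"], "VERB": ["grows"]}))
# A pretrained paraphraser would then smooth the surface form.
```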
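The generate-then-validate paradigm can be sketched as the filtering routine below (a minimal sketch under assumed scoring functions and thresholds; these are not the criteria or values reported by Wei et al.):

```python
from typing import Callable, List

def generate_then_validate(
    candidates: List[str],                  # over-generated open-ended prompts from a small LM
    lm_confidence: Callable[[str], float],  # e.g., normalized log-likelihood under the generator
    relevance: Callable[[str], float],      # e.g., similarity to the target learning objective
    min_conf: float = 0.5,
    min_rel: float = 0.6,
) -> List[str]:
    """Abundance-first filtering: keep candidates that are well formed,
    confidently generated, and aligned with the learning objective."""
    def well_formed(q: str) -> bool:
        return q.strip().endswith("?") and len(q.split()) >= 6
    kept = [q for q in candidates
            if well_formed(q) and lm_confidence(q) >= min_conf and relevance(q) >= min_rel]
    # Rank survivors so a downstream step can select the top-k questions.
    return sorted(kept, key=lambda q: lm_confidence(q) + relevance(q), reverse=True)
```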
3. Cognitive Frameworks and Pedagogical Taxonomies
A recurring operationalization of subjectivity and higher-order question properties is the revised Bloom's taxonomy (Scaria et al., 8 Aug 2024, Thakur et al., 28 Sep 2025, Islam et al., 19 Dec 2025). This hierarchy's upper tiers, Analyze (level 4), Evaluate (level 5), and Create (level 6), are explicitly targeted during both model training and prompt engineering. Many pipelines annotate each instance in their training corpora with the intended cognitive level, which then informs generation, filtering, and evaluation steps (a prompt-construction sketch follows the table below). This alignment is crucial for ensuring that generated questions elicit non-factoid responses and support assessment or dialog consistent with educational outcomes.
The following table collates the core Bloom levels used in these systems:
| Level | Name | Typical Question Stem/Task |
|---|---|---|
| 4 | Analyze | "How would you differentiate…?", "Analyze the…" |
| 5 | Evaluate | "Critique…", "Assess the merits of…" |
| 6 | Create | "Design a pipeline for…", "Propose a novel…" |
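In deployed pipelines this annotation is typically injected directly into the generation prompt. A hedged sketch of such a Bloom-conditioned prompt builder follows (the wording and fields are illustrative, not the exact prompts used in the cited systems):

```python
BLOOM_NAMES = {4: "Analyze", 5: "Evaluate", 6: "Create"}

def build_prompt(context: str, level: int, domain: str, difficulty: str) -> str:
    """Compose a Bloom-conditioned instruction for an LLM question generator."""
    return (
        f"You are an assessment designer for the {domain} domain.\n"
        f"Write ONE open-ended question at Bloom level {level} ({BLOOM_NAMES[level]}) "
        f"with difficulty '{difficulty}'.\n"
        "The question must require interpretation, evaluation, or synthesis, "
        "not recall of a single fact.\n\n"
        f"Source material:\n{context}\n"
    )

print(build_prompt("Attention distributes weights over input tokens...", 5, "NLP", "hard"))
```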
4. Evaluation Protocols and Empirical Findings
ASQG evaluation spans human, automated, and hybrid protocols:
- Human rubrics and expert ranking: Multi-point rubrics combining grammaticality, clarity, contextual relevance, answerability, and adherence to the intended Bloom level are standard (Scaria et al., 8 Aug 2024). Cohen's κ and quadratic-weighted κ are frequently reported for inter-rater and model-versus-human agreement; a computation sketch follows this list.
- Automated and LLM-based evaluation: Some works utilize instruction-tuned LLMs (e.g., Gemini Pro, GPT-4) as surrogate judges, leveraging high κ agreement with human ratings for quality and alignment (Scaria et al., 8 Aug 2024, Wei et al., 10 Dec 2025, Islam et al., 19 Dec 2025).
- Semantic and diversity metrics: PINC scores, BERTScore, cosine similarity, and perplexity thresholds are used to measure linguistic novelty, contextual alignment, and fluency (Scaria et al., 8 Aug 2024, Thakur et al., 28 Sep 2025).
- Learning outcome and engagement: Platforms such as AnveshanaAI measure not only question quality but also downstream engagement (streak length, completion rate) and assessment reliability (inter-rater agreement, e.g., Cohen’s κ=0.81 for essay rubric scores) (Thakur et al., 28 Sep 2025).
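For illustration, quadratic-weighted Cohen's κ and a cosine-similarity check can be computed as follows (scikit-learn's `cohen_kappa_score` is an existing API; the toy score vectors and embeddings are assumptions):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Rubric scores (1-5) from a human expert and an LLM judge on the same questions.
human = [5, 4, 3, 5, 2, 4]
model = [5, 4, 4, 5, 2, 3]

# Quadratic weighting penalizes large disagreements more heavily than off-by-one ones.
qwk = cohen_kappa_score(human, model, weights="quadratic")
print(f"Quadratic-weighted kappa: {qwk:.3f}")

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, used here for question-to-source contextual alignment."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q_vec, src_vec = np.array([0.2, 0.7, 0.1]), np.array([0.25, 0.65, 0.05])  # toy embeddings
print(f"Cosine similarity: {cosine(q_vec, src_vec):.3f}")
```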
Empirical benchmarks indicate that instruction-tuned or prompt-optimized LLMs such as GPT-4 and Mistral 7B achieve upwards of 89% "high-quality" subjective question generation rates, with Bloom-level (Skill) matches exceeding 70% (Scaria et al., 8 Aug 2024, Islam et al., 19 Dec 2025). Hybrid pipelines and validate-and-filter models further enhance precision and alignment.
5. Cultural Context, Domain Adaptivity, and Practical Application
Cultural and domain adaptation in ASQG systems remains a focus for both data collection and system configuration. GenQ demonstrates that caregiver-question styles vary significantly by demographic, motivating template mining stratified by group (e.g., Latinx vs. non-Latinx) (Narayanan et al., 2023). However, most systems do not embed cultural features directly at generation time, instead handling adaptation via template-bank selection or prompt exemplars.
Platform-level deployments—such as AnveshanaAI—operationalize domain and cognitive adaptivity at runtime by tagging each context with Bloom level, domain, and difficulty, and dynamically selecting prompts for each learner. Skill-tracing and Bayesian updating facilitate progression from basic recall to open-ended, subjective challenges (Thakur et al., 28 Sep 2025).
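Bayesian skill updating of this kind is commonly formalized as Bayesian knowledge tracing; the sketch below uses the standard BKT update (generic equations with assumed parameters, not necessarily the exact rule implemented by AnveshanaAI):

```python
def bkt_update(p_know: float, correct: bool,
               slip: float = 0.10, guess: float = 0.20, learn: float = 0.15) -> float:
    """One knowledge-tracing step: posterior P(skill known | observation),
    then mix in the probability of learning during the attempt."""
    if correct:
        posterior = p_know * (1 - slip) / (p_know * (1 - slip) + (1 - p_know) * guess)
    else:
        posterior = p_know * slip / (p_know * slip + (1 - p_know) * (1 - guess))
    return posterior + (1 - posterior) * learn

p = 0.30
for observed_correct in [True, True, False, True]:
    p = bkt_update(p, observed_correct)
print(f"Estimated mastery: {p:.2f}  (promote to Evaluate/Create items above a chosen threshold)")
```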
Concrete examples of subjective questions at different Bloom levels, along with structured rubrics for scoring, are provided in both educational and technical domains (a rubric sketch follows the examples below). For instance:
- "Evaluate the ethical considerations and potential societal impacts of deploying LLMs such as GPT-4 in open-ended tutoring applications" (Thakur et al., 28 Sep 2025).
- "Critique the author’s argument regarding semantic drift in historical texts, supporting your points with textual evidence" (Islam et al., 19 Dec 2025).
6. Limitations, Open Problems, and Future Directions
Despite strong empirical advances, several challenges persist:
- Coverage and diversity: Many template or rule-based systems struggle with the full diversity of valid subjective question forms, particularly for under-represented syntactic or cultural patterns (Chhabra et al., 2022, Narayanan et al., 2023).
- Evaluation granularity: Reference-free automated metrics do not yet reliably substitute for expert judgment, especially for nuanced pedagogical attributes (Scaria et al., 8 Aug 2024).
- Model capacity and adaptation: Smaller models are outperformed by large LLMs in answer evaluation; however, parameter-efficient fine-tuning (LoRA/QLoRA), hybrid pipelines, and generate-then-validate approaches can narrow the gap for question generation, improving accessibility for lower-resource deployments (Islam et al., 19 Dec 2025, Wei et al., 10 Dec 2025); an adapter-configuration sketch follows this list.
- End-to-end integration: Most current architectures decouple question generation from answer evaluation and rubric-based scoring, with few offering coherent closed-loop or interactive learning flows (Thakur et al., 28 Sep 2025, Islam et al., 19 Dec 2025).
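For the lower-resource fine-tuning route noted above, a hedged sketch of a parameter-efficient adapter setup with the Hugging Face `peft` library (the hyperparameters and target modules are illustrative choices, not the settings reported in the cited work):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load a base model; QLoRA variants would additionally pass a 4-bit quantization config.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank of the low-rank adapter matrices
    lora_alpha=32,                         # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only adapter weights remain trainable
```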
Potential extensions include end-to-end fine-tuning of neural generators with direct integration of Bloom-level or demographic embeddings, collection of large-scale human-judgment data for rubric calibration, richer reinforcement learning rewards (e.g., teacher judgments, automatic difficulty estimation), and dynamic, dialogic ASQG within curriculum-aligned tutoring systems (Shridhar et al., 2022, Narayanan et al., 2023).
7. Representative Example Systems and Comparative Results
The following table summarizes salient properties and results of recent influential ASQG systems:
| System | Generation Strategy | Evaluation Highlights | Notable Findings/Strengths |
|---|---|---|---|
| GenQ (Narayanan et al., 2023) | Template extraction + paraphrase | Negative binomial regression on question counts | Cultural template mining, interpretability |
| Obj2Sub (Chhabra et al., 2022) | Hybrid rule-based + neural + retrieval | R@3=0.408, P@3=0.408 (ObjQA dev), +36% over T5 | Effective unsupervised OQ→SQ conversion |
| Socratic QG (Shridhar et al., 2022) | T5-based seq2seq + RL + planning | BLEU/BERTScore, human quality increases | Content planning, equation-based Socratic Q |
| AnveshanaAI (Thakur et al., 28 Sep 2025) | Fine-tuned LLM + Bloom meta + filtering | BERTScore F1=0.427, Cohen’s κ=0.81 (essays) | Bloom-level operationalization, adaptivity |
| AEQG LLMs (Scaria et al., 8 Aug 2024) | Prompt engineering (PS1–5) + LLMs | GPT-4: 89% high-quality, 70% Skill match | PINC 0.92+, expert-calibrated outputs |
| SLM pipeline (Wei et al., 10 Dec 2025) | Generate then probabilistically validate | Cohen’s κ (Phi-2 vs. humans): 0.76 (answers) | SLMs viable with strong validation |
| Subjective QG + eval (Islam et al., 19 Dec 2025) | LoRA/QLoRA LLM tuning, GPT-4 synthetic data | 1st-rank (QG): Mistral 7B 65.8%, GPT-3.5 30.7% | Mistral 7B SOTA for subjective QG |
References
- (Narayanan et al., 2023) GenQ: Template-based open-ended QG for caregivers
- (Chhabra et al., 2022) Obj2Sub: Unsupervised conversion of objective to subjective questions
- (Shridhar et al., 2022) Socratic subquestion generation for math
- (Thakur et al., 28 Sep 2025) AnveshanaAI: Adaptive platform, Bloom-level annotation, explainability
- (Scaria et al., 8 Aug 2024) Automated educational QG across Bloom's levels with LLMs
- (Maity et al., 29 Jan 2025) In-Context Learning and Retrieval-Augmented Hybrid QG
- (Wei et al., 10 Dec 2025) Generate-then-Validate approach with small LMs
- (Islam et al., 19 Dec 2025) LoRA/QLoRA-tuned LLMs for subjective QG and answer evaluation
These works collectively define the theoretical, algorithmic, and empirical landscape of automated subjective question generation, with continuing research focused on compositionality, evaluation reliability, multi-domain and multicultural adaptation, and dynamic, closed-loop educational integration.