Question & Answer Taxonomy
- Question and Answer Taxonomy is a structured framework that organizes questions and answers using well-defined linguistic, cognitive, and task-based criteria.
- It supports systematic QA analysis, model conditioning, and retrieval, thereby enhancing performance in natural language processing and dialogue systems.
- Employing rule-based and supervised methods, the taxonomy underpins robust QA pair generation, evaluation, and benchmarking in contemporary research.
A question and answer (QA) taxonomy is a structured classification scheme that organizes questions and/or their associated answers according to well-defined linguistic, cognitive, or task-based criteria. Such taxonomies enable systematic analysis, generation, retrieval, and evaluation of QA pairs in natural language processing, educational technology, open-domain question answering (ODQA), and dialogue systems. The following article provides a comprehensive, technically rigorous overview of QA taxonomies, encompassing definitional foundations, representative hierarchical schemes, cognitive and semantic dimensions, methods of taxonomy deployment in model architectures, and their empirical and benchmarking role in the landscape of QA research.
1. Foundations and Theoretical Origins
The construction of QA taxonomies draws on linguistics, cognitive science, educational pedagogy, and information retrieval. Classical antecedents include Lehnert’s (1978) conceptual question classes, which categorize questions by their informational role—such as causal antecedents (“Why did X happen?”), instrumentals (“How was this done?”), quantification (“How many...?”), feature specification (“What color was...?”), and judgment (“Do you think...?”) (Krishna et al., 2019). These foundational schemes inform modern, operational taxonomies that focus on tractable categories tailored for computational modeling.
In educational and cognitive contexts, Bloom’s Taxonomy and its Anderson & Krathwohl revision provide a six-level hierarchy—Remember, Understand, Apply, Analyze, Evaluate, Create—mapping increasing cognitive complexity to question types (Bates et al., 2013, Sahu et al., 2021). QA taxonomies leveraging these hierarchies can align both question classification and system evaluation with pedagogical goals and cognitive skill assessment.
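The six-level hierarchy can be encoded as a simple ordered structure for downstream classification and evaluation. A minimal sketch (the question stems are illustrative assumptions, not drawn from the cited sources):

```python
# Anderson–Krathwohl revision of Bloom's Taxonomy, ordered by increasing
# cognitive complexity; the example question stems are hypothetical.
BLOOM_LEVELS = [
    ("Remember",   "What is the definition of ...?"),
    ("Understand", "Can you explain ... in your own words?"),
    ("Apply",      "How would you use ... to solve ...?"),
    ("Analyze",    "How does ... relate to ...?"),
    ("Evaluate",   "Which approach is better, and why?"),
    ("Create",     "Can you design a new ... that ...?"),
]

def complexity_rank(level: str) -> int:
    # Rank 0 (Remember) through 5 (Create); raises ValueError if unknown.
    return [name for name, _ in BLOOM_LEVELS].index(level)
```

An ordered encoding like this lets a system compare questions by cognitive depth rather than treating the levels as unordered categories.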
2. Major Taxonomic Schemes
Several complementary taxonomies are prevalent in academic QA research. The following table contrasts prominent schemes by their design axis and application domain.
| Taxonomy | Core Distinction | Top-Level Classes | Use Cases |
|---|---|---|---|
| Lehnert/Krishna–Iyyer | Informational role, specificity | General / Specific / Yes–No | QA hierarchy generation, RC datasets |
| Bloom/Anderson–Krathwohl | Cognitive process | Remember, Understand, ..., Create (6 levels) | Educational QA, cognitive evaluation |
| Entity/Type–based (Gupta et al., 2021) | Answer type, expected entity | Quantification, Entity, Definition, etc. | Semantic matching, IR systems |
| Dialogue discourse (Cruz-Blandón et al., 2019) | Syntactic/pragmatic discourse role | WH, Yes–No, Disjunctive, Phatic, Completion | Dialogue annotation, conversational QA |
| ODQA modality–complexity (Srivastava et al., 2024) | Input/output modality, reasoning complexity | Factoid, Long-form, Ambiguous, Multi-hop, etc. | Dataset/benchmark taxonomy |
| Domain-specific (Science) (Xu et al., 2019) | Problem/conceptual domain (hierarchical) | E.g., Life Science→Photosynthesis (6 levels) | Domain-specific QC, curriculum QA |
Each scheme reflects the underlying functional, cognitive, or structural properties of questions and their answers, with ontological granularity determined by the target application and data regime.
3. Taxonomy Construction: Methods and Operationalization
QA taxonomies can be derived via manual, rule-based, or statistical learning approaches:
- Rule-Based Labeling: Syntactic heuristics (e.g., root wh-word, sentence templates, presence of disjunction) and dependency parses are used to assign specificity or type labels, as demonstrated by Krishna and Iyyer (Krishna et al., 2019) and Cruz-Blandón et al. (Cruz-Blandón et al., 2019).
- Supervised Learning: When rules are insufficiently comprehensive, annotated question corpora enable training of classifiers, such as a CNN over ELMo embeddings for specificity (Krishna et al., 2019) or BiGRU and convolutional networks (Gupta et al., 2021).
- Hierarchical Annotation: Domain experts build nested taxonomies spanning multiple levels, as in the 6-level, 406-leaf science question taxonomy of (Xu et al., 2019). Dual-annotator protocols with adjudication improve reliability, with inter-annotator agreement quantified (e.g., κ = 0.85 at L1 granularity).
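The rule-based approach can be sketched as a precedence-ordered cascade over surface features. The rules below are illustrative wh-word and auxiliary-verb heuristics, not the exact templates used in the cited work:

```python
import re

# Hypothetical precedence-ordered heuristics for coarse question typing.
WH_WORDS = {"who", "what", "when", "where", "why", "how",
            "which", "whom", "whose"}
AUXILIARIES = {"is", "are", "do", "does", "did", "was", "were",
               "can", "will", "should", "have", "has"}

def rule_based_label(question: str) -> str:
    tokens = re.findall(r"[a-z']+", question.lower())
    if not tokens:
        return "other"
    # Precedence filter 1: auxiliary-initial question containing a
    # disjunction ("Is it X or Y?") is labeled disjunctive.
    if tokens[0] in AUXILIARIES and "or" in tokens:
        return "disjunctive"
    # Precedence filter 2: root wh-word.
    if tokens[0] in WH_WORDS:
        return "wh"
    # Precedence filter 3: auxiliary-initial without disjunction.
    if tokens[0] in AUXILIARIES:
        return "yes-no"
    return "other"
```

Real systems replace or augment such surface rules with dependency-parse features, but the cascade structure (earlier rules pre-empt later ones) is the same.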
Operational application to new QA pairs typically involves:
- Parsing and feature extraction (syntactic, semantic).
- Assignment of type(s) based on precedence filters or classifier outputs.
- For hierarchical or multi-label taxonomies, mapping to all relevant nodes (e.g., up to two leaf labels in the science domain (Xu et al., 2019)).
- Deployment in downstream QA generation, retrieval, or evaluation workflows.
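For the hierarchical multi-label step above, assigning a leaf label typically also activates its ancestor nodes. A minimal sketch over a child-to-parent table (the hierarchy fragment is hypothetical, not the actual taxonomy of Xu et al.):

```python
# Hypothetical fragment of a nested taxonomy, encoded child -> parent.
PARENT = {
    "Photosynthesis": "Life Science",
    "Cell Division": "Life Science",
    "Life Science": "Science",
}

def expand_to_ancestors(leaf_labels):
    """Expand leaf label assignments to every ancestor node, so a
    fine-grained label also activates its coarser taxonomy levels."""
    nodes = set()
    for label in leaf_labels:
        while label is not None:
            nodes.add(label)
            label = PARENT.get(label)  # None at the root terminates
    return nodes
```

This ancestor expansion is what lets a classifier trained on leaves still be evaluated, or used for retrieval, at any coarser level of the hierarchy.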
4. Taxonomy-Centered Model Architectures and Applications
QA taxonomies are integral to multiple modeling pipelines:
- Conditioned Generation: In SQUASH (Krishna et al., 2019), specificity labels (general/specific/yes–no) control conditional question generation via neural encoder–decoder models, yielding hierarchical QA trees.
- Semantic Matching and Retrieval: Incorporation of coarse/fine taxonomy classes and focus-word similarity augments neural semantic encoders (e.g., RCNN-Attention), yielding up to +2–8 point absolute accuracy gains in QA pair matching (Gupta et al., 2021).
- Multi-Class/Hierarchical Classification: Deep models (e.g., BERT-QC) trained on question–domain taxonomies produce fine-grained topic labels, improving both answer accuracy and automated error analysis in domain QA (Xu et al., 2019).
- Embedding/Metric Learning: TagRec reformulates QA→label assignment as a similarity optimization—aligning contextualized question–answer embeddings with hierarchical label embeddings via margin-based cosine similarity objectives. This model achieves up to +6% Recall@K over previous best methods and robustly generalizes to previously unseen labels (V et al., 2021).
- Dialogue Annotation: Dual-layer semantic schemes annotate question types and answer types (affirmation, denial, feature, phatic, etc.), supporting discourse-level QA modeling (Cruz-Blandón et al., 2019).
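The margin-based cosine similarity objective behind TagRec-style metric learning can be sketched as a hinge loss with a single sampled negative (a simplification for illustration; the actual training setup, batching, and margin value differ):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def margin_ranking_loss(q, pos_label, neg_label, margin=0.3):
    """Hinge loss: the question-answer embedding q should be at least
    `margin` more cosine-similar to its gold hierarchical label embedding
    than to a sampled negative label embedding."""
    return max(0.0, margin - cosine(q, pos_label) + cosine(q, neg_label))
```

Because the objective is stated purely in embedding space, labels never seen during training can still be ranked at inference time by embedding them into the same space, which is the source of the zero-shot generalization noted above.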
5. Cognitive and Semantic Dimensions
The cognitive complexity and semantic role of QA pairs are central axes in taxonomy design:
- Cognitive Level (Bloom): Empirically, only 3–5% of student-generated questions cluster at the highest ‘Create’ level, with the bulk in ‘Apply’ (40–45%) and ‘Analyze’ (30%) (Bates et al., 2013). Systems built for cognitive evaluation or educational feedback (e.g., Sahu et al., 2021) map questions to these levels and can drive curriculum-aligned scaffolding and performance analytics, as well as supply clarifying context (“proximal context”) to LLMs.
- Semantic Role: For dialogue and information extraction, mapping to semantic roles (theme, agent, temporality, reason) enables finer control over answer expectation, focus, and discourse structuring (Cruz-Blandón et al., 2019, Gupta et al., 2021).
6. Benchmarking, Evaluation, and Challenges
Taxonomies are tightly linked to dataset design and evaluation frameworks:
- ODQA Datasets: Recent meta-analyses categorize datasets by modality (text, text+image, text+tables) and question complexity (short-form, long-form, ambiguous, multi-hop, conversational, cross-lingual, time-sensitive, paraphrased, counterfactual) (Srivastava et al., 2024).
- Metric Taxonomy: Evaluation metrics are classified as human (expert rating, HEQ), lexical (accuracy, EM, F1, ROUGE, BLEU), semantic (BERTScore, MoverScore), or LLM-based (GPTScore, G-Eval). Lexical overlap metrics are reliable for factoid QA but insufficient for generative or ambiguous tasks; semantic and LLM-based scores mitigate some limitations but remain sensitive to hallucinations and prompt artifacts (Srivastava et al., 2024).
- Quality Criteria: Combined rubrics, such as that of (Bates et al., 2013), mandate minimum cognitive (T) and explanatory (E) levels, clarity (C), distractor plausibility (D), correctness (R), and originality (O)—formalized as HighQuality(T,E,C,D,R,O) = [T ≥ 2] ∧ [E ≥ 2] ∧ [C = 1] ∧ [D = 1] ∧ [R = 1] ∧ [O = 1].
- Open Challenges: Robust assessment of factuality and non-hallucination in long-form QA, expansion of complex/multimodal benchmarks, and automation of cognitive-level annotation are current research frontiers (Srivastava et al., 2024, Sahu et al., 2021).
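The conjunctive HighQuality rubric translates directly into a boolean predicate, which makes its gating behavior explicit (variable names follow the formula; thresholds as stated in the rubric):

```python
def high_quality(T, E, C, D, R, O):
    """Conjunctive quality rubric in the style of Bates et al. (2013):
    cognitive level T and explanation level E must reach at least 2,
    while clarity C, distractor plausibility D, correctness R, and
    originality O are binary pass/fail gates (1 = pass)."""
    return T >= 2 and E >= 2 and C == 1 and D == 1 and R == 1 and O == 1
```

Because every criterion is a hard gate, a question failing any single dimension is rejected outright; there is no trading off strong cognitive depth against, say, a weak distractor.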
7. Implications and Outlook
QA taxonomies operationalize decades of theoretical work across linguistics and education into practical annotation, retrieval, and modeling tools for NLP. By integrating hierarchical, specific, and cognitive labeling, systems gain not only in downstream accuracy, but also in interpretability and pedagogical alignment. Despite consistent empirical gains (e.g., +1.7% P@1 in MCQA with label-based query expansion (Xu et al., 2019); +6% Recall@K TagRec over prototypes (V et al., 2021)), limitations persist—taxonomy coverage, reliance on supervised data, depth of hierarchy, and adaptation to cross-modal or cross-lingual QA.
A plausible implication is that continued refinement of QA taxonomies—especially those which couple hierarchical structure, semantic role, and cognitive depth—will remain foundational for scalable, robust, and explainable QA systems. Advances are likely in automating large-scale fine-grained labeling, optimizing taxonomy-aware neural representations, and developing benchmarks that demand detailed reasoning across modalities and cognitive spectra.