MCQ Taxonomy & Question Styles
- MCQ taxonomies and question-style schemes are comprehensive frameworks that classify items by cognitive demand, distractor design, and answer structure for both human and LLM evaluations.
- They employ psychometric methods such as GLMM analyses and IRT models to calibrate item difficulty and discrimination, ensuring reliable assessment outcomes.
- Advanced approaches include generative MCQs and explanation-based formats that improve scoring validity and provide richer diagnostic insights into test-taker performance.
Multiple-choice question (MCQ) taxonomy and question-style research addresses the formal classification, function, and construction principles of MCQs in both human- and machine-targeted assessments. MCQs are characterized by systematic design variables—such as cognitive skill target, domain specificity, distractor typology, and scoring protocol—that shape their validity, reliability, and diagnostic utility for both humans and LLMs. Recent scholarship grounds these frameworks in psychometrics, educational practice, and LLM evaluation theory (Balepur et al., 19 Feb 2025, Chen et al., 28 Jan 2026, Jonsdottir et al., 2021, Gupta et al., 2021, Xu et al., 2019).
1. Formal MCQ Taxonomies
MCQ taxonomies provide rigorous schema for classifying item structure, answer type, cognitive demand, and distractor configuration.
Answer Type and Structure
A foundational taxonomy encodes each MCQ as the tuple $(k, s_{\text{NOTA}}, s_{\text{AOTA}}, r)$, where $k$ is the number of options, $s_{\text{NOTA}}$ and $s_{\text{AOTA}}$ are indicators for the presence of "None of the Above" or "All of the Above" options, and $r$ denotes the role (key or distractor) of the special option. This yields four canonical MCQ styles: Standard (no special options), MCQ+NOTA, MCQ+AOTA, and Hybrid (Jonsdottir et al., 2021).
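A minimal Python sketch of this tuple encoding; the field names are illustrative rather than the notation of Jonsdottir et al. (2021):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MCQStyle:
    n_options: int        # k: total number of answer options
    has_nota: bool        # "None of the Above" present
    has_aota: bool        # "All of the Above" present
    special_is_key: bool  # r: special option is the key (vs. a distractor)

    def canonical_style(self) -> str:
        """Map the tuple to one of the four canonical MCQ styles."""
        if self.has_nota and self.has_aota:
            return "Hybrid"
        if self.has_nota:
            return "MCQ+NOTA"
        if self.has_aota:
            return "MCQ+AOTA"
        return "Standard"

print(MCQStyle(4, has_nota=True, has_aota=False, special_is_key=False).canonical_style())
# -> MCQ+NOTA
```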
Cognitive Skill Target
Expanded frameworks leverage Bloom’s taxonomy to classify MCQs by their cognitive objective: Remember, Understand, Apply, and Analyze. These levels are operationalized via distinct question stem templates and distractor rewriting rules as follows (Chen et al., 28 Jan 2026):
| Bloom Level | Question Stem | Option Construction |
|---|---|---|
| Remember | Identification of violated practice | Original description unchanged |
| Understand | Cause/effect explanation | Options rephrased as explanations |
| Apply | Prospective recommendation | Forward-looking actions |
| Analyze | Comparative evaluation | Options detailed with pros/cons |
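These levels translate naturally into a stem-generation table. A minimal sketch, assuming illustrative template wording (the exact templates of Chen et al. are not reproduced here):

```python
# Illustrative stem templates keyed by Bloom level; the wording is an
# assumption, not the exact phrasing used by Chen et al. (28 Jan 2026).
BLOOM_TEMPLATES = {
    "Remember":   "Which practice is violated in the following scenario?\n{scenario}",
    "Understand": "Why does the following scenario violate practice {practice}?\n{scenario}",
    "Apply":      "Given the scenario below, which action should be taken next?\n{scenario}",
    "Analyze":    "Which option best weighs the trade-offs in the scenario below?\n{scenario}",
}

def build_stem(level: str, scenario: str, practice: str = "") -> str:
    """Instantiate a question stem for the requested Bloom level."""
    return BLOOM_TEMPLATES[level].format(scenario=scenario, practice=practice)
```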
Domain-Specific Taxonomies
Hierarchical domain taxonomies assign MCQs to nested topical categories (e.g., in science exams: Astronomy → Orbits), supporting multi-level inference and fine-grained curricular analysis (Xu et al., 2019).
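For illustration, such a taxonomy can be represented as a tree and each item resolved to its full root-to-leaf path; the category labels below are toy examples:

```python
# Toy hierarchical topic taxonomy; labels are illustrative.
TAXONOMY = {
    "Science": {
        "Astronomy": ["Orbits", "Stellar Evolution"],
        "Biology": ["Cells", "Genetics"],
    }
}

def taxonomy_path(leaf, tree=TAXONOMY, path=()):
    """Return the root-to-leaf path for a leaf topic, or None if absent."""
    for node, children in tree.items():
        if isinstance(children, dict):
            found = taxonomy_path(leaf, children, path + (node,))
            if found:
                return found
        elif leaf in children:
            return path + (node, leaf)
    return None

print(taxonomy_path("Orbits"))  # ('Science', 'Astronomy', 'Orbits')
```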
2. MCQ Question-Style Classification Schemes
Question styles vary along answer type, stem focus, and expected response structure. Six coarse classes, extended by fine sub-types, are used in large-scale semantic matching and MCQ design (Gupta et al., 2021):
| Coarse Type | Sub-Type Examples | MCQ Fit |
|---|---|---|
| Quantification | Age, Time, Number | Numeric key selection |
| Entity | Person, Location | Named entity selection |
| Definition | Entity, Concept | "What is…" type items |
| Description | Mechanism, Reason | Explanation/differentiation |
| List | Entity set, Quant. set | Select-all-that-apply |
| Selection | Alternative, True/False | Standard MCQ, binary |
For science MCQs, question styles are further differentiated via stem phrasing and dependency-root features; item glosses and label glosses are leveraged to maximize classifier accuracy and semantic match (Xu et al., 2019, Gupta et al., 2021).
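A minimal rule-based sketch of coarse style classification from stem phrasing alone; production systems add dependency-root and gloss features, and the regex patterns here are illustrative, not those of the cited work:

```python
import re

# First-match rules mapping stem phrasing to the six coarse types.
STYLE_RULES = [
    (r"\bhow (many|much|old|long)\b", "Quantification"),
    (r"\b(who|where|which (person|country|city))\b", "Entity"),
    (r"\bwhat (is|are) (a|an|the meaning of)\b", "Definition"),
    (r"\b(why|how does|explain)\b", "Description"),
    (r"\b(select all|which of the following apply)\b", "List"),
    (r"\b(true or false|which of the following)\b", "Selection"),
]

def classify_stem(stem: str) -> str:
    """Return the first matching coarse style, defaulting to Selection."""
    s = stem.lower()
    for pattern, label in STYLE_RULES:
        if re.search(pattern, s):
            return label
    return "Selection"

print(classify_stem("How many moons does Mars have?"))  # Quantification
```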
3. Distractor Taxonomies and Special Option Analysis
Distractor construction critically determines item discriminability and cognitive engagement:
- Plausibility: Distractors should mirror genuine misconceptions and be homogeneous in scope.
- Special Options: "All of the Above" (AOTA) and "None of the Above" (NOTA) disrupt standard guessing strategies, but their inclusion affects item difficulty and discrimination. Empirically, AOTA as a distractor increases $P(\text{correct})$ to 0.88, while AOTA as the key reduces it to 0.79; NOTA as a distractor yields 0.82 (Jonsdottir et al., 2021).
Principled distractor design must avoid cues from surface artifacts, and the number of distractors should be capped (optimal: $2$ or $3$) to balance cognitive load against guessing rate.
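A worked example of the guessing-rate side of this trade-off: under classic formula scoring with penalty $1/(k-1)$, blind guessing has zero expected value for any option count $k$, while the raw guessing rate $1/k$ still motivates capping the number of distractors:

```python
# Expected score of blind guessing under penalty (negative) marking.
# The penalty 1/(k-1) is the classic formula-scoring choice.
def expected_guess_score(k: int) -> float:
    p = 1 / k                # chance of guessing the key among k options
    penalty = 1 / (k - 1)    # deduction for a wrong answer
    return p * 1.0 - (1 - p) * penalty

for k in (3, 4, 5):
    print(k, round(1 / k, 2), round(expected_guess_score(k), 3))
# guessing rate falls (0.33 -> 0.25 -> 0.2); expected penalized score stays 0.0
```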
4. MCQ Generation, Scoring, and Calibration Methodologies
Generation Pipelines
Contemporary MCQ generation leverages extraction of actionable practices, deduplication algorithms governing which practices are retained, and LLM-based scenario construction. Items are psychometrically screened via GLMM analyses (Chen et al., 28 Jan 2026).
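A hedged sketch of the deduplication step; the retention rule below (keep a practice only if no already-kept practice exceeds a similarity threshold) is an assumption, as the paper's exact criterion is not reproduced here:

```python
from typing import Callable, List

def dedupe(practices: List[str],
           similarity: Callable[[str, str], float],
           threshold: float = 0.9) -> List[str]:
    """Greedy deduplication: retain a practice only if it is not too
    similar to any practice already kept. `similarity` is a placeholder,
    e.g. cosine similarity over sentence embeddings."""
    kept: List[str] = []
    for p in practices:
        if all(similarity(p, q) < threshold for q in kept):
            kept.append(p)
    return kept
```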
Scoring Protocols
MCQ scoring extends beyond raw accuracy. Penalty (negative) marking, probability scoring (eliciting calibrated confidences), elimination scoring, and full latent-trait calibration with Item Response Theory (IRT) are all used (Balepur et al., 19 Feb 2025). The two-parameter logistic IRT model is:

$$P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}$$

where $\theta$ is the agent's latent ability, $b_i$ is the difficulty of item $i$, and $a_i$ is its discrimination. IRT enables filtering out poorly discriminative items and supports test assembly with targeted difficulty.
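A direct transcription of the 2PL model in Python, with an illustrative discrimination-based filter (the 0.5 cutoff is an assumption, not a value from the cited work):

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that an agent with latent ability theta
    answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def filter_items(items, min_discrimination: float = 0.5):
    """Drop items whose fitted discrimination is too low to separate
    strong from weak test-takers."""
    return [it for it in items if it["a"] >= min_discrimination]

items = [{"id": 1, "a": 1.2, "b": 0.0}, {"id": 2, "a": 0.1, "b": -1.0}]
print(round(p_correct_2pl(theta=1.0, a=1.2, b=0.0), 2))  # 0.77
print([it["id"] for it in filter_items(items)])          # [1]
```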
Psychometric Metrics
Difficulty and discrimination are modelled via GLMMs, with model- and Bloom-level discrimination defined as:

$$d_{m,p} = \bar{a}_{m,p} - \bar{a}_{m,b}$$

where $\bar{a}_{m,p}$ is the mean accuracy of model $m$ on practice $p$ and $\bar{a}_{m,b}$ is its aggregate accuracy at Bloom level $b$ (Chen et al., 28 Jan 2026).
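Computing this accuracy-gap measure from a table of per-item results is straightforward; the column names below are illustrative, not the paper's schema:

```python
import pandas as pd

# Toy per-item results for one model.
df = pd.DataFrame({
    "model":    ["m1", "m1", "m1", "m1"],
    "practice": ["p1", "p1", "p2", "p2"],
    "bloom":    ["Apply", "Apply", "Apply", "Apply"],
    "correct":  [1, 1, 0, 1],
})

acc_p = df.groupby(["model", "practice"])["correct"].mean()  # mean accuracy per practice
acc_b = df.groupby(["model", "bloom"])["correct"].mean()     # aggregate accuracy per Bloom level

d = acc_p[("m1", "p1")] - acc_b[("m1", "Apply")]
print(d)  # 1.0 - 0.75 = 0.25
```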
5. Generative and Hybrid MCQ Styles
Advanced MCQ frameworks incorporate generative elements to probe open-ended knowledge and model explanation quality:
- Constructed Response (CR): Removes options; LLMs must generate the key. Scoring uses automated semantic metrics or verifier models.
- Explanation MCQA (E-MCQA): Requires both option selection and a free-form justification, scored for factuality, faithfulness, and plausibility, often with automated rubrics and secondary explanation verification networks (Balepur et al., 19 Feb 2025).
These approaches better align with the full range of user needs and support richer diagnostic outputs, offering partial credit and surfacing model reasoning gaps.
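For constructed responses, a lightweight stand-in for the semantic metrics or verifier models mentioned above is token-level F1 against the reference key; this sketch is illustrative, not the scorer used in the cited work:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the reference key."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the mitochondria produce ATP", "mitochondria produce ATP"), 2))  # 0.86
```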
6. Best Practices and Practical Guidelines
Practitioner recommendations emphasize rubric-driven item writing, pre-testing and psychometric piloting, bias detection (e.g., via contrast sets), and iterative revision:
- Domain Definition: Specify the target skill or knowledge domain.
- Format Fit: Select MCQ, CR, or E-MCQA format based on cognitive and practical constraints.
- Rubric-Guided Construction: Apply validated item-writing taxonomies to detect ambiguity, multiple keys, negative stems, and defective special options (Balepur et al., 19 Feb 2025).
- Plausible Distractors: Craft distractors within the same domain, avoiding out-of-scope eliminations (Xu et al., 2019).
- IRT-based Assembly: Use IRT-calibrated items to build tests at desired difficulty and discrimination levels.
- Bias/Shortcut Detection: Employ adversarial or contrastive item checks to expose or correct model-usable artifacts, as in the choices-only probe sketched below.
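One such check presents the options without the stem and flags items the model still answers above chance. A minimal sketch, where `ask_model` is a hypothetical placeholder for the evaluation harness's inference call:

```python
import random

def choices_only_accuracy(items, ask_model) -> float:
    """Fraction of items answered correctly from options alone.
    `ask_model` is a hypothetical callable: (stem, options) -> chosen option."""
    correct = 0
    for item in items:
        options = item["options"][:]
        random.shuffle(options)  # also break positional cues
        pred = ask_model(stem="", options=options)
        correct += (pred == item["key"])
    return correct / len(items)

# Accuracy well above 1/len(options) signals surface artifacts in the options.
```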
For both human and LLM assessment contexts, principled MCQ taxonomy and question-style design are foundational to reliable, valid, and interpretable measurement, supporting both benchmarking and instructional application (Balepur et al., 19 Feb 2025, Chen et al., 28 Jan 2026, Jonsdottir et al., 2021, Xu et al., 2019).