Closed-Ended QA: Approaches, Metrics, and Applications
- Closed-ended QA is a paradigm in which systems select responses from a predetermined set, enabling clear, automated evaluation of factual accuracy.
- It encompasses formats like extractive span selection, multiple-choice, yes/no, and classification, with applications in NLP, healthcare, and vision-language tasks.
- Methodologies such as neural span extractors, meta-learning, and RL-based search optimize precision and tackle domain-specific challenges in closed-ended QA.
Closed-ended Question Answering (QA) refers to a class of question answering tasks in which the system must select a response from a predetermined, finite set of possible answers. These tasks contrast with open-ended QA, where outputs can be arbitrarily diverse or unconstrained in form. Closed-ended QA spans formats such as extractive span selection, multiple-choice, yes/no, and finite-label classification, and is a foundational paradigm in both natural language processing and vision-language modeling. Its essential characteristic is that factual correctness can be determined unambiguously through automated accuracy metrics.
1. Formal Definitions and Subtypes
Closed-ended QA encompasses several well-defined task settings:
- Extractive QA: Given a context passage $c$ and a question $q$, the system identifies an answer span $a$ within $c$. Models are trained to maximize $P(a \mid c, q)$ over all candidate spans, with supervision provided by gold spans (Lewis et al., 2019).
- Multiple-Choice QA (MCQA): Questions are accompanied by a fixed set of options $\{a_1, \dots, a_k\}$; the model selects one (or, less commonly, more) as correct. This is the dominant format in medical, legal, and educational QA datasets (Olatunji et al., 2024, Arias-Duart et al., 10 Feb 2025). A minimal scoring sketch follows this list.
- Binary and Boolean QA: A special case where the answer set is $\{\text{yes}, \text{no}\}$. Yes/no QA is prevalent in conversational, clinical, and visual domains (Hwang et al., 2022, Lu et al., 5 Jan 2026).
- Closed-book QA: The model answers questions from parametric memory only, with no access to external sources at inference time. This has been used to probe factual recall in large pre-trained LMs (Wang et al., 2021).
- Classification QA: The model assigns one or more category labels (ICD codes, clinical attributes) to a passage or image. This is standard in medical record analysis and multimodal vision-language question answering (Yim et al., 30 Dec 2025, Lu et al., 5 Jan 2026).
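Because the answer space is finite, a generic selection rule is simply an argmax over the candidate answers. The sketch below scores each option of an MCQA item by its conditional log-likelihood under an off-the-shelf causal language model; the checkpoint ("gpt2") and the prompt template are illustrative assumptions, not a method prescribed by the cited papers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM with a Hugging Face checkpoint works.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_log_likelihood(question: str, option: str) -> float:
    """Sum of token log-probs of `option` conditioned on the question prompt."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Next-token log-probs; position i-1 predicts token i, hence the shifts below.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    option_ids = full_ids[:, prompt_ids.shape[1]:]
    target_log_probs = log_probs[0, prompt_ids.shape[1] - 1:, :].gather(
        1, option_ids[0].unsqueeze(1)
    )
    return target_log_probs.sum().item()

def answer_mcq(question: str, options: list[str]) -> str:
    """Closed-ended selection: argmax over the finite option set."""
    return max(options, key=lambda opt: option_log_likelihood(question, opt))
```

The same pattern covers binary QA by setting `options = ["yes", "no"]`.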
2. Task Construction, Benchmarking, and Evaluation
Large-scale closed-ended QA benchmarks span multiple domains:
- Healthcare/Medicine: MCQA over clinical exam items (MedMCQA, PubMedQA, AfriMed-QA, CareQA-Close) is standard, with accuracy as the core metric. Label-frequency imbalance, regional bias, and specialty-specific error profiles are quantitatively tracked (Olatunji et al., 2024, Arias-Duart et al., 10 Feb 2025).
- Vision-Language: Closed-ended QA is often formulated as multi-class classification over images, as in DermaVQA-DAS (with the Dermatology Assessment Schema, DAS) and CTIS-Bench (using CAP-derived clinical report templates, CPRT). Per-question accuracy and balanced accuracy are reported, and answer sets are strictly aligned with clinical descriptors (Yim et al., 30 Dec 2025, Lu et al., 5 Jan 2026).
- Conversational QA: Systems generate a mixture of open, closed-ended (yes/no), and unanswerable question–answer pairs. MultiCQAG synthesizes such datasets with explicit answerability classifiers for automatic filtering (Hwang et al., 2022).
- General/Natural Language: Extractive QA datasets (SQuAD, NQ, HotpotQA) use exact-match (EM) and F1 as metrics, with closed-ended evaluation predicated on span selection (Lewis et al., 2019, Mei et al., 22 May 2025).
Evaluation for all task variants is dominated by accuracy measures: EM for span selection, plain accuracy for MCQA, and per-class variants (precision/recall/F1) when the classification is multi-label or imbalanced (Arias-Duart et al., 10 Feb 2025, Yim et al., 30 Dec 2025, Olatunji et al., 2024). Automated evaluation is a central advantage, supporting large-scale benchmarking without human raters (Arias-Duart et al., 10 Feb 2025).
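As a reference point, EM and token-level F1 for span selection can be computed in a few lines. The normalization below (lowercasing, stripping punctuation and articles) follows the widely used SQuAD evaluation convention; the function names are ours.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation/articles/extra space."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answer spans."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```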
3. Model Architectures and Methodological Innovations
Approaches to closed-ended QA are closely tied to the structure of the answer space:
- Span Extractors: Neural models (e.g., BERT-based) predict start and end indices via cross-entropy minimization over answer spans (Lewis et al., 2019); see the sketch after this list. Unsupervised and synthetic data generation (e.g., via cloze translation) can bootstrap extractive QA systems without labeled triples, yielding nontrivial F1 on SQuAD without a single human-annotated training instance.
- Meta-Learning for Closed-Book QA: The MetaQA framework uses a MAML-based meta-classifier to predict the question category and augments the main reader with a small support set of example QAs from the predicted category, boosting accuracy on hard standardized exams (e.g., ARC) to near the level of retrieval-based systems, despite lacking access to an external corpus (Zheng et al., 2020).
- LM-based Classification and Generation: Generative LM approaches (e.g., BART, GPT-4) can be adapted for closed-book QA by fine-tuning on question–answer pairs only, with specific methods (e.g., QA-bridge-tuning, explicit answer retrieval generation) introduced to mitigate catastrophic forgetting and improve use of memorized knowledge (Wang et al., 2021).
- Agent-Based Search: O²-Searcher couples an RL-trained policy with a local simulated search environment, optimizing a "factual reward" that credits only final, exactly matching answers. By leveraging retrieval-and-generation cycles, the system achieves state-of-the-art EM on multiple closed-ended QA benchmarks, even outperforming much larger models, through careful reward tuning and search-policy learning (Mei et al., 22 May 2025).
- Vision-Language Fusion: Models such as CTIS-QA exploit dual-stream encoders (global slide-level context via clustering, and local attention-guided patch perception) for pathology slide QA, while DermaVQA-DAS benchmarks LLM-integrated VQA systems on multiple-choice dermatology assessment tasks in bilingual settings (Yim et al., 30 Dec 2025, Lu et al., 5 Jan 2026). Performance varies across question type; model size and careful prompt engineering remain significant factors.
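To make the span-extraction objective in the first bullet concrete, the following is a minimal sketch of a BERT-style QA head: two linear projections produce per-token start and end logits, trained with cross-entropy against the gold span indices. This is the generic pattern mirrored by common Hugging Face implementations, not the exact architecture of any cited system.

```python
import torch.nn as nn
from transformers import AutoModel

class SpanExtractorHead(nn.Module):
    """Generic BERT-style extractive QA head: predicts start/end indices
    over the context tokens via a single linear projection to two logits."""

    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.qa_outputs = nn.Linear(hidden, 2)  # start and end logits per token

    def forward(self, input_ids, attention_mask,
                start_positions=None, end_positions=None):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                      # [batch, seq_len, hidden]
        logits = self.qa_outputs(hidden_states)  # [batch, seq_len, 2]
        start_logits, end_logits = logits.unbind(dim=-1)
        if start_positions is not None:
            # Cross-entropy over token positions, as described in the text.
            loss_fn = nn.CrossEntropyLoss()
            loss = (loss_fn(start_logits, start_positions)
                    + loss_fn(end_logits, end_positions)) / 2
            return loss
        return start_logits, end_logits
```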
4. Synthetic Data Generation and Quality Control
Synthetic generation is critical both for data-efficient model training and for scalable evaluation in domains where labeled data is scarce:
- Unsupervised QA Data: Pipelines synthesize (context, answer, question) triples by random context sampling, linguistic answer tagging, cloze formation, and translation into natural questions (sketched after this list). Unsupervised methods can yield extractive QA models that outperform early supervised baselines, e.g., 56.4 F1 on SQuAD v1 with zero manual annotation (Lewis et al., 2019).
- Conversational Data Synthesis: MultiCQAG intersperses open, closed-ended (binary), and unanswerable questions, using a two-phase ALBERT-based answerability classifier (context-level and passage-level) to filter misaligned or spurious Q–A pairs. Closed-ended synthesis alone boosts closed-ended F1 from 4.2 to 74.6, nearly bridging the gap with human-annotated baselines (Hwang et al., 2022).
- Vision-Language Data Templates: Design of closed-ended schemas (DAS, CPRT) imposes clinically relevant structure and enables standardization in annotation and model evaluation. Majority-vote adjudication, clinician curation, and bilingual question presentation enforce annotation quality and generalizability in VQA (Yim et al., 30 Dec 2025, Lu et al., 5 Jan 2026).
- MCQA Benchmark Curation in Low-Resource Settings: Datasets such as AfriMed-QA are curated from diverse geographic and specialty sources, with explicit annotation of rationales and rigorous clinician vetting. Coverage is balanced to sample key specialty areas and interrogate region-specific phenomena (Olatunji et al., 2024).
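A heavily simplified sketch of the unsupervised triple-generation pipeline from the first bullet: tag a candidate answer in a sampled context, mask it to form a cloze, and rewrite the cloze as a question. Real pipelines use richer answer tagging and a learned cloze-to-natural-question translation model (Lewis et al., 2019); the spaCy entity tagging and the template rewrite below are illustrative stand-ins.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with NER + sentencizer

def make_synthetic_triple(context: str):
    """Produce one (context, answer, question) triple, or None if no entity is found."""
    doc = nlp(context)
    entities = list(doc.ents)
    if not entities:
        return None
    answer = random.choice(entities)
    # Cloze formation: mask the chosen answer span within its sentence.
    sentence = answer.sent.text
    cloze = sentence.replace(answer.text, "[MASK]")
    # Illustrative template rewrite; the cited work instead trains an
    # unsupervised cloze-to-natural-question translation model for this step.
    wh_word = {"PERSON": "Who", "DATE": "When", "GPE": "Where"}.get(answer.label_, "What")
    question = f"{wh_word} is referred to by the masked phrase in: '{cloze}'?"
    return context, answer.text, question
```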
5. Empirical Performance, Failure Modes, and Analysis
Across closed-ended QA settings, empirical trends are robust and well-characterized:
| Domain | Closed-Ended QA Setup | Best Reported Accuracy/F1 | Reference |
|---|---|---|---|
| Pathology | Slide-level VQA, 20 closed items, CTIS-QA | 0.630 (avg, best model, CTIS-QA) | (Lu et al., 5 Jan 2026) |
| Dermatology VQA | DAS-multiple-choice, images, 27 questions | 0.798 (o3), 0.796 (GPT-4.1, ALL avg) | (Yim et al., 30 Dec 2025) |
| Medicine (Africa) | 4–5 option MCQ, 4039 items, AfriMed-QA | 0.793 (GPT-4o, zero-shot) | (Olatunji et al., 2024) |
| Clinical MCQA | CareQA-Close (MIR exams), 4 option MCQ | ~0.84 (Meta-Llama-3.1-70B-Instruct) | (Arias-Duart et al., 10 Feb 2025) |
| Conversational QA | Yes/no generation F1 (MultiCQAG, synthetic) | 74.6/72.2 (closed-ended F1, AC filtering) | (Hwang et al., 2022) |
| Science (Closed-book) | ARC MCQA, MetaQA framework | 64.2% (best, 4-way MC) | (Zheng et al., 2020) |
| Open-domain | NQ, HotpotQA, TriviaQA (EM, O²-Searcher) | 0.44–0.60 (O²-Searcher, 3B model) | (Mei et al., 22 May 2025) |
- Model Scale and Domain Adaptation: Larger, generalist LLMs (GPT-4o, Llama3-405B) consistently outperform domain-specialized or smaller parameter-count models in MCQA (Olatunji et al., 2024, Arias-Duart et al., 10 Feb 2025). However, size is not a panacea: careful prompt design, template alignment, and localization remain essential.
- Domain-specific Challenges: Accuracy falls sharply when benchmarks are regionally diverse (AfriMed-QA), specialty-specific, or require identification of rare or underrepresented categories (CTIS-Bench, DermaVQA-DAS). Structured clinical templates (CPRT, DAS) are crucial in vision QA for reducing hallucination and keeping models focused on image-grounded patterns (Lu et al., 5 Jan 2026, Yim et al., 30 Dec 2025).
- Synthetic vs. Human-Labeled Data: Synthetic closed-ended data can approach or rival human-annotated datasets when paired with effective filtering (e.g., hierarchical answerability classification). However, full equivalence is typically not attained; error modes frequently involve misaligned contexts, spurious answerable pairs, or rare class confusion (Hwang et al., 2022).
- Failure Modes: Common sources of error include catastrophic forgetting in parametric LMs (closed-book QA, BART), misclassification of rare or fine-grained classes, scale-induced overfitting or underfitting in meta-learned methods, and local-expertise bias or regional under-generalization in international benchmarks (Wang et al., 2021, Olatunji et al., 2024).
6. Limitations, Correlations with Open-Ended QA, and Best Practices
Closed-ended and open-ended QA serve complementary assessment roles: MCQA accuracy correlates weakly or even negatively with open-ended generation quality (summarization, multi-label prediction, free generation), except in specialized settings such as note-taking. Comprehensive system evaluation in high-stakes domains should therefore deploy both closed- and open-ended axes, cross-benchmark correlation analysis, and confidence measurement to identify blind spots (Arias-Duart et al., 10 Feb 2025).
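Such a correlation analysis reduces to comparing per-model score vectors across benchmarks. A minimal sketch, with placeholder scores standing in for real leaderboard numbers:

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores on a closed-ended benchmark (MCQA accuracy)
# and an open-ended one (e.g., rated summarization quality); placeholders only.
closed_scores = [0.84, 0.79, 0.71, 0.66, 0.62]
open_scores   = [0.58, 0.63, 0.61, 0.55, 0.60]

rho, p_value = spearmanr(closed_scores, open_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")  # weak rho signals a blind spot
```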
Best practices for closed-ended QA evaluation and benchmark construction include:
- Curate multi-domain, multi-specialty closed-ended datasets to ensure coverage diversity;
- Report confidence intervals on accuracy (a sketch follows this list), and conduct per-class or per-geography breakdowns to reveal hidden deficiencies;
- Employ multi-stage or template-driven data generation with explicit answerability filtering to maximize synthetic data quality;
- Tailor benchmarks to domain idiosyncrasies (regional content, low-resource constraints, rare findings);
- In vision-language QA, reinforce strict template alignment, balanced per-class sampling, and dual-path reasoning systems (Yim et al., 30 Dec 2025, Lu et al., 5 Jan 2026, Olatunji et al., 2024).
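For the confidence-interval recommendation in the list above, a Wilson score interval is a standard choice for a bounded accuracy estimate; this is a generic statistical sketch, not code from any cited benchmark.

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a benchmark accuracy estimate (95% by default)."""
    if total == 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half_width, center + half_width

# Example: a hypothetical run scoring 793 of 1,000 items correct.
low, high = wilson_interval(793, 1000)
print(f"accuracy 0.793, 95% CI [{low:.3f}, {high:.3f}]")
```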
7. Open Challenges and Future Directions
Key open problems in closed-ended QA research include:
- Scalability of Memorization and Retrieval: As demonstrated in closed-book QA (Wang et al., 2021), parametric LMs struggle to reliably retrieve stored facts beyond a modest document set. Explicit retrieval procedures ("bridge tuning"), task segmentation, and memory-augmented models are candidates for overcoming these limitations.
- Robustness to Distribution Shift: Benchmarks such as AfriMed-QA highlight model vulnerability to region-specific content and style. Explicit localization, demographic balance, and rationalized MCQA curation are crucial for equitable QA systems (Olatunji et al., 2024).
- Vision-Language Integration: High-accuracy closed-ended VQA requires synergistic multimodal fusion architectures with both global and local context capture (e.g., clustering-based and attention streams) and template-grounded outputs. Extending these systems to rare conditions, new imaging modalities, and multi-temporal clinical scenarios remains a frontier (Lu et al., 5 Jan 2026).
- Automated Data Quality Assurance: For synthetic data, advanced hierarchical answerability models and NLI-based filters are necessary to prevent error propagation in conversational and complex QA pipelines (Hwang et al., 2022).
- Complementarity with Open-Ended Methods: The complementary strengths of closed-ended and open-ended QA suggest hybrid systems and evaluation protocols for comprehensive, high-precision language understanding and reasoning (Arias-Duart et al., 10 Feb 2025, Mei et al., 22 May 2025).
Future work is likely to further integrate memory retrieval, multimodal reasoning, meta-learning, and template-driven frameworks to expand the reach, reliability, and domain adaptation of closed-ended QA systems across scientific, clinical, and open-domain applications.