Extractive QA Tasks Overview

Updated 30 May 2026

Extractive Question Answering tasks are defined by selecting a contiguous span from a context that precisely answers a given question.
Modern models use transformer architectures with start/end pointer techniques and dynamic query vectors to improve span selection and overcome low-data challenges.
These tasks are critical in domains like cross-lingual processing and biomedical extraction, with evaluations highlighting significant performance variances under diverse conditions.

Extractive Question Answering tasks are a central class of natural language understanding problems that require a model to identify and extract a contiguous text span from a provided context that directly answers a given question. This paradigm is foundational in reading comprehension, information extraction, and large-scale document processing, and serves as a canonical evaluation ground for both neural language models and end-to-end QA pipelines. Extractive QA tasks prioritize answer provenance and interpretability, which has driven advances in span-finding algorithms, model architectures, cross-lingual evaluation, semi-supervised learning, and practical deployment at scale.

1. Task Definition and Formal Properties

Extractive Question Answering (Extractive QA) tasks are formally defined as follows: given an input context passage $T = (T_0, \ldots, T_{n-1})$ and a natural-language question $Q$, the model must select a contiguous subspan $a = T_{i:i+j}$ of $T$ that best answers $Q$ (Castel et al., 2021). The core desiderata for an extractive QA system are:

Extractiveness: The predicted answer must be an exact span within the context, not a paraphrase or synthesis.
Exactness/Optimality: The selected span should maximize the probability $P(a | T, Q)$ under the model’s learned distribution.

In modern architectures, especially those built on encoder-only or encoder–decoder transformers, this is cast as producing probability distributions over possible start and end indices or, in some cases, as explicit global scoring over all candidate spans (Lee et al., 2016, Luo et al., 2022). Evaluation is typically based on token-level or character-level Exact Match (EM) and F1 overlap between predicted and gold answer spans, with extensions for cross-lingual or biomedical cases as needed (Lewis et al., 2019, Yoon et al., 2021).
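The two metrics named above can be illustrated with a minimal sketch. It assumes a simplified normalization (lowercasing and whitespace tokenization only); benchmark evaluation scripts typically apply further normalization, such as stripping punctuation and articles, which is omitted here. The function names are illustrative, not taken from any cited work.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the lowercased token sequences are identical, else 0."""
    return int(prediction.lower().split() == gold.lower().split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty is a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "the Eiffel Tower" against the gold span "Eiffel Tower" is not an exact match, but it earns partial F1 credit because two of the three predicted tokens overlap with the gold answer.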

2. Model Architectures and Span Selection Methods

Classical Span Classifiers

The dominant approach encodes $[CLS]$ + question + $[SEP]$ + passage + $[SEP]$ through a transformer (e.g., BERT, RoBERTa) and applies two linear projections to the token representations to predict start and end scores via softmax (Pearce et al., 2021, Luo et al., 2022). The most likely span $(\hat{i}, \hat{j})$ is selected to maximize $P_{\text{start}}(i) \cdot P_{\text{end}}(j)$, possibly with $i \le j$ enforced.
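A minimal sketch of this selection step follows, assuming the start and end scores are already available as plain lists of logits over passage tokens. The function name `best_span` and the `max_answer_len` cap are illustrative choices, not part of any cited system. The search picks the pair that maximizes the product of start and end probabilities while requiring that the end index does not precede the start index.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing P_start(i) * P_end(j) with i <= j.

    Spans longer than max_answer_len tokens are skipped, which bounds the
    search at O(n * max_answer_len) instead of O(n^2).
    """
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, -1.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best
```

With start logits peaking at position 1 and end logits peaking at position 2, the function returns the span (1, 2). When the independent maxima would give an end before the start, the constraint forces a different, valid pair.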

Fixed vs Dynamic Queries

Vanilla models use learned, fixed start/end query vectors for extraction (the static-query paradigm). DyREx (Zaratiana et al., 2022) replaces these with input-dependent query vectors refined by a lightweight transformer decoder, allowing context-sensitive adaptation per example and significantly improving performance in low-data regimes.

Architecture	Span Selection Paradigm	Context Adaptivity
BERT, RoBERTa	Start/End linear heads	Low (fixed queries)
DyREx	Dynamic query vectors	High (input-dependent, interdependent queries)
RaSoR (BiLSTM)	Enumerate all spans	Medium (per-span embedding)
SeqTag (BioEQA)	BIO sequence tagging	n/a (multi-span extraction)

Efficient span selection is achieved by limiting maximum answer length, sharing passage encodings across candidates (Lee et al., 2016), and, in generative models, defining decoding constraints (e.g., via “exact-extract” vs. greedy decoding) (Castel et al., 2021).

Multi-Span Extraction

Biomedical questions frequently require multi-span answers. Sequence tagging models assign BIO labels to each token, allowing extraction of arbitrary numbers of discontiguous spans without ad hoc rule-based postprocessing (Yoon et al., 2021). This contrasts with classical approaches that depend on start/end pointers and must threshold or enumerate candidate spans.
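As an illustration of how BIO labels translate into answer spans, here is a minimal decoding sketch. It assumes the tagger emits one of the labels "B", "I", or "O" per token; the function name and the choice to drop a stray "I" that has no preceding "B" are assumptions made for this example.

```python
def bio_to_spans(tokens, tags):
    """Collect every maximal B/I run as one answer span (joined by spaces)."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new span starts; close any span that is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open span: close and reset.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = "Symptoms include fever , dry cough and fatigue".split()
tags = ["O", "O", "B", "O", "B", "I", "O", "B"]
answers = bio_to_spans(tokens, tags)  # ["fever", "dry cough", "fatigue"]
```

Because each "B" opens a new span, the number of extracted answers is determined by the tag sequence itself rather than by a score threshold or a fixed span count.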

3. Decoding Algorithms, Training, and Optimization

Decoding Strategies

Greedy Decoding: Left-to-right autoregressive decoding without a global extractiveness constraint. Its outputs are not guaranteed to be exact spans of the context, but it is often sufficient after limited fine-tuning (Castel et al., 2021).
Exact-Extract Algorithm: Dynamic programming to compute the most probable extractive span according to the autoregressive model’s full joint probability, requiring $Q$ 2 time (Castel et al., 2021). Dramatically outperforms greedy approaches in zero-shot and very low-shot regimes, but the difference narrows after 16–128 supervised examples.
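The contrast between the two strategies can be made concrete with a toy example. The sketch below is not the dynamic-programming algorithm of Castel et al.; it is a brute-force stand-in that scores every candidate span under a hypothetical, hand-written conditional distribution (`TOY_MODEL`) and picks the most probable one. All names and probabilities are invented for illustration.

```python
import math

EOS = "<eos>"

# Hypothetical toy distribution P(next token | prefix). A real system would
# query a fine-tuned autoregressive decoder conditioned on passage and question.
TOY_MODEL = {
    (): {"london": 0.40, "paris": 0.35, "the": 0.25},
    ("london",): {EOS: 1.0},
    ("paris",): {EOS: 0.9, "is": 0.1},
    ("paris", "is"): {EOS: 1.0},
    ("the",): {"capital": 0.6, EOS: 0.4},
    ("the", "capital"): {EOS: 1.0},
}


def log_prob(prefix, token):
    """Log-probability of `token` after `prefix`; -inf if the model assigns 0."""
    p = TOY_MODEL.get(tuple(prefix), {}).get(token, 0.0)
    return math.log(p) if p > 0 else float("-inf")


def greedy_decode(max_len=5):
    """Pick the locally most probable token at each step until EOS."""
    output = []
    while len(output) < max_len:
        dist = TOY_MODEL.get(tuple(output), {EOS: 1.0})
        token = max(dist, key=dist.get)
        if token == EOS:
            break
        output.append(token)
    return output


def best_extractive_span(context, max_len=5):
    """Brute force: score each context span by its full joint log-probability."""
    best_span, best_score = None, float("-inf")
    for i in range(len(context)):
        for j in range(i + 1, min(i + max_len, len(context)) + 1):
            span = context[i:j]
            score = sum(log_prob(span[:k], span[k]) for k in range(len(span)))
            score += log_prob(span, EOS)  # the span must also terminate here
            if score > best_score:
                best_span, best_score = span, score
    return best_span


context = ["paris", "is", "the", "capital"]
greedy_answer = greedy_decode()
extractive_answer = best_extractive_span(context)
```

In this toy setting greedy decoding emits "london", which does not occur in the context, while the span-constrained search returns "paris", the most probable output that is an actual span of the passage.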

Training Objectives

Training typically minimizes cross-entropy loss over the gold start and end indices or BIO sequence tags. Joint or auxiliary losses are frequently used:

Context prediction (BLANC): Auxiliary block-attention head trained to select the relevant context block, improving disambiguation in passages with multiple identical answer strings (Seonwoo et al., 2020).
Cloze Pretraining: Synthetic question-answer pairs generated from unlabeled documents substantially reduce label requirements (50%+ F1 on SQuAD achieved with $Q$ 31k labels) (Dhingra et al., 2018).
Multi-task Learning: Joint training on diverse datasets gives consistent improvements (+0.5–1 F1 for extractive, +8 F1 for generative heads) (Luo et al., 2022).
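The basic start/end cross-entropy objective described at the top of this subsection can be sketched as follows. The code assumes one training example with a single gold span and averages the two negative log-likelihoods, which is one common convention (summing them is another). The auxiliary objectives listed above are not modeled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def span_loss(start_logits, end_logits, gold_start, gold_end):
    """Average negative log-likelihood of the gold start and end indices."""
    start_nll = -log_softmax(start_logits)[gold_start]
    end_nll = -log_softmax(end_logits)[gold_end]
    return 0.5 * (start_nll + end_nll)
```

With uniform logits over four positions the loss equals log 4, the entropy of a uniform guess. It falls toward zero as the model concentrates probability on the gold start and end positions.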

4. Extensions: Semi-Extractive, Cross-Lingual, and Large-Scale QA

Cross-Lingual and Multilingual QA

The MLQA benchmark (Lewis et al., 2019) formalizes extractive QA across seven languages, evaluating zero-shot (English-trained) and cross-lingual transfer. Results highlight a persistent gap (∼20–30 F1) between monolingual and zero-shot performance, especially for low-resource languages (Arabic, Hindi). Monolingual models with custom tokenizers outperform large multilingual models for Indic languages, showing the importance of vocabulary and morpho-syntactic alignment (Thirumala et al., 2022).

Semi-Extractive and Multi-Source QA

SEMQA tasks require blending quoted factual spans from multiple sources with freely generated connecting text (Schuster et al., 2023). Such “semi-extractive” answers combine the verifiability of extractive QA with the fluency of abstraction and are evaluated on fine-grained metrics capturing both extractive accuracy and overall coherence.

Large-Scale and Table-Filling QA

FabricQA-Extractor operationalizes extractive QA as a table population task at Wikipedia/biomedical scale by combining a standard passage retriever and span extractor with a “Relation Coherence” module. This module enforces schema-level consistency between extracted objects and their supposed subjects via forward–backward QA checks, yielding measurable performance gains over standard open QA pipelines (Wang et al., 2024).

5. Comparison of Extractive and Generative QA Models

Systematic studies reveal distinctive trade-offs between extractive and generative QA readers, even when both use similar transformer backbones (Luo et al., 2022, Mallick et al., 2023, Xu et al., 2021):

Short Contexts and Domain Shift: Extractive readers consistently outperform generative readers on short passages ( $Q$ 4600 tokens) and show superior generalization to out-of-domain (OOD) benchmarks.
Long Contexts: Generative models (full encoder–decoder) can excel on very long or document-level contexts, though at significant computational cost and risk of non-extractive (hallucinated) answers.
Rare Tokens: Extractive decoders robustly copy rare and OOV answer spans verbatim, avoiding the Unicode corruption or unknown-token emissions common in generative models.

Generative models can be adapted for extractive use by generating answer indices or leveraging internal attention maps to locate probable span boundaries, often with competitive or state-of-the-art performance on multi-span, long-form, or sentence-level benchmarks (Mallick et al., 2023, Xu et al., 2021).

6. Practical Considerations and Applications

Key deployment and research considerations for extractive QA systems include:

Data Scarcity: Cloze-pretraining, synthetic QA generation with explainable models (XAIQA uses classifier attributions), and self-supervised learning via span-copying objectives accelerate convergence and enable strong results in low-resource settings (Dhingra et al., 2018, Stremmel et al., 2023).
Multi-Occurrence and Disambiguation: Auxiliary context-block prediction and joint loss formulations (e.g., BLANC) mitigate common errors when answer strings occur multiple times (Seonwoo et al., 2020).
Biomedical and Multi-Span Domains: Sequence tagging (BIO labeling) outperforms classical single-span pointer architectures, especially for domains demanding recall of variable-length answer lists (Yoon et al., 2021).
Scalability: Industrial QA architectures integrate fast retrieval (BM25 or dense), efficient extractive readers, followed by ranking and coherence modules to sustain throughput (1–2 Q/s at $Q$ 6 passages) (Wang et al., 2024).
Summarization and Hierarchical QA: Block-level extractive summarizers, with iterative selection and explicit question-prefixing, handle document-level and relationship-centric questions in structured data settings (Gu et al., 2023).

In summary, extractive question answering is characterized by its formal span selection framework, diversity of span-finding model architectures, robust decoding strategies, and growing generalization to cross-lingual, multi-document, and large-scale settings. Core challenges persist in multi-span extraction, domain adaptation, and hybrid extractive-generative reasoning, but methodological advances (context-adaptive queries, auxiliary context supervision, schema-level coherence modeling) continue to drive the field forward.