
Open Domain Question Answering

Updated 5 February 2026
  • Open-domain Question Answering is a natural language processing task that retrieves relevant passages from large corpora and extracts or generates accurate answers.
  • It employs a dual-module design with a retriever to select documents and a reader to comprehend and extract precise responses from the retrieved evidence.
  • Recent advances include hybrid dense-sparse retrieval, query and context augmentation, and integration with large language models to boost both accuracy and efficiency.

Open-domain Question Answering (OpenQA) is a core research task in natural language processing that aims to generate accurate, natural language answers to factoid or complex questions posed in free text, based on information from a large, unstructured document collection (e.g., Wikipedia). Unlike closed-book or single-document QA, OpenQA systems must first retrieve relevant knowledge from corpora spanning millions of passages and then comprehend this evidence to pinpoint or generate the desired answer. Recent years have witnessed substantial advances in both the effectiveness and efficiency of OpenQA pipelines, including innovations in retrieval, reading, knowledge integration, and robustness analysis across a variety of domains and languages.

1. Foundations and Problem Formulation

OpenQA is formally defined as follows: given a natural language question q and a large-scale unstructured corpus C (typically consisting of millions of documents or passages), the system returns an answer a, often with a supporting evidence span. The search space is orders of magnitude greater than in closed-domain QA, requiring two tightly coupled modules:

  • Retriever: Selects a manageable subset of passages D ⊂ C that are likely to contain the answer.
  • Reader: Performs machine reading comprehension (MRC) over D to extract (extractive), generate (generative), or select (multiple-choice) the answer a.
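The two-module formulation above can be sketched as a generic pipeline. A minimal sketch, assuming toy stand-ins: `overlap_retrieve` and `toy_read` are placeholder implementations (a real system would use BM25/DPR and a trained MRC model):

```python
from typing import Callable

def open_qa(question: str,
            corpus: list[str],
            retrieve: Callable[[str, list[str], int], list[str]],
            read: Callable[[str, list[str]], str],
            k: int = 5) -> str:
    """Generic retriever-reader pipeline: select top-k passages, then read."""
    passages = retrieve(question, corpus, k)   # D subset of C, |D| = k
    return read(question, passages)            # extract/generate the answer a

# Toy stand-ins: word-overlap retriever and a first-passage "reader".
def overlap_retrieve(q, corpus, k):
    score = lambda d: len(set(q.lower().split()) & set(d.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def toy_read(q, passages):
    return passages[0] if passages else ""

corpus = ["Paris is the capital of France.", "The sky is blue."]
print(open_qa("What is the capital of France?", corpus, overlap_retrieve, toy_read))
```

The interface mirrors the formulation directly: any retriever/reader pair with these signatures can be swapped in without touching the pipeline.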

The performance bottleneck is principally retrieval: if no retrieved passage contains the answer, downstream comprehension is moot (Mao et al., 2020). Let q be the question, d a candidate passage, and S(q,d) a similarity score; retrieval accuracy directly constrains overall EM/F1.

Two main retrieval paradigms exist:

  • Sparse retrieval (BM25, TF–IDF): S_\text{BM25}(q,d)=\sum_{t \in q}\text{IDF}(t)\cdot \frac{(k_1+1)\,f_{t,d}}{f_{t,d}+k_1(1-b+b\cdot|d|/\mathrm{avgdl})}
  • Dense retrieval (DPR): Learns encoders f_q(q), f_d(d) such that S_\text{DPR}(q,d)=f_q(q)\cdot f_d(d)

Hybrid strategies further combine these signals to boost recall (Zhang et al., 2022, Levy et al., 2021).
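The two scoring functions and their hybrid combination can be sketched directly from the formulas above; the weighted interpolation in `hybrid_score` is one common fusion choice, not the specific scheme of any cited system:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 for one document, matching the formula above."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = {t: doc_terms.count(t) for t in set(doc_terms)}
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)                # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf.get(t, 0)                                     # term frequency f_{t,d}
        score += idf * (k1 + 1) * f / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

def dense_score(q_vec, d_vec):
    """DPR-style inner product of encoder outputs."""
    return sum(a * b for a, b in zip(q_vec, d_vec))

def hybrid_score(q_terms, d_terms, corpus, q_vec, d_vec, alpha=0.5):
    """Interpolate sparse and dense signals; alpha trades lexical vs. semantic match."""
    return alpha * bm25_score(q_terms, d_terms, corpus) + (1 - alpha) * dense_score(q_vec, d_vec)
```

In practice the two scores live on different scales, so the interpolation weight (or a score normalization step) is tuned on held-out retrieval recall.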

2. Retriever–Reader Architectures and Advances

The modern OpenQA architecture is predominantly the Retriever–Reader (R–R) pipeline. The retriever uses either sparse bag-of-words, learned dense encodings, or their fusion to minimize the candidate set. The downstream reader, typically an extractive span predictor (e.g., BERT, ELECTRA) (Semnani et al., 2020, Wu et al., 2021), a generative model (e.g., T5, BART), or a fusion network (Ma et al., 2021), processes the retrieved passages to output the final answer.

Recent advances include:

  • Dense-Sparse Fusion: DenSPI's hybrid index yields >40× CPU speedups with competitive accuracy by pre-indexing both phrase-level dense and sparse features for all possible spans over Wikipedia (Seo et al., 2019).
  • Query and Context Augmentation: Generation-Augmented Retrieval (GAR) enriches questions with generative LM-produced context (answer, sentence, title) to expose retrieval targets that may be lexically divergent, with fused retrieval lists providing consistent +3–5% recall gains (Mao et al., 2020).
  • Multi-hop and Knowledge Fusion: Multi-step retrieval pipelines chain fact retrieval (symmetric-difference queries), BERT-based semantic knowledge re-ranking, and fusion networks for multi-hop QA (OpenBookQA, QASC) (Banerjee et al., 2020).
  • Multi-format Evidence: UDT-QA unifies structured (tables, KB subgraphs) and unstructured text via a data-to-text “verbalizer,” enabling retrieval and reading over a joint expanded index (Ma et al., 2021).
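The GAR-style query augmentation above reduces to a simple transformation: expand the question with LM-generated text before sparse retrieval. A minimal sketch, where `stub_lm` is a hypothetical stand-in for the generative model:

```python
def augment_query(question: str, generate_context) -> str:
    """GAR-style expansion: append LM-generated contexts (answer candidate,
    supporting sentence, title) so lexically divergent passages become reachable."""
    contexts = generate_context(question)
    return " ".join([question] + contexts)

# Hypothetical stub standing in for a generative LM.
stub_lm = lambda q: ["Paris", "Paris is the capital of France"]
expanded = augment_query("capital of France?", stub_lm)
print(expanded)
# The expanded query is then fed to BM25; in GAR, result lists from queries
# augmented with different context types are fused into a single ranking.
```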

3. Efficiency, Scalability, and Practical Constraints

Scaling OpenQA to practical deployments demands trade-offs across memory, speed, and answer accuracy.

Key techniques and findings:

  • Index Compression: Product quantization (PQ), binary hashing (BPR), and dimensionality reduction decrease index memory by 20–65x for dense retrieval at marginal loss (≤5% EM) (Zhang et al., 2022, Yang et al., 2021).
  • Model Compression: Structure pruning (MobileBERT) and knowledge distillation shrink model checkpoint size (<0.5 GB) at a cost of 5–10 EM points unless optimized (Zhang et al., 2022).
  • Unified Architectures: Parameter-sharing (retriever–reader unification) and minimal system design (all components in MobileBERT, INT8 embedding) demonstrate viability on sub-500 MB edge hardware with only ~2–4% absolute EM loss (Yang et al., 2021).
  • Hybrid Dense–Sparse Pipelines: Hybrid pipelines, as in Mindstone (BM25 + BERT-ranker + BERT-reader), can be tuned for sub-second latency and directly leverage low-resolution (click-derived) supervision (Semnani et al., 2020).
  • Retriever-only Models: Phrase- or QA-pair retrievers (DensePhrases, RePAQ) directly index answer candidates, achieving high throughput with increased index requirements (Zhang et al., 2022).
  • Deployment in Non-English and Low-Resource Contexts: Machine translation–based weak supervision (SQuAD-TR for Turkish), late-interaction dense retrievers (ColBERT-QA), and local Wikipedia indexing enable competitive OpenQA in typologically distant, resource-scarce languages (Budur et al., 2024).

Quantitative trade-offs show that R–R models yield the best EM but at significant memory/latency cost; lightweight retriever-only and generator-only approaches improve efficiency at notable accuracy loss (Zhang et al., 2022).
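The compression idea behind binary hashing (as in BPR) can be illustrated with a rough sketch: sign-binarize dense embeddings into one bit per dimension (~32× smaller than 32-bit floats) and rank passages by Hamming distance. This is an untrained sign hash for illustration, not the learned hashing of the cited work:

```python
def binarize(vec):
    """Sign hashing: one bit per dimension instead of a 32-bit float."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def search(query_vec, doc_codes, k=2):
    """Rank pre-hashed passages by Hamming distance to the hashed query."""
    q = binarize(query_vec)
    return sorted(range(len(doc_codes)), key=lambda i: hamming(q, doc_codes[i]))[:k]

docs = [[0.9, -0.2, 0.4], [-0.5, 0.8, -0.1], [0.7, 0.1, 0.6]]
codes = [binarize(d) for d in docs]
print(search([1.0, -0.1, 0.5], codes, k=1))  # → [0]: doc 0 shares the query's sign pattern
```

Systems like BPR use such codes for a cheap first pass and re-rank the shortlist with full-precision embeddings, which is where most of the accuracy is recovered.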

4. Knowledge Integration and Structured Reasoning

Broadening OpenQA's coverage and robustness increasingly requires integration of structured and semi-structured sources:

  • Hybrid KBQA–Text QA: OpenQA systems achieve higher accuracy by integrating structured KBQA (semantic parsing, entity linking, neural relation matching) and unstructured neural MRC over text (Wu et al., 2021).
  • Data-to-Text Verbalization: Unifies retrieval over text, tables, and knowledge bases, with empirical gains on Natural Questions and WebQuestions; verbalized knowledge is preferred downstream (Ma et al., 2021).
  • Knowledge-Aided Retrieval and Reranking: KAQA uses question–document and document–document graphs built from external KBs (WordNet, Freebase, ConceptNet) for both retrieval and answer reranking, yielding +5 F1 over strong BERT baselines (Zhou et al., 2020).
  • Web Tables as Evidence: Adaptation of deep semantic similarity models and table-quality features supports high-precision table-based answering for both factoid and non-factoid queries, deployed at Web scale (Chakrabarti et al., 2020).

Multi-hop pipelines, structured knowledge selection, and answer fusion layers are critical in science, medical, and emergent domains requiring explicit reasoning or comprehensive evidence assembly (Banerjee et al., 2020, Jin et al., 2020, Levy et al., 2021).

5. Robustness, Generalization, and Special Domains

Evaluating and improving OpenQA robustness is a growing focus:

  • Contrast Consistency: Dense retrievers (e.g., DPR) fail on minimally edited questions (MEQs) that change the answer, retrieving the same passages as for the original question ∼70% of the time. A query-side contrastive loss that discriminates between paraphrases and answer-changing edits improves MEQ EM by +2–3 points without degrading standard benchmarks (Zhang et al., 2023).
  • Conversational OpenQA: Multifaceted improvements include KL-regularized retriever training, lightweight post-rankers, and curriculum learning to bridge the gold-passage train/inference gap, with substantial F1 gains on OR-QuAC (Liang et al., 2022).
  • Low-Resource and Emergent Domains: Modular pipelines with domain-adapted retrievers/readers, explicit diversity, and transfer learning (PubMedBERT, BioBERT) facilitate rapid deployment for pandemic and low-resource settings with minimal supervision (Levy et al., 2021, Budur et al., 2024).
  • Temporal and Archival QA: Construction of large-scale, temporally aware QA benchmarks such as ArchivalQA over historical news enables evaluation of ODQA pipelines under vocabulary drift, relative time expressions, and diachronic knowledge gaps (Wang et al., 2021).

Evaluation on noisy, ambiguous, or adversarial sets reveals design flaws and prevents overfitting to standard benchmarks. Approaches such as reciprocal rank fusion, answer-span aggregation, and cross-answer reasoning further improve robustness in realistic deployments.
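Reciprocal rank fusion, mentioned above as a robustness technique, merges ranked lists from multiple retrievers by summing 1/(k + rank) per passage; a minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of passage ids: score(d) = sum over lists of 1/(k + rank_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two retrievers partially disagree; RRF rewards consistently high placement.
sparse = ["p1", "p2", "p3"]
dense  = ["p1", "p4", "p2"]
print(reciprocal_rank_fusion([sparse, dense]))  # → ['p1', 'p2', 'p4', 'p3']
```

Because RRF uses only ranks, it sidesteps the score-scale mismatch between sparse and dense retrievers, which is why it is a popular default for heterogeneous ensembles.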

6. Enhancing OpenQA with LLMs and Generation

With the emergence of LLMs, OpenQA explores new paradigms for knowledge integration and answer controllability:

  • Retriever-Augmented Generation: Frameworks such as GenKI retrieve relevant passages via DPR, fine-tune LLMs (LLM-KB, GLM-6B, LLaMA-65B) to absorb retrieved knowledge, and apply controllable generation and consistency scoring to adapt answer format. This yields strong performance on TriviaQA, MSMARCO, and CMRC2018, with near-linear dependence between knowledge recall and answer EM (Shen et al., 26 May 2025).
  • Closed-Book Context Generation: Context generation via prompting (CGAP) directs closed-book LMs at context production prior to answer prediction. Marginalizing over multiple sampled contexts, this approach can achieve or surpass open-book EM, particularly with 530B-parameter models (Su et al., 2022).
  • Fusion Architecture Trends: Layering generative augmentation, hybrid retrieval, and deep answer fusion leads OpenQA pipelines toward architectures that fuse generation, retrieval, reranking, and reasoning in a tightly coupled, end-to-end optimizable framework (Mao et al., 2020, Shen et al., 26 May 2025).
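The CGAP-style marginalization over sampled contexts can be approximated by majority voting over the answers predicted from independently generated contexts. In this sketch, `sample_context` and `predict` are hypothetical stand-ins for LM calls:

```python
from collections import Counter

def marginalized_answer(question, sample_context, predict, n_samples=5):
    """Sample n contexts from a closed-book LM, predict one answer per context,
    and return the most frequent answer (a vote-based marginalization)."""
    answers = [predict(question, sample_context(question)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Voting makes the prediction robust to individual hallucinated contexts: a wrong context must dominate the sample pool, not merely appear once, to flip the answer.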

7. Datasets, Evaluation Metrics, and Open Challenges

OpenQA benchmarking relies on datasets ranging from general (Natural Questions, TriviaQA, SQuAD-Open) to specialized (COVID-QA, MedQA, ArchivalQA) and multilingual (SQuAD-TR, XQuAD-TR). Core metrics include Exact Match (EM), F1 (token-level overlap), retrieval recall (R@k), and passage MRR. Fuzzy match metrics and context marginalization are adopted for low-resource and non-standard answer contexts (Levy et al., 2021, Budur et al., 2024).
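The two core answer metrics can be computed as follows; the normalization here is a simplified version of the usual SQuAD-style lowercasing, punctuation stripping, and article removal:

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    s = "".join(c for c in s.lower() if c not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    """EM: normalized strings must be identical."""
    return normalize(pred) == normalize(gold)

def f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))            # → True
print(round(f1("Eiffel Tower in Paris", "the Eiffel Tower"), 2))  # → 0.67
```

Retrieval metrics (R@k, MRR) are computed upstream of these, on whether and where a gold-answer-bearing passage appears in the ranked list.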

Open challenges remain in retrieval augmentation, knowledge integration, compression/distillation, and domain adaptation; continued innovation in these areas is required to make OpenQA systems reliable, interpretable, and broadly deployable.
