Closed-Book Question Answering
- Closed-book QA is defined as answering questions solely from a model’s encoded parameters without any external context, emphasizing efficient, low-latency, self-contained inference.
- Recent methodological advances, such as strategic masking and self-assessment mechanisms, significantly improve factual recall and mitigate hallucinations.
- Evaluation relies on metrics like exact match and token F1, exposing challenges in reasoning, knowledge coverage, and scalability across diverse domains.
Closed-book question answering (QA) is the task of answering open-domain or domain-specific questions solely by leveraging information encoded within a model’s parameters, without consulting any external context, documents, or retrieval systems at inference time. This paradigm contrasts with open-book QA, where models augment their internal knowledge with real-time evidence retrieval. Recent advances in LLMs and text-to-text architectures have intensified research into closed-book QA as both a scientific probe of knowledge representation and a practical approach for efficient, low-latency inference, but substantial challenges remain in knowledge coverage, reasoning, and robustness.
1. Definition and Operational Modes
Closed-book QA requires a model with parameters θ to map a question q directly to an answer a = f_θ(q): no retrieval or context input is permitted at inference time; the model relies entirely on its learned, parameterized knowledge. Key evaluation settings include:
- Zero-shot closed-book: No task-specific fine-tuning; relies on general pre-training.
- Few-shot or instruction-based closed-book: Relies on prompt engineering or few-shot in-context examples, without gradient updates.
- Fine-tuned closed-book: Model is fine-tuned on question-answer pairs (direct QA supervision).
Distinct from closed-book QA are retrieval-augmented or open-book methods, where external resources are consultable at runtime.
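The zero-shot setting above can be pictured as a single parametric forward pass from question to answer. The sketch below uses the Hugging Face transformers library; the checkpoint name, prompt template, and decoding settings are illustrative assumptions, not choices taken from the cited papers.

```python
# Minimal sketch of zero-shot closed-book QA: the model answers from its
# parameters alone -- no retriever, no passages, no external context.
# Checkpoint and prompt wording are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-base"  # any instruction-tuned text-to-text checkpoint would do

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def closed_book_answer(question: str) -> str:
    """Map a question q directly to an answer a = f_theta(q); no retrieval, no context."""
    inputs = tokenizer(f"Answer the question: {question}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(closed_book_answer("Who wrote the novel Dracula?"))
```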
2. Knowledge Encoding and Model Architectures
Empirical studies have shown that factual recall and closed-book QA performance improve markedly with model scale. For example, T5-11B achieves 42.3% exact match (EM) on TriviaQA, improved to 51.0% with salient span masking (SSM) pre-training; these figures approach open-book Dense Passage Retrieval systems (57.9% EM on TriviaQA) (Roberts et al., 2020). SSM and its successors (learned masking policies) further enhance memorization by masking answer-like spans during intermediate pre-training, “packing” relevant knowledge into parameters (Ye et al., 2020).
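As a rough illustration of the masking idea (not the exact pipelines of Roberts et al. or Ye et al.), the sketch below detects answer-like spans with crude heuristics and replaces them with T5-style sentinel tokens; in the cited work the spans come from an NER/date tagger or, for learned masking policies, from a lightweight trained selector.

```python
# Sketch of salient span masking (SSM): mask answer-like spans (entities, dates)
# so a denoising objective forces the model to reconstruct them, "packing" those
# facts into its parameters. The span detector here is a deliberately crude
# stand-in (regexes for years and capitalized spans); it is not the tagger or
# learned policy used in the cited papers.
import re

def salient_spans(text: str):
    """Return (start, end) character spans that look answer-like."""
    patterns = [
        r"\b\d{4}\b",                          # years
        r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b",   # capitalized, entity-like spans
    ]
    spans = []
    for pat in patterns:
        spans.extend(m.span() for m in re.finditer(pat, text))
    return sorted(spans)

def mask_salient_spans(text: str) -> str:
    """Replace each detected span with a T5-style sentinel token."""
    out, last, idx = [], 0, 0
    for start, end in salient_spans(text):
        if start < last:      # skip overlapping spans
            continue
        out.append(text[last:start])
        out.append(f"<extra_id_{idx}>")
        idx += 1
        last = end
    out.append(text[last:])
    return "".join(out)

print(mask_salient_spans("Bram Stoker published Dracula in 1897."))
# -> "<extra_id_0> published <extra_id_1> in <extra_id_2>."
```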
Architectures for closed-book QA typically use encoder-decoder (T5), encoder-only (BERT, RoBERTa), or transformer-based LLMs. Some approaches integrate differentiable knowledge graph modules mid-stack (e.g., OREO-LM’s Knowledge Interaction Layer), which allow multi-hop symbolic reasoning in a closed-book setting if KG parameters are differentiably fused into the model (Hu et al., 2022).
3. Methodological Advances
Several methods advance the state of closed-book QA:
- Type-driven answer re-ranking: Surfaces implicit answer type information from text-to-text outputs and filters/reranks candidates using knowledge graph entity types (instance_of / P31 in Wikidata). The ACT Selection framework increases T5-Large’s SQWD Hit@1 from 23.66% (base) to 47.42%, approaching SOTA retrieval-based KGQA even under zero-shot transfer (Salnikov et al., 2023). The final candidate score combines type, neighbor, model confidence, and property similarity terms (see the score-combination sketch after this list).
- Strategic masking policies for pre-training: Intermediate pre-training with learned (task-driven) masking of answer-like spans yields higher downstream closed-book accuracy than random or heuristic-based masking. This approach typically deploys a lightweight BiLSTM to select spans for masking, reinforcing the memory of question-worthy facts (Ye et al., 2020).
- Self-estimation and abstention: Models augmented with hallucination masking mechanisms are trained to self-assess their knowledge boundaries, outputting a special abstention token when a hallucination is likely. Parameter-efficient adapters (LoRA) allow LLMs to abstain on a large share of "unknown" queries while continuing to answer known ones, reducing expensive retrieval-based lookups (Erbacher et al., 3 Jan 2024); see the abstention sketch after this list.
- Context generation and marginalization: Generative approaches that prompt an LLM to create supporting context fragments from its own parameters prior to answering (CGAP), followed by marginalization across generated contexts, can match or surpass open-book baselines (68.6% EM on TriviaQA vs. 68.0% for open-book) with zero fine-tuning (Su et al., 2022); see the sample-and-vote sketch after this list.
- Self-contextualizing QA and answer boosting: Tree-Search methods sample diverse, plausible sequences from the model, aggregate them as context, and then re-prompt for a refined answer. This increases informativeness, accuracy, and especially robustness for smaller or fine-tuned LLMs (Kokaia et al., 2023).
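To make the type-driven re-ranking step concrete, the sketch below combines the four named signals into a weighted candidate score. The component definitions (binary type match, Jaccard overlaps) and the equal default weights are assumptions for illustration, not the exact ACT Selection formulation.

```python
# Sketch of type-driven answer re-ranking: candidate entities produced by a
# text-to-text model are re-scored with knowledge-graph signals before the top
# answer is chosen. Component functions and weights are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    entity: str
    model_confidence: float   # e.g. normalized sequence log-probability
    instance_of: set          # Wikidata P31 types of the candidate
    neighbors: set            # entities linked to the candidate in the KG
    properties: set           # relations attached to the candidate

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def rerank(candidates, expected_types, question_entities, question_properties,
           weights=(1.0, 1.0, 1.0, 1.0)):
    """Return candidates sorted by a weighted combination of four signals."""
    w_type, w_nbr, w_conf, w_prop = weights
    def score(c: Candidate) -> float:
        type_score = 1.0 if c.instance_of & expected_types else 0.0
        neighbor_score = jaccard(c.neighbors, question_entities)
        property_score = jaccard(c.properties, question_properties)
        return (w_type * type_score + w_nbr * neighbor_score
                + w_conf * c.model_confidence + w_prop * property_score)
    return sorted(candidates, key=score, reverse=True)
```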
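The abstention mechanism can be pictured at inference time as a check for a reserved token followed by optional routing to a retrieval system. The token string and the routing logic below are assumptions, not the precise setup of Erbacher et al.

```python
# Sketch of inference with a hallucination-masking model: the LM is assumed to
# have been fine-tuned (e.g. via a LoRA adapter) to emit a reserved abstention
# token when a question falls outside its parametric knowledge.
ABSTAIN_TOKEN = "[UNKNOWN]"   # hypothetical reserved token added during fine-tuning

def answer_or_abstain(generate_fn, question: str):
    """Return (answer, answered); abstained queries can be escalated elsewhere."""
    generation = generate_fn(question)
    if ABSTAIN_TOKEN in generation:
        return None, False    # the model judges the fact to be outside its parameters
    return generation.strip(), True

# Usage idea: only abstained questions are escalated to an (expensive) open-book
# pipeline, so most traffic is served from parameters alone.
# answer, answered = answer_or_abstain(closed_book_answer, some_question)
# if not answered:
#     answer = open_book_pipeline(some_question)   # hypothetical fallback system
```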
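The context-generation-and-marginalization loop can be sketched as sample-then-vote: hallucinate several supporting passages from the model's own parameters, answer once per passage, and aggregate. The prompt wording and the simple majority-vote aggregation are illustrative simplifications of CGAP.

```python
# CGAP-style sketch: generate k contexts from the model's parametric knowledge,
# answer conditioned on each, then marginalize over contexts by majority vote.
# Prompts and the voting rule are illustrative assumptions.
from collections import Counter

def cgap_answer(generate_fn, question: str, k: int = 8) -> str:
    """generate_fn should use sampling so that the k contexts are diverse."""
    answers = []
    for _ in range(k):
        # Step 1: sample a supporting context from the model itself (no retrieval).
        context = generate_fn(f"Generate a short background passage about: {question}")
        # Step 2: answer the question conditioned on the self-generated context.
        answer = generate_fn(f"Context: {context}\nQuestion: {question}\nAnswer:")
        answers.append(answer.strip().lower())
    # Step 3: approximate marginalization by taking the most frequent answer.
    return Counter(answers).most_common(1)[0][0]
```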
4. Limitations and Failure Modes
Closed-book QA faces several acute challenges:
- Memorization vs. Generalization: Analyses reveal that models such as BART act primarily as memorizers. On benchmarks with high question/answer overlap (e.g., TriviaQA: 71.7% answer overlap), EM exceeds 65% on overlapping test questions but drops to near-zero when both questions and answers are novel (Lewis et al., 2020).
- Parametric Knowledge Bottleneck: With datasets engineered for low overlap (e.g., SQuAD-derived), BART’s closed-book QA accuracy collapses from ~25% to 1.5%. Model capacity for precise fact memorization scales sublinearly with pretraining corpus size (Wang et al., 2021).
- Reasoning Deficiency: Sequence-to-sequence LMs trained only on single-hop questions lack emergent multi-hop compositionality. Performance on multi-hop closed-book QA degrades sharply (e.g., 33.25% EM for UnifiedQA, even when both hops are individually answerable) (Jiang et al., 2022).
- Domain and Linguistic Scarcity: In low-resource languages such as Kazakh, top LLMs (ChatGPT v3.5) correctly recall answers for fewer than 9% of test cases and exhibit numerous hallucinations, due to limited representation in pre-training data (Yeshpanov et al., 6 Apr 2024).
- Domain-specific Transfer Failure: Focused pre-training on domain texts, such as textbooks, yields only marginal gains in downstream closed-book QA (e.g., 56% accuracy vs. 50% expected by random guessing; open-book settings rise to 74%), exposing shallow or brittle knowledge absorption (Ciosici et al., 2021).
5. Evaluation and Benchmarking Practices
Closed-book QA is most commonly evaluated using exact match (EM) and token-level F1, with additional recall- or substring-based metrics for morphologically rich languages. In long-form or multifaceted QA, metrics such as ROUGE-L, Disambig-F1 (facet extraction accuracy), and QAEvalTheme are used (Amplayo et al., 2022). Human evaluation remains critical, as automated metrics poorly capture paraphrase, reasoning, or multiple valid answer representations—particularly for generative, long-form, and non-English outputs (Peinl et al., 2023).
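The two core metrics can be stated precisely as short functions. The normalization below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and articles); individual benchmarks differ in small details.

```python
# SQuAD-style exact match and token-level F1, the standard closed-book QA metrics.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```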
Crucially, empirical studies recommend stratifying benchmarks by question and answer overlap levels and caution that single, aggregate EM scores dramatically overstate model generalization. For instance, BART achieves 71.5% EM on question-overlapping WebQuestions, but only 1.6% EM on non-overlapping cases (Lewis et al., 2020). Custom or compositional benchmarks (e.g., Complex WebQuestions, ASQA, WikiCQA) highlight weaknesses otherwise masked in standard test sets.
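A minimal version of such overlap stratification might look like the following; exact-string matching is a simplification, since the original analysis also considers paraphrased (near-duplicate) questions.

```python
# Sketch of overlap-stratified evaluation in the spirit of Lewis et al.: bucket
# test items by whether their question or answer also appears in the training
# set, and report accuracy per bucket instead of a single aggregate score.
from collections import defaultdict

def stratify_by_overlap(test_items, train_questions, train_answers):
    """test_items: iterable of dicts with 'question', 'answer', 'correct' (bool)."""
    train_q = {q.strip().lower() for q in train_questions}
    train_a = {a.strip().lower() for a in train_answers}
    buckets = defaultdict(list)
    for item in test_items:
        if item["question"].strip().lower() in train_q:
            key = "question_overlap"
        elif item["answer"].strip().lower() in train_a:
            key = "answer_overlap"
        else:
            key = "no_overlap"
        buckets[key].append(item["correct"])
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}
```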
6. Integration of Symbolic Knowledge and Reasoning
Hybrid models incorporating symbolic knowledge can materially enhance closed-book QA. Notable methods include:
- Knowledge Graph-guided type filtering and re-ranking: Improve candidate selection for text-to-text LMs using Wikidata's "instance_of" types; demonstrated to double Hit@1 accuracy on some benchmarks (Salnikov et al., 2023).
- Differentiable KG reasoning modules (OREO-LM): Embed multi-hop KG reasoning within the transformer stack, supporting probabilistic walks over entity/relation graphs and yielding both empirical gains and output reasoning paths for interpretability. Gains are particularly large for problems requiring inference over incomplete or missing KG edges (Hu et al., 2022).
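As a toy illustration of the probabilistic-walk idea (not OREO-LM's architecture, in which relation weights are predicted from the transformer's hidden states and trained end-to-end), a soft distribution over entities can be propagated through relation adjacency matrices:

```python
# Toy differentiable multi-hop walk over a tiny KG: a soft entity distribution
# is pushed through per-relation adjacency matrices with hand-set relation
# weights. This sketches only the reasoning step, not the OREO-LM model.
import numpy as np

# Entities: 0 = Paris, 1 = France, 2 = Europe
capital_of = np.zeros((3, 3))
capital_of[0, 1] = 1.0     # (Paris, capital_of, France)
located_in = np.zeros((3, 3))
located_in[1, 2] = 1.0     # (France, located_in, Europe)

relations = {"capital_of": capital_of, "located_in": located_in}

def walk_step(entity_dist, relation_weights):
    """One soft hop: mixture of relation-specific transitions, renormalized."""
    nxt = sum(w * entity_dist @ relations[r] for r, w in relation_weights.items())
    total = nxt.sum()
    return nxt / total if total > 0 else nxt

# Two-hop query: "In which continent is the country whose capital is Paris?"
p = np.array([1.0, 0.0, 0.0])                             # start at Paris
p = walk_step(p, {"capital_of": 0.9, "located_in": 0.1})  # hop 1 -> France
p = walk_step(p, {"capital_of": 0.1, "located_in": 0.9})  # hop 2 -> Europe
print(p)   # probability mass concentrates on entity 2 (Europe)
```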
Such models suggest that, even when constrained by closed-book protocols (no retrieval at inference), strategic use of symbolic priors and knowledge structures can supplement or amplify the factual capacity of LLMs.
7. Directions for Robustness, Multilinguality, and Future Progress
Recent results indicate the following points of emphasis for advancing closed-book QA:
- Instruction tuning and RLHF are more decisive than scale alone. Well-tuned open LLMs (e.g., Airoboros 33B) matched or surpassed the closed ChatGPT (175B) in zero-shot generative QA, and pooling the best answers from multiple models covered 91.8% of tested questions (Peinl et al., 2023).
- Ensembling and answer diversity are underexploited. Combining answers across complementary models or generations yields substantial accuracy boosts.
- Aspect-based feedback and facetwise supervision are crucial for long-form QA, especially to handle multifaceted or ambiguous queries (Amplayo et al., 2022).
- Multilingual and domain adaptation remain open challenges. Current LLMs perform poorly on low-resource languages without substantial corpus representation; monolingual models underperform even further (Yeshpanov et al., 6 Apr 2024).
A plausible implication is that future research should focus on more sample-efficient knowledge encoding, cross-lingual and domain transfer mechanisms, and systematic approaches to compositional and multi-hop reasoning, as well as the careful construction of benchmarks that expose memorization and reasoning bottlenecks. Integrative symbolic/neural architectures and robust uncertainty/self-assessment frameworks also present promising avenues for real-world, scalable closed-book QA systems.