Closed-Book Question Answering
- Closed-Book Question Answering is a paradigm where neural models generate answers solely from internalized, pre-trained knowledge without external retrieval.
- Recent CBQA systems use large transformer architectures with unsupervised pre-training and supervised fine-tuning to implicitly encode factual information.
- Scaling model size and integrating techniques like knowledge graph augmentation offer promising improvements in factual accuracy and interpretability.
Closed-Book Question Answering (CBQA) is a research paradigm in natural language processing where systems answer open-domain or specialized questions without retrieving or consulting any external documents or knowledge sources at inference time. Instead, all potentially relevant information is assumed to be stored implicitly within the parameters of a pre-trained or fine-tuned large language model (LLM). CBQA contrasts with open-book approaches, which employ explicit retrieval mechanisms to supplement the answer generation process with external textual evidence.
1. Key Principles and Formalization
The core of CBQA is the reliance on the internalized knowledge representations learned during large-scale, typically unsupervised, pre-training and subsequent supervised fine-tuning. Standard CBQA formalizes the input–output mapping as:
$$\hat{a} = \arg\max_{a} \, p_\theta(a \mid q),$$
where $q$ is a natural language question and $a$ is the answer sequence. The model parameters $\theta$ are learned by maximizing the conditional likelihood over training pairs:
$$\mathcal{L}(\theta) = \sum_{(q, a) \in \mathcal{D}} \log p_\theta(a \mid q).$$
This maximum likelihood estimation (MLE) objective compels the model to retrieve and synthesize domain knowledge stored in $\theta$ without recourse to supporting context or evidence at test time (Roberts et al., 2020).
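A minimal sketch of this setup, assuming the Hugging Face transformers library and one of the publicly released T5 closed-book checkpoints (the checkpoint name below is illustrative; any seq2seq QA model can be substituted): the answer is decoded directly from $p_\theta(a \mid q)$ with no retrieved context.

```python
# Closed-book inference sketch: the model generates an answer approximating
# argmax_a p_theta(a | q) without consulting any external documents.
# Assumes the Hugging Face `transformers` library; "google/t5-large-ssm-nq"
# is a checkpoint fine-tuned for closed-book NaturalQuestions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/t5-large-ssm-nq"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "When was the Eiffel Tower completed?"
inputs = tokenizer(question, return_tensors="pt")

# Greedy decoding approximates the argmax over answer sequences.
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```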
2. Model Architectures and Training Paradigms
Modern CBQA systems overwhelmingly employ large encoder-decoder Transformer architectures (e.g., T5, BART), building on advances in pre-training and transfer learning. Two principal training phases are typically observed:
- Unsupervised Pre-training: Models are initially trained using denoising objectives such as span corruption (masking contiguous input spans and reconstructing them), with salient span masking (SSM) specifically targeting knowledge-rich spans (e.g., named entities or dates).
Given an input $x$, selected spans are replaced by sentinel tokens (e.g., <M>), and the task is to predict the missing content; see the preprocessing sketch after this list.
- Supervised Fine-tuning for QA: The pre-trained models are further fine-tuned exclusively on $(q, a)$ pairs (CBQA data), with the answer expected to be generated entirely from model parameters. AdaFactor optimization and large effective batch sizes are typical; minimal or no additional hyperparameter tuning is performed (Roberts et al., 2020).
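A rough illustration of SSM-style preprocessing, assuming spaCy's small English model for detecting salient spans and T5-style sentinel tokens; the span-selection heuristics used in the original pipelines may differ.

```python
# Salient span masking (SSM) sketch: mask named-entity and date spans with
# sentinel tokens so the model must reconstruct knowledge-rich content.
# Assumes spaCy with `en_core_web_sm` installed
# (`python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")

def salient_span_mask(text: str) -> tuple[str, str]:
    doc = nlp(text)
    masked, targets, cursor, sid = [], [], 0, 0
    for ent in doc.ents:  # entities and dates are treated as salient spans
        masked.append(text[cursor:ent.start_char])
        masked.append(f"<extra_id_{sid}>")
        targets.append(f"<extra_id_{sid}> {ent.text}")
        cursor, sid = ent.end_char, sid + 1
    masked.append(text[cursor:])
    return "".join(masked), " ".join(targets)

src, tgt = salient_span_mask("The Eiffel Tower was completed in 1889 in Paris.")
print(src)  # e.g. "The <extra_id_0> was completed in <extra_id_1> in <extra_id_2>."
print(tgt)  # e.g. "<extra_id_0> Eiffel Tower <extra_id_1> 1889 <extra_id_2> Paris"
```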
Meta-learning approaches (e.g., MetaQA) substitute standard fine-tuning with model-agnostic meta-learning (MAML), enabling rapid adaptation to new question types without external retrieval (Zheng et al., 2020). The learning objective is:
$$\min_{\theta} \sum_{i} \mathcal{L}_{\mathcal{Q}_i}\!\left(U_{\mathcal{S}_i}(\theta)\right),$$
where $\mathcal{S}_i$ is a task support set and $\mathcal{Q}_i$ a query set; $U_{\mathcal{S}_i}$ denotes fast adaptation of $\theta$ via gradient steps on $\mathcal{S}_i$.
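A simplified first-order sketch of this meta-training loop, assuming PyTorch, a list of `(support_batch, query_batch)` task pairs, and a hypothetical `loss_fn(model, batch)` helper returning the QA loss; the actual MetaQA procedure may differ in its inner-loop details.

```python
# First-order MAML step: adapt a copy of the model on each task's support set
# (fast adaptation U_S(theta)), evaluate the loss on the query set, and update
# the shared initialization theta with the averaged query-set gradients.
import copy
import torch

def meta_train_step(model, tasks, loss_fn, inner_lr=1e-3, meta_lr=1e-4, inner_steps=1):
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support_batch, query_batch in tasks:
        fast_model = copy.deepcopy(model)              # task-specific copy of theta
        inner_opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                   # inner loop on the support set
            inner_opt.zero_grad()
            loss_fn(fast_model, support_batch).backward()
            inner_opt.step()
        fast_model.zero_grad()                         # outer loss on the query set
        loss_fn(fast_model, query_batch).backward()
        for g, p in zip(meta_grads, fast_model.parameters()):
            g += p.grad.detach()
    with torch.no_grad():                              # first-order meta-update of theta
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g / len(tasks)
```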
3. Knowledge Storage, Scalability, and Limitations
A central question in CBQA research is quantifying and maximizing the amount of factual knowledge storable in model parameters. Empirical results demonstrate that accuracy in closed-book QA scales positively with model size: moving from T5-Base (220M parameters) to T5-11B dramatically improves performance on NaturalQuestions, TriviaQA, and WebQuestions (Roberts et al., 2020). Models using SSM outperform non-SSM counterparts at every scale.
Despite this scaling law, several critical limitations persist:
| Challenge | Description | Significance |
|---|---|---|
| Model Size & Cost | State-of-the-art CBQA demands multi-billion-parameter models | Limits accessibility |
| Interpretability | Knowledge is stored implicitly, not traceable | Obstructs verification |
| Hallucination | Models may "invent" plausible answers | Reduces trust |
| Evaluation Metrics | String-match metrics often underestimate correctness as judged by humans | Requires robust metrics |
Maximum likelihood training does not ensure reliable fact memorization, and robust evaluation requires not only exact match but also human assessment for nuanced answer validity.
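For concreteness, the usual normalized exact-match scorer can be sketched as below; the second example shows how a defensible answer is scored as incorrect, which is why multi-reference and human evaluation are recommended alongside string matching.

```python
# SQuAD-style normalized exact match: lowercase, strip punctuation and
# articles, collapse whitespace, then compare against all reference answers.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list[str]) -> bool:
    return any(normalize(prediction) == normalize(ref) for ref in references)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # True after normalization
print(exact_match("President Obama", ["Barack Obama"]))   # False, though arguably acceptable
```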
4. Knowledge Graphs and External Fact Augmentation
Pure CBQA stores knowledge in weights, but recent work explores bridging implicit and structured sources. Knowledge-augmented prompting (KAPING) injects knowledge graph triples directly into prompts, supporting zero-shot answers that fuse model-internal and retrieved factual signals. Retrieved facts are selected and ranked for relevance using embedding similarity, then prepended to the question in triple- or free-form text (Baek et al., 2023). This approach can substantially reduce hallucinations and improve factuality, especially for less popular or multi-hop questions.
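A minimal sketch of this prompting pattern, assuming the sentence-transformers package for embedding similarity; the triples, encoder name, and prompt wording below are illustrative rather than the paper's exact configuration.

```python
# KAPING-style knowledge-augmented prompting sketch: rank KG triples by
# embedding similarity to the question and prepend the top-k verbalized
# triples to the prompt for zero-shot answering.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

triples = [
    ("Eiffel Tower", "located in", "Paris"),
    ("Eiffel Tower", "year completed", "1889"),
    ("Louvre", "located in", "Paris"),
]
question = "When was the Eiffel Tower finished?"

verbalized = [f"({s}, {r}, {o})" for s, r, o in triples]
scores = util.cos_sim(encoder.encode(question, convert_to_tensor=True),
                      encoder.encode(verbalized, convert_to_tensor=True))[0]
top_k = [verbalized[int(i)] for i in scores.argsort(descending=True)[:2]]

prompt = ("Below are facts that may be relevant to the question.\n"
          + "\n".join(top_k)
          + f"\nQuestion: {question}\nAnswer:")
print(prompt)  # feed this prompt to any instruction-tuned LLM
```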
Other techniques—such as Answer Candidate Type selection—filter answer candidates by their knowledge graph entity types, improving accuracy for questions involving rare entities (Salnikov et al., 2023). Differentiable knowledge graph reasoning modules can be injected into Transformer architectures (e.g., OREO-LM), enabling multi-hop relational inference and yielding interpretable reasoning paths (Hu et al., 2022).
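A toy sketch of the candidate-type filtering idea, with a hypothetical entity-type lookup standing in for a real knowledge graph (e.g., Wikidata) and a trained expected-type predictor.

```python
# Answer-candidate-type filtering sketch: keep only generated candidates whose
# KG entity type matches the type expected for the question.
def filter_by_expected_type(candidates, expected_type, entity_type):
    return [c for c in candidates if entity_type.get(c) == expected_type]

entity_type = {"Paris": "city", "France": "country", "1889": "year"}  # stand-in for a KG lookup
candidates = ["France", "Paris", "1889"]
print(filter_by_expected_type(candidates, "city", entity_type))  # ['Paris']
```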
5. Practical and Application-Specific Advances
CBQA underpins diverse applications, from trivia and open-domain QA benchmarks to specialized science exams and long-form multi-facet answers. MetaQA demonstrates that meta-learning over knowledge-point tasks and leveraging contextual signals from labeled examples can surpass retrieval-based approaches for complex exam-style reasoning (Zheng et al., 2020). Task-specific masking strategies—learning which spans are salient to future QA tasks—improve knowledge retention and initialization for fine-tuning (Ye et al., 2020).
Prompting strategies now include query refinement, where LLMs generate facets or sub-questions prior to answering, resolving ambiguity and promoting comprehensive long-form responses without external support (Amplayo et al., 2022). Context generation frameworks generate and marginalize over candidate contexts sourced from the LM itself, mimicking a retriever-reader pipeline inside the model and achieving accuracy on par with open-book QA (Su et al., 2022, Kokaia et al., 2023).
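A schematic sketch of such a self-contextualizing pipeline, assuming only a hypothetical text-in/text-out `llm` callable; majority voting over sampled contexts stands in for full marginalization.

```python
# Generate-then-read sketch: the model writes its own candidate contexts,
# answers conditioned on each, and the most frequent answer is returned.
from collections import Counter
from typing import Callable

def answer_with_generated_contexts(question: str, llm: Callable[[str], str],
                                   num_contexts: int = 5) -> str:
    answers = []
    for _ in range(num_contexts):
        context = llm(f"Write a short background paragraph relevant to: {question}")
        answers.append(llm(f"Context: {context}\nQuestion: {question}\nShort answer:"))
    return Counter(a.strip() for a in answers).most_common(1)[0][0]
```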
Medium-large models (6B–100B parameters) can approach or even exceed the performance of commercial LLMs like ChatGPT (82.7% with best-of-ensemble aggregation vs. 60.9% for ChatGPT), provided instruction fine-tuning and dataset coverage are sufficient (Peinl et al., 2023). These results nuance the scaling narrative: training data quality and feedback granularity can be as important as brute model size.
6. Evaluation, Generalization, and Predictive Metrics
CBQA evaluation must address not only memorization but also true generalization (Lewis et al., 2020, Wang et al., 2021). Substantial test-train overlap in current datasets can obscure generalization deficits: performance may drop by over 63% when moving from memorized to genuinely novel test questions. Nearest-neighbor retrieval baselines can outperform sophisticated closed-book models purely by exploiting data overlap.
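A compact version of such a nearest-neighbor baseline, assuming scikit-learn and toy data; strong scores for this baseline indicate dataset overlap rather than model capability.

```python
# Nearest-neighbor baseline: answer each test question by copying the answer
# of the most lexically similar training question (TF-IDF + cosine similarity).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_qa = [
    ("who wrote the origin of species", "Charles Darwin"),
    ("what year did the titanic sink", "1912"),
]
test_questions = ["who was the author of on the origin of species"]

vectorizer = TfidfVectorizer().fit([q for q, _ in train_qa] + test_questions)
train_vecs = vectorizer.transform([q for q, _ in train_qa])
test_vecs = vectorizer.transform(test_questions)

for question, sims in zip(test_questions, cosine_similarity(test_vecs, train_vecs)):
    print(question, "->", train_qa[sims.argmax()][1])  # copies the nearest training answer
```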
The Size-dependent Mutual Information (SMI) metric quantifies a model's likely performance on CBQA tasks using only pre-training signals and model size, without additional training (Jiang et al., 6 Feb 2025). SMI combines the normalized mutual information between subject–object pairs in the pre-training corpus with a normalization function and a scaling factor for model size $N$ (in billions of parameters). SMI achieves $R^2$ values exceeding $0.84$ in predicting CBQA accuracy across models from $1.1$B to $13$B parameters, demonstrating that it is possible to anticipate performance from corpus characteristics and model scale alone.
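One simple way to estimate the corpus-level association between a subject and an object is document-level pointwise mutual information; this is only an illustration of the corpus-statistics ingredient of SMI, not the metric's exact definition.

```python
# Document-level PMI between a subject and an object: how much more often they
# co-occur than independence would predict (a rough proxy for corpus MI).
import math

def pmi(subject: str, obj: str, documents: list[str]) -> float:
    n = len(documents)
    s = sum(subject in d for d in documents)
    o = sum(obj in d for d in documents)
    both = sum(subject in d and obj in d for d in documents)
    if 0 in (s, o, both):
        return float("-inf")  # never (co-)occur in this corpus
    return math.log((both / n) / ((s / n) * (o / n)))

docs = ["The Eiffel Tower stands in Paris.",
        "Paris hosted the 1900 Exposition.",
        "The Eiffel Tower opened in 1889."]
print(pmi("Eiffel Tower", "Paris", docs))
```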
Multi-template QA evaluation, involving many paraphrased variants for each question, is recommended for robust measurement of factual recall, as single-template or single-metric approaches are unstable.
7. Future Directions
Research suggests several promising directions for CBQA systems:
- Parameter efficiency: Development of smaller models or alternative architectures that approach the effectiveness of current multi-billion parameter systems.
- Interpretable reasoning: Integrating explicit knowledge graph reasoning or rationale generation.
- Adaptive external querying: Endowing models with learned self-assessment, making API calls or searches only when internal confidence is low (Erbacher et al., 3 Jan 2024); a minimal gating sketch follows this list.
- Data and metric refinement: Curating pre-training datasets based on predicted SMI impact and adopting multi-faceted evaluation protocols to measure diverse reasoning and generalization capabilities (Jiang et al., 6 Feb 2025, Ciosici et al., 2021).
- Hybrid and self-contextualizing strategies: Merging internal LM knowledge with dynamically generated or retrieved context while controlling computation and verifying answer quality (Kokaia et al., 2023, Su et al., 2022).
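A minimal sketch of the confidence-gated querying idea from the list above, with hypothetical `generate_with_score` and `search_then_answer` callables standing in for a real model and search API.

```python
# Adaptive external querying sketch: answer from parameters first, and only
# fall back to external search when the model's own confidence is low.
from typing import Callable, Tuple

def answer_adaptively(question: str,
                      generate_with_score: Callable[[str], Tuple[str, float]],
                      search_then_answer: Callable[[str], str],
                      threshold: float = 0.5) -> str:
    answer, confidence = generate_with_score(question)  # closed-book attempt
    if confidence >= threshold:
        return answer
    return search_then_answer(question)  # consult external evidence only when unsure
```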
Systematic integration of these advances is expected to mitigate hallucination, improve factual robustness, and provide transparent, efficient, and scalable solutions to challenging knowledge-intensive question answering tasks.