
Closed-Book Question Answering

Updated 27 September 2025
  • Closed-Book Question Answering is a paradigm where neural models generate answers solely from internalized, pre-trained knowledge without external retrieval.
  • Recent CBQA systems use large transformer architectures with unsupervised pre-training and supervised fine-tuning to implicitly encode factual information.
  • Scaling model size and integrating techniques like knowledge graph augmentation offer promising improvements in factual accuracy and interpretability.

Closed-Book Question Answering (CBQA) is a research paradigm in natural language processing where systems answer open-domain or specialized questions without retrieving or consulting any external documents or knowledge sources at inference time. Instead, all potentially relevant information is assumed to be stored implicitly within the parameters of a pre-trained or fine-tuned neural LLM. CBQA contrasts with open-book approaches, which employ explicit retrieval mechanisms to supplement the answer generation process with external textual evidence.

1. Key Principles and Formalization

The core of CBQA is the reliance on the internalized knowledge representations learned during large-scale, typically unsupervised, pre-training and subsequent supervised fine-tuning. Standard CBQA formalizes the input–output mapping as:

x \to y,

where x is a natural language question and y = (y_1, y_2, ..., y_T) is the answer sequence. The model parameters θ are learned by maximizing the conditional likelihood (equivalently, minimizing the negative log-likelihood):

L(\theta) = -\sum_t \log P(y_t \mid y_1, ..., y_{t-1}, x; \theta)

This maximum likelihood estimation (MLE) objective compels the model to retrieve and synthesize domain knowledge stored in θ without recourse to supporting context or evidence at test time (Roberts et al., 2020).
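As a concrete illustration of this objective, the following is a minimal sketch of teacher-forced training on a single (question, answer) pair using the Hugging Face transformers API; the checkpoint choice and question prefix are placeholders, not the exact setup of Roberts et al. (2020):

```python
# Minimal sketch: cross-entropy on one (question, answer) pair with a small T5 model.
# The returned loss is the mean per-token negative log-likelihood, i.e. L(theta)
# averaged over answer tokens; checkpoint and prompt format are illustrative only.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "question: who developed the theory of general relativity?"
answer = "Albert Einstein"

inputs = tok(question, return_tensors="pt")
labels = tok(answer, return_tensors="pt").input_ids

out = model(input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=labels)          # cross-entropy over the answer tokens
print(float(out.loss))              # the MLE objective the model is trained to minimize
```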

2. Model Architectures and Training Paradigms

Modern CBQA systems overwhelmingly employ large encoder-decoder Transformer architectures (e.g., T5, BART), building on advances in pre-training and transfer learning. Two principal training phases are typically observed:

  • Unsupervised Pre-training: Models are initially trained using denoising objectives such as span corruption (masking contiguous input spans and reconstructing them), with salient span masking (SSM) specifically targeting knowledge-rich spans (e.g., named entities or dates).

Given an input x = (x_1, ..., x_n), selected spans x_{i:j} are replaced by sentinel tokens (e.g., <M>), and the task is to predict the missing content; a minimal sketch of this construction appears after this list.

  • Supervised Fine-tuning for QA: The pre-trained models are further fine-tuned exclusively on (question, answer) pairs (CBQA data), with the answer expected to be generated entirely from model parameters. Adafactor optimization and large effective batch sizes are typical; minimal or no additional hyperparameter tuning is performed (Roberts et al., 2020).
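The span-corruption input/target construction referenced above can be sketched as follows; the tokenization, sentinel naming, and hand-picked spans are simplifications, and real SSM selects spans with an entity/date tagger rather than by hand:

```python
# Illustrative T5-style span corruption: replace selected spans with sentinels and
# build the corresponding target sequence. Span selection is simplified; salient
# span masking (SSM) would choose entity/date spans automatically.

def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target sequence."""
    inp, tgt = [], []
    last = 0
    for k, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"          # T5 sentinel-token convention
        inp.extend(tokens[last:s]); inp.append(sentinel)
        tgt.append(sentinel); tgt.extend(tokens[s:e])
        last = e
    inp.extend(tokens[last:])
    tgt.append(f"<extra_id_{len(spans)}>")    # closing sentinel
    return " ".join(inp), " ".join(tgt)

tokens = "Franklin D. Roosevelt was born in 1882".split()
# SSM would mask knowledge-rich spans such as the entity and the date:
print(corrupt_spans(tokens, [(0, 3), (6, 7)]))
# ('<extra_id_0> was born in <extra_id_1>',
#  '<extra_id_0> Franklin D. Roosevelt <extra_id_1> 1882 <extra_id_2>')
```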

Meta-learning approaches (e.g., MetaQA) replace standard fine-tuning with model-agnostic meta-learning (MAML), enabling rapid adaptation to new question types without external retrieval (Zheng et al., 2020). The learning objective is:

\min_{\theta} \sum_{\tau} L(D'_{\tau}, T(D_{\tau}, \theta))

where D_τ is a task support set and D'_τ a query set; T denotes fast adaptation.
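A highly simplified, first-order sketch of this meta-objective in PyTorch is shown below; it is illustrative only (deep-copy adaptation, first-order gradient approximation) and does not reproduce the MetaQA training pipeline:

```python
# First-order MAML-style meta-update over a batch of tasks (illustrative sketch).
import copy
import torch

def meta_update(model, loss_fn, tasks, inner_lr=1e-2, meta_lr=1e-3, inner_steps=1):
    """tasks: list of (support, query), each a (inputs, targets) tensor pair."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support, query in tasks:
        learner = copy.deepcopy(model)                        # task-specific copy of theta
        opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                          # fast adaptation T(D_tau, theta)
            opt.zero_grad()
            loss_fn(learner(support[0]), support[1]).backward()
            opt.step()
        opt.zero_grad()                                       # clear inner-loop gradients
        loss_fn(learner(query[0]), query[1]).backward()       # L(D'_tau, T(D_tau, theta))
        for g, p in zip(meta_grads, learner.parameters()):
            g += p.grad                                       # first-order approximation
    with torch.no_grad():                                     # meta-step on the original theta
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g / len(tasks)
```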

3. Knowledge Storage, Scalability, and Limitations

A central question in CBQA research is quantifying and maximizing the amount of factual knowledge storable in model parameters. Empirical results demonstrate that accuracy in closed-book QA scales positively with model size: moving from T5-Base (~220M parameters) to T5-11B dramatically improves performance on NaturalQuestions, TriviaQA, and WebQuestions (Roberts et al., 2020). Models using SSM outperform non-SSM counterparts at every scale.

Despite this scaling law, several critical limitations persist:

Challenge          | Description                                                       | Significance
-------------------|-------------------------------------------------------------------|------------------------
Model Size & Cost  | State-of-the-art CBQA demands multi-billion-parameter models      | Limits accessibility
Interpretability   | Knowledge is stored implicitly and is not traceable               | Obstructs verification
Hallucination      | Models may "invent" plausible answers                             | Reduces trust
Evaluation Metrics | String-match metrics often underestimate human-level correctness  | Requires robust metrics

Maximum likelihood training does not ensure reliable fact memorization, and robust evaluation requires not only exact match but also human assessment for nuanced answer validity.

4. Knowledge Graphs and External Fact Augmentation

Pure CBQA stores knowledge in weights, but recent work explores bridging implicit and structured sources. Knowledge-augmented prompting (KAPING) injects knowledge graph triples directly into prompts, supporting zero-shot answers that fuse model-internal and retrieved factual signals. Retrieved facts are selected and ranked for relevance using embedding similarity, then prepended to the question in triple- or free-form text (Baek et al., 2023). This approach can substantially reduce hallucinations and improve factuality, especially for less popular or multi-hop questions.
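The retrieval-and-prompting step can be sketched as follows; the `embed` callable (any sentence encoder returning a 2-D array) and the prompt template are placeholders rather than the exact KAPING implementation:

```python
# Sketch of KAPING-style prompting: verbalize KG triples, rank them by cosine
# similarity to the question, and prepend the top-k facts to the prompt.
import numpy as np

def kaping_prompt(question, triples, embed, k=3):
    facts = [f"({s}, {r}, {o})" for s, r, o in triples]          # verbalize triples
    q_vec = embed([question])[0]                                  # embed the question
    f_vecs = embed(facts)                                         # embed candidate facts
    sims = f_vecs @ q_vec / (np.linalg.norm(f_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top = [facts[i] for i in np.argsort(-sims)[:k]]               # most relevant facts first
    context = "Below are facts that might be relevant:\n" + "\n".join(top)
    return f"{context}\nQuestion: {question}\nAnswer:"
```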

Other techniques—such as Answer Candidate Type selection—filter answer candidates by their knowledge graph entity types, improving accuracy for questions involving rare entities (Salnikov et al., 2023). Differentiable knowledge graph reasoning modules can be injected into Transformer architectures (e.g., OREO-LM), enabling multi-hop relational inference and yielding interpretable reasoning paths (Hu et al., 2022).

5. Practical and Application-Specific Advances

CBQA underpins diverse applications, from trivia and open-domain QA benchmarks to specialized science exams and long-form multi-facet answers. MetaQA demonstrates that meta-learning over knowledge-point tasks and leveraging contextual signals from labeled examples can surpass retrieval-based approaches for complex exam-style reasoning (Zheng et al., 2020). Task-specific masking strategies—learning which spans are salient to future QA tasks—improve knowledge retention and initialization for fine-tuning (Ye et al., 2020).

Prompting strategies now include query refinement, where LLMs generate facets or sub-questions prior to answering, resolving ambiguity and promoting comprehensive long-form responses without external support (Amplayo et al., 2022). Context generation frameworks generate and marginalize over candidate contexts sourced from the LM itself, mimicking a retriever-reader pipeline inside the model and achieving accuracy on par with open-book QA (Su et al., 2022, Kokaia et al., 2023).
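A rough sketch of such a generate-then-read loop is given below; `lm_generate` stands in for any text-generation call (prompt in, sampled string out), and the majority vote is a simplification of the marginalization used in the cited work:

```python
# Self-contextualized answering: sample several model-generated contexts, answer
# conditioned on each, then aggregate by simple majority vote (illustrative only).
from collections import Counter

def self_contextualized_answer(question, lm_generate, n_contexts=5):
    answers = []
    for _ in range(n_contexts):
        context = lm_generate(f"Write a short background passage about: {question}")
        answers.append(lm_generate(f"Context: {context}\nQuestion: {question}\nAnswer:"))
    return Counter(a.strip() for a in answers).most_common(1)[0][0]  # most frequent answer
```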

Medium-large models (6B–100B parameters) can approach or even exceed the performance of commercial LLMs such as ChatGPT (82.7% vs. 60.9% for ChatGPT under best-of-ensemble aggregation), provided instruction fine-tuning and dataset coverage are sufficient (Peinl et al., 2023). These results nuance the scaling narrative: training data quality and feedback granularity can be as important as brute model size.

6. Evaluation, Generalization, and Predictive Metrics

CBQA evaluation must address not only memorization but also true generalization (Lewis et al., 2020, Wang et al., 2021). Substantial test-train overlap in current datasets can obscure generalization deficits: performance may drop by over 63% when moving from memorized to genuinely novel test questions. Nearest-neighbor retrieval baselines can outperform sophisticated closed-book models purely by exploiting data overlap.
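The overlap analysis can be approximated with a simple script like the following; the normalization and exact-membership test are simplifications of the matching procedures used by Lewis et al. (2020):

```python
# Flag test questions whose normalized question (or answer) also appears in training data.
import re
import string

def normalize(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def overlap_stats(train_pairs, test_pairs):
    """train_pairs / test_pairs: lists of (question, answer) strings."""
    train_qs = {normalize(q) for q, _ in train_pairs}
    train_as = {normalize(a) for _, a in train_pairs}
    q_overlap = sum(normalize(q) in train_qs for q, _ in test_pairs) / len(test_pairs)
    a_overlap = sum(normalize(a) in train_as for _, a in test_pairs) / len(test_pairs)
    return {"question_overlap": q_overlap, "answer_overlap": a_overlap}
```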

The Size-dependent Mutual Information (SMI) metric quantifies a model's likely performance on CBQA tasks using only pre-training signal and model size, without additional training (Jiang et al., 6 Feb 2025). SMI integrates mutual information between subject–object pairs in the pre-training corpus with a scaling factor for model parameters:

\mathrm{SMI}(s, o, \Phi) = [\mathrm{Norm}(\log(I(s, o)))]^{1 + 1/\Phi}

where I(s, o) is the normalized mutual information, Norm is a normalization function, and Φ is the model size in billions of parameters. SMI achieves R² values exceeding 0.84 in predicting CBQA accuracy across models from 1.1B to 13B parameters, demonstrating that it is possible to anticipate performance from corpus characteristics and model scale alone.
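The formula above can be transcribed directly into code; the choice of min-max normalization in log space below is an assumption on our part, as the paper's exact Norm function is not reproduced here:

```python
# Direct transcription of the SMI formula; Norm is assumed to be min-max
# normalization of log mutual information over a reference set of (s, o) pairs.
import math

def smi(mutual_info, phi_billion, mi_min, mi_max):
    """mutual_info: I(s, o); phi_billion: model size Phi in billions of parameters."""
    norm = (math.log(mutual_info) - math.log(mi_min)) / (math.log(mi_max) - math.log(mi_min))
    norm = min(max(norm, 0.0), 1.0)                 # clamp Norm(.) to [0, 1]
    return norm ** (1.0 + 1.0 / phi_billion)        # [Norm(log I)]^(1 + 1/Phi)
```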

Multi-template QA evaluation, involving many paraphrased variants for each question, is recommended for robust measurement of factual recall, as single-template or single-metric approaches are unstable.
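A minimal evaluation harness along these lines might look as follows; the templates and the containment-based scoring are illustrative placeholders, and `answer_fn` stands in for any question-answering call:

```python
# Multi-template evaluation of a single fact: query the model with several paraphrases
# and report mean accuracy plus the spread across templates.
import statistics

TEMPLATES = [                                   # illustrative paraphrase templates
    "Who is the author of {work}?",
    "{work} was written by whom?",
    "Name the person who wrote {work}.",
]

def multi_template_accuracy(answer_fn, work, gold):
    accs = [float(gold.lower() in answer_fn(t.format(work=work)).lower())
            for t in TEMPLATES]
    return sum(accs) / len(accs), statistics.pstdev(accs)   # (mean, across-template std)
```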

7. Future Directions

Research suggests several promising directions for CBQA systems:

  • Parameter efficiency: Development of smaller models or alternative architectures that approach the effectiveness of current multi-billion parameter systems.
  • Interpretable reasoning: Integrating explicit knowledge graph reasoning or rationale generation.
  • Adaptive external querying: Endowing models with learned self-assessment, making API calls or searches only when internal confidence is low (Erbacher et al., 3 Jan 2024).
  • Data and metric refinement: Curating pre-training datasets based on predicted SMI impact and adopting multi-faceted evaluation protocols to measure diverse reasoning and generalization capabilities (Jiang et al., 6 Feb 2025, Ciosici et al., 2021).
  • Hybrid and self-contextualizing strategies: Merging internal LM knowledge with dynamically generated or retrieved context while controlling computation and verifying answer quality (Kokaia et al., 2023, Su et al., 2022).

Systematic integration of these advances is expected to mitigate hallucination, improve factual robustness, and provide transparent, efficient, and scalable solutions to challenging knowledge-intensive question answering tasks.
