FACTS Parametric: Factual Memory in LLMs
- FACTS Parametric is a benchmark that assesses how language models encode and recall factual information directly in their parameters without external retrieval.
- It employs adversarial filtering and diverse prompt formats to evaluate fact recall via closed-book QA, using metrics like accuracy and hedging rate.
- Findings indicate that model scale, training-data quality, and prompt design all improve parametric factual recall, though stronger recall is accompanied by greater hallucination risk on queries outside the training distribution.
A “parametric” fact, or “parametric knowledge,” refers to explicit factual information that is stored directly inside a model’s parameter tensors as a consequence of pre-training or supervised fine-tuning, rather than in an external memory or via retrieval mechanisms. In natural language modeling, parametric recall is evaluated by presenting the model with closed-book factoid questions: if it can produce the correct answer purely from inference over its weights, without any access to search engines, databases, or tool-augmented context, then the fact is deemed stored parametrically. Evaluating parametric factuality is critical for understanding the internal world knowledge of LLMs and for benchmarking their ability to recall such knowledge reliably. Distinguishing parametric from in-context or retrieval-based knowledge is foundational in both NLP evaluation and neural knowledge systems.
1. Definition and Conceptual Distinctions
Parametric knowledge is defined as factual information (such as the capital of a country, an individual’s date of birth, or a scientific constant) that has been directly encoded in the persistent weights of a model through self-supervised pre-training, supervised learning, or related processes. In contrast, non-parametric (retrieval-based) knowledge is stored outside the model (e.g., in a database or document corpus) and is only accessible at inference time through explicit lookup or retrieval modules (Lin et al., 13 Aug 2025, Zhao et al., 18 Jun 2024).
In formal evaluation settings, a fact ⟨subject, relation, object⟩ is classified as parametrically stored if the model, when presented with a factoid prompt in the absence of any external retrieval, produces the correct object with high accuracy. Typical queries are cloze-style (e.g., “[X] is the capital of [MASK]”) or direct questions (“who is the president of France?”) (Zhao et al., 18 Jun 2024).
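The following is a minimal sketch of such a closed-book probe, assuming a generic Hugging Face causal LM; the model name, prompt template, and lenient string-match check are illustrative rather than the protocol of any particular benchmark.

```python
# Minimal closed-book probe: a fact <subject, relation, object> counts as
# parametrically stored if the model produces the object from its weights alone,
# with no retrieval or tool use. Model name and prompt are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # assumption: any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def closed_book_answer(question: str, max_new_tokens: int = 16) -> str:
    """Generate an answer from parameters only: no tools, no retrieved context."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

fact = {"subject": "France", "relation": "capital", "object": "Paris"}
prediction = closed_book_answer("Question: What is the capital of France?\nAnswer:")
is_parametric = fact["object"].lower() in prediction.lower()  # lenient string-match proxy
```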
Parametric memory therefore refers to the set of all facts and regularities that are implicitly encoded into the model parameter space and can be recalled purely through the forward pass of the network without augmentation or search (Lin et al., 13 Aug 2025, Cheng et al., 11 Dec 2025).
2. Benchmarks and Empirical Measurement
Parametric factuality is systematically measured by specialized benchmarks designed to stress-test the closed-book knowledge of LLMs. The most prominent of these is FACTS Parametric, a sub-benchmark in the FACTS Leaderboard suite, which evaluates LLMs' ability to answer atomic, single-fact questions across diverse topics using only their parameters (Cheng et al., 11 Dec 2025).
FACTS Parametric Benchmark
| Attribute | Details |
|---|---|
| Size | 2,104 QA pairs (split evenly between public and private sets) |
| Source | User traffic, adversarially filtered, Wikipedia-verified |
| Coverage | Politics, sports, science, arts, “long tail” topics |
| Judging | Gemini 2.5 Pro as judge, three evaluations per item, closed-book |
Each response is automatically labeled as correct, incorrect, not-attempted (“I don’t know”), or unknown. Primary metrics include accuracy, hedging rate, attempted accuracy, and F1-score.
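Over these four labels, the metrics can plausibly be written as follows; this is a hedged reconstruction, and the benchmark's exact definitions (for instance, how “unknown” judgments are counted) may differ.

$$
\text{accuracy}=\frac{C}{T},\qquad
\text{hedging rate}=\frac{N}{T},\qquad
\text{attempted accuracy}=\frac{C}{C+I},\qquad
F_1=\frac{2\cdot\text{accuracy}\cdot\text{attempted accuracy}}{\text{accuracy}+\text{attempted accuracy}}
$$

where \(C\), \(I\), and \(N\) count correct, incorrect, and not-attempted responses, and \(T\) is the total number of questions.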
FACTS Parametric’s adversarial curation ensures that questions answered correctly by any open-weight model are filtered out, focusing on the genuinely challenging “long tail” of world knowledge (Cheng et al., 11 Dec 2025).
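The filtering step can be pictured as a simple rejection loop over candidate questions; this is a sketch assuming the filter drops any question solved by at least one reference open-weight model, with `open_weight_models` and `is_correct` as hypothetical placeholders.

```python
# Sketch of adversarial filtering: keep only questions that no reference
# open-weight model answers correctly in a closed-book setting.
# `open_weight_models` (objects with an .answer() method) and `is_correct`
# (a judging function) are hypothetical placeholders.
def adversarially_filter(candidates, open_weight_models, is_correct):
    kept = []
    for q in candidates:
        solved = any(
            is_correct(model.answer(q["question"]), q["gold_answer"])
            for model in open_weight_models
        )
        if not solved:  # genuinely hard "long tail" question survives the filter
            kept.append(q)
    return kept
```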
Other notable multi-faceted parametric knowledge probes include BELIEF and BELIEF-ICL (Zhao et al., 18 Jun 2024), which assess accuracy, consistency, and confidence calibration of the parametric recall using highly diversified, multi-prompt templates.
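A multi-prompt probe of this kind can be summarized per fact by accuracy over templates and agreement among the answers, as in the hedged sketch below; `answer_fn` and `templates` are placeholders, and BELIEF's actual scoring additionally covers confidence calibration.

```python
from collections import Counter

# Multi-prompt probing in the spirit of BELIEF: query the same fact with many
# paraphrased templates, then aggregate accuracy and answer consistency.
def probe_fact(answer_fn, templates, subject, gold_object):
    answers = [answer_fn(t.format(subject=subject)) for t in templates]
    hits = [gold_object.lower() in a.lower() for a in answers]
    accuracy = sum(hits) / len(hits)                    # fraction of templates answered correctly
    top_count = Counter(a.strip().lower() for a in answers).most_common(1)[0][1]
    consistency = top_count / len(answers)              # agreement of the most common answer
    return accuracy, consistency
```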
3. Dynamics of Parametric Knowledge Acquisition
Large-scale studies on factual learning dynamics in LLMs have shown that encoding facts into parametric memory follows a multi-phase trend (Zucchet et al., 27 Mar 2025). During pre-training or fine-tuning:
- Phase I (Statistics learning): The model rapidly captures distributional statistics of attributes (e.g., common locations, frequent professions).
- Phase II (Plateau): Learning stalls while attention-based extraction circuits assemble; accuracy on individual fact recall remains negligible.
- Phase III (Knowledge emergence): Once circuits form, the model rapidly acquires specific entity-relation bindings, and fact recall surges.
The length of Phase II, the “parametric bottleneck,” is primarily determined by the distribution of fact frequencies in the training data. Highly imbalanced (“Zipfian”) data lead to faster acquisition for head entities but slower acquisition for tail entities. Curriculum strategies, such as a two-stage warmup focusing on subsets of facts, reduce the plateau length and improve tail-fact learning. A plausible implication is that data scheduling and diversity are as important as model capacity for robust parametric factuality (Zucchet et al., 27 Mar 2025, Lin et al., 13 Aug 2025).
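The scheduling idea can be illustrated with a small sampler in which fact exposure frequencies follow a Zipfian law and a warmup stage oversamples a subset of facts before switching to the full distribution; all parameters here are illustrative, not the experimental settings of the cited work.

```python
import numpy as np

# Illustrative fact-exposure schedule: Zipfian frequencies over facts, preceded
# by a warmup stage that concentrates on a small subset of facts.
def zipf_weights(n_facts: int, exponent: float = 1.0) -> np.ndarray:
    ranks = np.arange(1, n_facts + 1)
    w = ranks ** (-float(exponent))
    return w / w.sum()

def sample_fact_stream(n_facts=10_000, n_steps=1_000_000, warmup_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    warmup_steps = int(warmup_frac * n_steps)
    subset = np.arange(n_facts // 10)                     # stage 1: focus on 10% of facts
    stage1 = rng.choice(subset, size=warmup_steps)
    stage2 = rng.choice(n_facts, size=n_steps - warmup_steps, p=zipf_weights(n_facts))
    return np.concatenate([stage1, stage2])               # sequence of fact indices to present
```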
4. Model Scale, Data, and Training Methodology
Parametric factual recall increases nearly monotonically with model parameter count and training data volume, but both scaling and data quality matter. Experimental evidence from BELIEF-ICL (Zhao et al., 18 Jun 2024) shows that:
- Larger models (e.g., Llama3-70B) exhibit higher Acc@1 and consistency in parametric recall relative to smaller counterparts (e.g., Llama2-7B).
- Pretraining on larger, more diverse, and educationally rich corpora enhances fact encoding, especially for low-frequency facts. For example, the Phi3-mini (3.8B), trained on textbook-style data, surpasses Llama2-7B despite its smaller size.
- Training strategies (e.g., whole-word masking in BERT_wwm) and prompt diversity are also significant factors.
Active Reading, a two-stage data augmentation pipeline (strategy generation followed by synthetic example instantiation), substantially improves parametric knowledge absorption relative to vanilla document repetition or shallow augmentation. Task-specific Active Reading yields up to +793% relative accuracy gains over base finetuning on SimpleWikiQA and +160% on FinanceBench (Lin et al., 13 Aug 2025).
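A hedged sketch of the two-stage shape of such a pipeline follows; the prompts and the `llm` completion callable are placeholders, not the actual Active Reading prompts.

```python
# Two-stage augmentation in the spirit of Active Reading:
# (1) ask a model to propose study strategies for a document,
# (2) instantiate each strategy as synthetic training text.
# `llm` is a placeholder text-completion callable; prompts are illustrative.
def active_reading(document: str, llm, n_strategies: int = 5):
    strategies = llm(
        f"Propose {n_strategies} distinct strategies for studying this document "
        f"(e.g., self-quizzing, paraphrasing, building timelines):\n\n{document}"
    ).splitlines()

    synthetic_examples = []
    for strategy in strategies:
        if not strategy.strip():
            continue
        synthetic_examples.append(
            llm(f"Apply the study strategy '{strategy}' to the document below and "
                f"write out the resulting study material:\n\n{document}")
        )
    return synthetic_examples  # used as fine-tuning data instead of raw document repetition
```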
At pre-training scale, applying Active Reading to 6 million Wikipedia articles (1 trillion tokens) produces Meta WikiExpert-8B, which outperforms models with orders of magnitude more parameters on closed-book QA (Lin et al., 13 Aug 2025).
5. Editing, Hallucinations, and Robustness
Parametric memory is notably difficult to edit. Fine-tuning to inject new facts into an LLM’s weights typically causes parametric memory corruption, erasing unrelated facts (catastrophic forgetting), and is accompanied by a “slow acquisition” effect for new facts (Zucchet et al., 27 Mar 2025). Mechanistically, corruption affects model feed-forward (MLP) “key–value” memories, while attention patterns remain relatively stable.
Stubborn knowledge—facts for which the model has high pre-training confidence—tends to resist “in-context editing” (ICE) approaches: even if ICE boosts the new fact’s probability, the original fact remains overconfident, hampering successful editing (Bi et al., 19 May 2024). The DeCK (Decoding by Contrasting Knowledge) framework addresses this by jointly comparing ICE-augmented and unedited logits at each decoding step, contrastively down-weighting stubborn original parametric knowledge. DeCK improves stubborn fact editability by up to 219% on standard benchmarks and shifts probability mass for edited facts from <0.2 to >0.8 (Bi et al., 19 May 2024).
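The core decoding idea can be sketched as a generic contrastive step over the two logit streams; this is a simplified illustration under assumed hyperparameters (`alpha`, a plausibility cutoff), not the exact DeCK formulation.

```python
import torch

# Contrastive decoding step in the spirit of DeCK: logits conditioned on the
# in-context edit are contrasted against logits from the unedited prompt,
# down-weighting tokens that the stubborn parametric memory still prefers.
def contrastive_step(logits_edited: torch.Tensor,
                     logits_unedited: torch.Tensor,
                     alpha: float = 1.0,
                     plausibility_cutoff: float = 0.1) -> int:
    log_p_edit = torch.log_softmax(logits_edited, dim=-1)
    log_p_base = torch.log_softmax(logits_unedited, dim=-1)
    scores = log_p_edit + alpha * (log_p_edit - log_p_base)   # amplify what the edit changes

    # Keep only tokens that are reasonably probable under the edited view,
    # so rare tokens are not promoted purely by the contrast term.
    threshold = log_p_edit.max() + torch.log(torch.tensor(plausibility_cutoff))
    scores = scores.masked_fill(log_p_edit < threshold, float("-inf"))
    return int(scores.argmax())                                # next-token id
```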
Hallucinations in parametric models—confident predictions for unseen or fabricated inputs—arise simultaneously with the ability to recall learned facts. Specifically, as the attention extraction circuit forms during training, hallucinations on held-out entities increase (Zucchet et al., 27 Mar 2025). This suggests a fundamental limitation: robust parametric recall is inherently accompanied by hallucination risk for out-of-distribution queries.
6. Prompt Engineering and Probing Methodologies
Prompt diversity and context are critical for eliciting parametric facts. BELIEF-ICL demonstrates that using multiple, diverse templates for each fact (prompt ensembling) and providing in-context exemplars (especially from the same relation/template) dramatically increase recall accuracy (e.g., Llama3-8B: 0.34 zero-shot → 0.73 with 4-template ICL) (Zhao et al., 18 Jun 2024).
There is a pronounced “coverage gap”: no single prompt elicits all parametric knowledge, and fact recall can increase by 30–66 percentage points when the best prompt for each specific fact is selected (oracle accuracy). Prompt-based probing therefore demands large-scale prompt sets to avoid underestimating model knowledge (Zhao et al., 18 Jun 2024). This motivates automated prompt diversification (e.g., MyriadLAMA: ~100 templates per relation) and clustering or prompt-selection optimization to keep evaluation robust and tractable.
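The gap itself is easy to compute once per-fact, per-template correctness is recorded: compare the best single fixed prompt against the oracle that picks the best template for each fact. The sketch below assumes a nested results dictionary and is illustrative rather than the cited benchmark's code.

```python
# Coverage gap: accuracy of the best single fixed prompt vs. oracle accuracy,
# where the best template is chosen per fact. `results[fact_id][template_id]`
# is 1 if that template elicited the correct answer, else 0.
def coverage_gap(results):
    template_ids = next(iter(results.values())).keys()
    per_template = {
        t: sum(r[t] for r in results.values()) / len(results) for t in template_ids
    }
    best_single = max(per_template.values())                       # best fixed prompt
    oracle = sum(max(r.values()) for r in results.values()) / len(results)
    return best_single, oracle                                     # gap = oracle - best_single
```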
7. Limitations, Extensions, and Hybrid Approaches
The principal limitations of parametric factual memory are:
- Catastrophic Overwrite: Fine-tuning for new facts degrades previously stored facts without replay/interleaving, due to feed-forward memory interference (Zucchet et al., 27 Mar 2025).
- Hallucination Risk: Hallucinations increase for entities/facts not present in pre-training data, especially as attention circuits become specialized (Zucchet et al., 27 Mar 2025).
- Stubbornness: Highly confident (frequent) facts are resistant to editing via context; advanced contrastive decoding (e.g., DeCK) provides partial remediation (Bi et al., 19 May 2024).
- Prompt Sensitivity: Factual recall is highly sensitive to prompt design; covering the full knowledge base requires extensive prompt engineering (Zhao et al., 18 Jun 2024).
Hybrid strategies, such as integrating external retrieval with a stable attention backbone or using retrieval-augmented ICE, are suggested to overcome the limitations of purely parametric knowledge—especially for dynamic, long-tail, or time-sensitive facts (Lin et al., 13 Aug 2025, Zhao et al., 18 Jun 2024, Zucchet et al., 27 Mar 2025). Advanced data scheduling, lifelong/active reading frameworks, and dynamic parametric-retrieval blending represent emerging directions for robust factual knowledge modeling.
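One simple form of such blending is a hedging-triggered fallback: answer from parameters first and invoke retrieval only when the model declines to answer. The sketch below is a generic illustration with placeholder `answer_fn` and `retrieve` callables, not a specific published system.

```python
# Parametric-first hybrid: answer closed-book, fall back to retrieval-augmented
# answering only when the model hedges. All callables are placeholders.
HEDGES = ("i don't know", "i am not sure", "cannot answer")

def hybrid_answer(question: str, answer_fn, retrieve):
    parametric = answer_fn(question)
    if any(h in parametric.lower() for h in HEDGES):
        context = "\n".join(retrieve(question, k=5))   # external documents as context
        return answer_fn(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return parametric
```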
References:
- (Cheng et al., 11 Dec 2025) The FACTS Leaderboard: A Comprehensive Benchmark for LLM Factuality
- (Zhao et al., 18 Jun 2024) What Matters in Memorizing and Recalling Facts? Multifaceted Benchmarks for Knowledge Probing in LLMs
- (Zucchet et al., 27 Mar 2025) How do LLMs learn facts? Dynamics, curricula and hallucinations
- (Lin et al., 13 Aug 2025) Learning Facts at Scale with Active Reading
- (Bi et al., 19 May 2024) Decoding by Contrasting Knowledge: Enhancing LLMs' Confidence on Edited Facts