Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

Published 11 May 2026 in cs.CL and cs.AI | (2605.10640v1)

Abstract: Continual Pre-Training (CPT) is essential for enabling language models (LMs) to integrate new knowledge without erasing old knowledge. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called Selecting Tokens via attentiOn Contribution (STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.


Authors (6)

Summary

The paper presents a theoretical framework analyzing token-level dynamics in a single-layer Transformer to understand continual factual knowledge acquisition.
It shows that data replay, specifically the novel STOC algorithm, significantly preserves pretraining knowledge while integrating new information.
Empirical results confirm that STOC outperforms regularization methods in mitigating catastrophic forgetting across synthetic and real-world datasets.

Understanding Continual Factual Knowledge Acquisition: Theoretical Insights and Algorithmic Advances

Introduction

This work addresses the fundamental problem of continual factual knowledge acquisition (cFKA) in LMs, with a specific focus on the mechanisms that govern knowledge integration and retention during continual pre-training (CPT). Despite the widespread application of CPT to update LMs with emerging domains or facts, catastrophic forgetting (the erasure of pre-existing knowledge when adapting to new data) remains poorly understood at a mechanistic level. The authors construct a theoretical framework, anchored in the dynamics of a single-layer Transformer, to analyze cFKA, providing insights that unify and extend empirical findings in both simplified and practical LM settings. The study assesses two standard continual learning (CL) strategies, regularization-based methods and data replay, and proposes a novel generative data replay algorithm, Selecting Tokens via attentiOn Contribution (STOC), which leverages token-level attention dynamics to guide replay generation. Extensive experiments support the theoretical predictions and the empirical effectiveness of STOC in mitigating catastrophic forgetting.

Theoretical Framework for cFKA Dynamics

The theoretical core of the paper is an analysis of training dynamics in a simplified single-layer Transformer, with learning rates decoupled between the feedforward (Y) and attention (Z) parameters. Factual knowledge is represented in the standard (subject, relation, object) schema, which matches both the synthetic and the real datasets, and the next-token prediction objective is used in both the pre-training (PT) and CPT phases.

The principal theoretical results include:

Frequency-based Token Storage: The factual information is partitioned and stored at the token level, such that higher-frequency tokens converge more rapidly. The convergence of parameter Y is driven toward a reference state encoding Bayesian-optimal predictions, with the error term decaying exponentially at a rate modulated by token frequency and associated diversity in object mapping.
Diversity-Aware Attention Assignment: Attention scores for tokens are shown to be inversely correlated with a derived Diversity Index (DI), essentially a function of the entropy of object associations per token. Tokens with distributed associations across many objects (high DI) receive lower attention scores. This mechanistic insight matches empirical observations of attention focusing during knowledge attribution.
Role of Data Augmentation: Through analysis and experiments on augmented versus non-augmented datasets, the framework predicts and verifies that data augmentation (multiple textual expressions per fact) reduces the DI for relation tokens and generalizes the model's prediction across formats by shifting attention toward subject information.
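
To make the Diversity Index idea concrete, the sketch below computes a per-token entropy over the objects each token co-occurs with. It is an illustration only: the summary describes DI as "essentially a function of the entropy of object associations per token" without giving the exact formula, so plain Shannon entropy is used here as a stand-in, and the function name and toy facts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def object_entropy_per_token(facts):
    """Shannon entropy (in nats) of the objects that co-occur with each input token.

    `facts` is an iterable of (subject, relation, object) triples. Subject and
    relation strings are both treated as input tokens. This is an assumed
    stand-in for the paper's Diversity Index, not its exact definition.
    """
    object_counts = defaultdict(Counter)
    for subject, relation, obj in facts:
        object_counts[subject][obj] += 1
        object_counts[relation][obj] += 1

    entropy = {}
    for token, counts in object_counts.items():
        total = sum(counts.values())
        entropy[token] = -sum(
            (c / total) * math.log(c / total) for c in counts.values()
        )
    return entropy


# Toy facts (hypothetical): the relation token maps to many objects,
# while most subject tokens map to a single object.
facts = [
    ("Alice", "born_in", "Paris"),
    ("Bob", "born_in", "Rome"),
    ("Carol", "born_in", "Oslo"),
    ("Alice", "works_at", "Acme"),
]
di = object_entropy_per_token(facts)
for token, value in sorted(di.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:10s} {value:.3f}")
```

In this toy corpus the relation token `born_in` is associated with three different objects and therefore has the highest entropy, while a subject such as `Bob` that maps to a single object has zero entropy. Under the result described above, the high-entropy relation token would be expected to receive less attention than the low-entropy subject tokens.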

Crucially, all theoretical predictions are validated not only in controlled synthetic settings but also on modern multi-layer LMs, supporting the claim that these simplified dynamics persist in practical large-scale LMs.

Continual Learning: Regularization vs. Data Replay

The study provides a unified theoretical and empirical examination of two prevalent CPT strategies:

Regularization-Based Methods

Regularization methods, such as Elastic Weight Consolidation (EWC), add parameter constraints that limit drift away from the pretraining optimum. The theoretical analysis shows that regularization influences only the convergence rate of the factual knowledge parameters, not their convergence point. Regularization therefore cannot fundamentally prevent catastrophic forgetting in cFKA; it merely slows it. Experiments confirm a decelerated forgetting trajectory but ultimately show substantial loss of pretraining knowledge regardless of regularization strength.
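
For readers unfamiliar with EWC, the following minimal sketch shows the general shape of its quadratic penalty, which weights each parameter's drift from its pretrained value by a diagonal Fisher information estimate. It is not the paper's experimental setup: the function name, the toy parameter vectors, and the Fisher values are all made up for illustration.

```python
def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """Quadratic EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameter values during CPT
    anchor_params -- parameter values at the end of pretraining (theta*)
    fisher_diag   -- per-parameter importance weights (diagonal Fisher estimate)
    strength      -- regularization coefficient (often written lambda)
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# Hypothetical values, chosen only to make the arithmetic easy to follow.
theta_star = [1.0, -2.0, 0.5]   # parameters after pretraining
theta = [1.5, -2.0, 0.0]        # parameters at some point during CPT
fisher = [4.0, 1.0, 0.1]        # assumed importance of each parameter

cpt_loss = 0.8                  # placeholder for the next-token loss on new data
total_loss = cpt_loss + ewc_penalty(theta, theta_star, fisher, strength=2.0)
print(total_loss)
```

The penalty grows with `strength` and with the importance weights, so drift in parameters judged important for old knowledge is discouraged more strongly. The sketch only shows how the constraint enters the loss; the claim that this changes the speed of forgetting rather than its end point comes from the paper's analysis, not from this example.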

Data Replay Methods

Data replay, on the other hand, integrates samples from pretraining during CPT. The framework establishes that replay shifts the convergence point, directly preserving previously acquired knowledge in model parameters. Even a small replay fraction is shown to stabilize old knowledge significantly. Empirically, systematic replay outperforms random or limited replay, highlighting the need for broad factual coverage in replay samples. The results demonstrate that replay not only mitigates forgetting but can modulate retention confidence by amplifying oscillations around the convergence point.
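
As a rough illustration of what a replay fraction means in practice, the sketch below mixes pretraining samples into a CPT dataset so that they make up a chosen share of the result. The function name, the seed handling, and the uniform random sampling are assumptions made for brevity. The summary above reports that systematic replay with broad factual coverage works better than random replay, so the sampling step is the part most likely to differ in a real pipeline.

```python
import random


def mix_replay(new_samples, replay_pool, replay_fraction, seed=0):
    """Return a shuffled list in which about `replay_fraction` of items are replayed.

    new_samples     -- CPT samples carrying new knowledge
    replay_pool     -- samples drawn from (or generated to resemble) pretraining data
    replay_fraction -- target share of replayed items in the mixed set, in [0, 1)
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_new + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(new_samples) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(replay_pool))  # cannot replay more than we have
    replayed = rng.sample(list(replay_pool), n_replay)
    mixed = list(new_samples) + replayed
    rng.shuffle(mixed)
    return mixed


# Toy data (hypothetical identifiers standing in for text samples).
new_data = [f"new-{i}" for i in range(90)]
old_data = [f"old-{i}" for i in range(50)]
mixed = mix_replay(new_data, old_data, replay_fraction=0.1)
n_old = sum(item.startswith("old-") for item in mixed)
print(len(mixed), n_old)
```

With 90 new samples and a replay fraction of 0.1, ten pretraining samples are added so that they form one tenth of the mixed set.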

STOC: Attention-Guided Generative Data Replay

Building on these dynamics, the authors introduce STOC, a generative replay algorithm that exploits the transformer-specific mechanism of diversity-aware attention:

Token Contribution Scoring: For each CPT sample, STOC computes aggregated attention scores per token, identifying tokens that tightly constrain prediction distribution (i.e., those with high knowledge specificity).
Replay Generation: High-contribution tokens seed the generation of replay prompts, which the pretraining LM then uses to generate pseudo-replay data enriched for retained factual information.
Data Quality Filtering: Optional deduplication and relevance-based filtering refine generated replay instances.
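
The first two steps above can be sketched as follows, under several assumptions that are not confirmed by the summary: attention is aggregated by averaging the weight each token receives across all heads and query positions, the top-k tokens are kept in their original order, and the replay prompt is just those tokens joined by spaces. All function names are hypothetical, and the generation and filtering steps are omitted because they depend on the specific pretrained LM.

```python
def token_contributions(attention):
    """Mean attention received by each key position.

    attention[h][q][k] is the weight head h assigns from query position q to
    key position k. Averaging over heads and queries is an assumed
    aggregation rule, not necessarily the one used by STOC.
    """
    n_heads = len(attention)
    seq_len = len(attention[0])
    scores = [0.0] * seq_len
    for head in attention:
        for row in head:
            for k, weight in enumerate(row):
                scores[k] += weight
    return [s / (n_heads * seq_len) for s in scores]


def select_seed_tokens(tokens, scores, top_k):
    """Keep the top_k highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_k])
    return [tokens[i] for i in keep]


def build_replay_prompt(seed_tokens):
    """Hypothetical prompt template: the seed tokens joined by spaces."""
    return " ".join(seed_tokens)


# Toy example: one head, causal attention over four tokens (made-up weights).
tokens = ["Alice", "was", "born", "in"]
attention = [[
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0],
    [0.5, 0.1, 0.3, 0.1],
]]
scores = token_contributions(attention)
seeds = select_seed_tokens(tokens, scores, top_k=2)
prompt = build_replay_prompt(seeds)
print(scores, seeds, prompt)
```

In a full pipeline, the resulting prompt would be passed to the pretrained LM to generate pseudo-replay text, which could then be deduplicated and filtered as described in the third step.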

Comprehensive evaluation shows that STOC consistently achieves higher knowledge retention and better performance on new knowledge compared to existing generative replay methods (e.g., LAMOL), across synthetic biography tasks and real-world benchmarks such as ZSRE, Wiki_Bio, Wiki_Recent, and domain-specific corpora (including legal datasets). STOC's improvements are stable under scalable CPT and maintain performance on large, heterogeneous continual pretraining datasets.

Implications, Limitations, and Future Directions

From a practical standpoint, these findings advance the state of continual learning for LMs by providing algorithmic tools and theoretical guarantees for balancing the acquisition of new information with the preservation of prior knowledge—essential in safety-critical or domain-adaptive LM deployments. STOC’s transformer-aware design, rooted in mechanistic attention analysis, is broadly applicable and composable with other CL techniques such as layer freezing.

Theoretically, the explicit link between token-level statistics and attention allocation elucidates the distributed storage of factual knowledge, offering new paradigms for probing, interpreting, and editing LMs. The scaling of pretraining and CPT interventions (data augmentation, replay fraction) can now be better predicted and optimized in deployment.

Limitations include the reliance on a synthetic data framework for tight theoretical-experimental correspondence, the reduced model depth in tractable analysis, and the absence of formal extension to complex multi-layer attention mechanisms or nonlinear activations present in large LMs. The generalization to more expressive factual formats and further architectural variants presents promising future research directions.

Conclusion

This work establishes a rigorous, theory-driven foundation for continual factual knowledge acquisition in LMs. It demonstrates, both in theory and in practice, that standard regularization cannot avert catastrophic forgetting whereas replay-based strategies can, and it proposes the STOC algorithm for effective generative replay. Experiments on synthetic and real benchmarks support STOC's effectiveness. These results have substantive implications for the design and maintenance of continually evolving LMs and open new avenues for scalable, theory-informed continual learning in modern NLP systems.

For in-depth technical details and code, see (2605.10640).
