- The paper presents a theoretical framework analyzing token-level dynamics in a single-layer Transformer to understand continual factual knowledge acquisition.
- It shows that data replay, and in particular the proposed STOC algorithm, preserves pretraining knowledge while integrating new information.
- Empirical results confirm that STOC outperforms regularization methods in mitigating catastrophic forgetting across synthetic and real-world datasets.
Understanding Continual Factual Knowledge Acquisition: Theoretical Insights and Algorithmic Advances
Introduction
This work addresses continual factual knowledge acquisition (cFKA) in language models (LMs), focusing on the mechanisms that govern knowledge integration and retention during continual pre-training (CPT). Although CPT is widely used to update LMs with emerging domains or facts, catastrophic forgetting (the erasure of pre-existing knowledge when adapting to new data) remains poorly understood at a mechanistic level. The authors construct a theoretical framework, anchored in the training dynamics of a single-layer Transformer, to analyze cFKA, and use it to unify and extend empirical findings in both simplified and practical LM settings. The study assesses two standard continual learning (CL) strategies, regularization-based methods and data replay, and proposes a generative data replay algorithm, Selecting Tokens via attentiOn Contribution (STOC), which uses token-level attention dynamics to guide replay generation. Experiments support the theoretical predictions and show that STOC is effective at mitigating catastrophic forgetting.
Theoretical Framework for cFKA Dynamics
The theoretical core of the paper is an analysis of training dynamics in a simplified single-layer Transformer, with learning rates decoupled between the feedforward (Y) and attention (Z) modules. Factual knowledge is represented in the standard (subject, relation, object) schema, matching both the synthetic and real datasets, and the Next Token Prediction objective is used in both the pretraining (PT) and CPT phases.
The principal theoretical results include:
- Frequency-based Token Storage: The factual information is partitioned and stored at the token level, such that higher-frequency tokens converge more rapidly. The convergence of parameter Y is driven toward a reference state encoding Bayesian-optimal predictions, with the error term decaying exponentially at a rate modulated by token frequency and associated diversity in object mapping.
- Diversity-Aware Attention Assignment: Attention scores for tokens are shown to be inversely correlated with a derived Diversity Index (DI), essentially a function of the entropy of object associations per token. Tokens with distributed associations across many objects (high DI) receive lower attention scores. This mechanistic insight matches empirical observations of attention focusing during knowledge attribution.
- Role of Data Augmentation: Through analysis and experiments on augmented versus non-augmented datasets, the framework predicts and verifies that data augmentation (multiple textual expressions per fact) reduces the DI for relation tokens and generalizes the model's prediction across formats by shifting attention toward subject information.
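To make the diversity idea concrete, the sketch below computes the Shannon entropy of the objects each subject or relation token co-occurs with. This is only a proxy: the summary says the Diversity Index is "essentially a function of the entropy of object associations per token", and the paper's exact definition may differ. The function name `object_entropy_per_token` and the toy facts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def object_entropy_per_token(facts):
    """Return, for each subject/relation token, the entropy (in nats) of
    the empirical distribution over objects it appears with.

    Assumption: this entropy stands in for the paper's Diversity Index,
    whose exact form is not given in this summary.
    """
    counts = defaultdict(Counter)
    for subject, relation, obj in facts:
        counts[subject][obj] += 1
        counts[relation][obj] += 1

    entropy = {}
    for token, objs in counts.items():
        total = sum(objs.values())
        entropy[token] = -sum(
            (c / total) * math.log(c / total) for c in objs.values()
        )
    return entropy


# Toy (subject, relation, object) facts, invented for illustration.
facts = [
    ("Alice", "born_in", "Paris"),
    ("Bob", "born_in", "Rome"),
    ("Carol", "born_in", "Oslo"),
    ("Alice", "works_at", "Acme"),
]
scores = object_entropy_per_token(facts)
# "born_in" maps to three different objects, "Bob" to exactly one.
```

Under this proxy, the relation token `born_in` has the highest entropy and the subject token `Bob` the lowest, which is the direction the summary describes: tokens spread across many objects are the ones expected to receive less attention.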
Crucially, all theoretical predictions are validated not only in controlled synthetic settings but also on modern multi-layer LMs, supporting the claim that these simplified dynamics persist in practical LLMs.
Continual Learning: Regularization vs. Data Replay
The study provides a unified theoretical and empirical examination of two prevalent CPT strategies:
Regularization-Based Methods
Regularization methods, such as Elastic Weight Consolidation (EWC), add parameter constraints that limit drift away from the pretraining optimum. The theoretical analysis shows that regularization influences only the convergence rate, not the convergence point, of the factual knowledge parameters. Regularization therefore cannot fundamentally prevent catastrophic forgetting in cFKA; it merely slows it. Experiments confirm a slower forgetting trajectory but ultimately show substantial loss of pretraining knowledge regardless of regularization strength.
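For readers unfamiliar with EWC, the sketch below shows the general shape of its penalty: a quadratic term that pulls each parameter toward its pretraining value, weighted by a per-parameter importance (a diagonal Fisher estimate in the original EWC formulation). The function names and the flat-list parameter layout are illustrative assumptions, not the paper's experimental code.

```python
def ewc_penalty(params, ref_params, fisher, strength):
    """Quadratic EWC-style penalty:
    (strength / 2) * sum_i fisher_i * (params_i - ref_params_i) ** 2.

    params, ref_params, and fisher are equal-length lists of floats
    (a simplifying assumption; real models use tensors).
    """
    return 0.5 * strength * sum(
        f * (p - r) ** 2 for p, r, f in zip(params, ref_params, fisher)
    )


def regularized_loss(task_loss, params, ref_params, fisher, strength):
    """CPT loss on new data plus the penalty toward pretraining parameters."""
    return task_loss + ewc_penalty(params, ref_params, fisher, strength)


# Only the first parameter has moved away from its pretraining value.
penalty = ewc_penalty(
    params=[1.0, 2.0], ref_params=[0.0, 2.0], fisher=[2.0, 5.0], strength=3.0
)
```

The penalty grows with distance from the pretraining parameters, which is why raising `strength` slows drift during CPT.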
Data Replay Methods
Data replay, on the other hand, integrates samples from pretraining during CPT. The framework establishes that replay shifts the convergence point, directly preserving previously acquired knowledge in model parameters. Even a small replay fraction is shown to stabilize old knowledge significantly. Empirically, systematic replay outperforms random or limited replay, highlighting the need for broad factual coverage in replay samples. The results demonstrate that replay not only mitigates forgetting but can modulate retention confidence by amplifying oscillations around the convergence point.
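A toy way to see why replay moves the convergence point: for a fixed context, the next-token distribution that minimizes cross-entropy over a mixture of data sources is the matching mixture of the per-source conditional distributions. The sketch below applies that general fact to one context. It is not the paper's derivation; `mixed_target` is a hypothetical helper, and `replay_fraction` here means the share of this context's occurrences that come from pretraining data.

```python
def mixed_target(old_dist, new_dist, replay_fraction):
    """Cross-entropy-optimal object distribution for one context when
    `replay_fraction` of its occurrences come from old (pretraining) data
    and the rest from new (CPT) data.

    old_dist and new_dist map object -> probability under each source.
    """
    objects = set(old_dist) | set(new_dist)
    return {
        o: replay_fraction * old_dist.get(o, 0.0)
        + (1.0 - replay_fraction) * new_dist.get(o, 0.0)
        for o in objects
    }


# Invented example: the same context points to different objects
# in the old and new data.
old = {"Paris": 1.0}
new = {"Rome": 1.0}

no_replay = mixed_target(old, new, replay_fraction=0.0)
with_replay = mixed_target(old, new, replay_fraction=0.25)
```

With no replay, the optimal target places zero mass on the old object, so any training that converges loses it. With a nonzero replay fraction, the old object keeps a share of the probability mass at the optimum itself.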
STOC: Attention-Guided Generative Data Replay
Building on these dynamics, the authors introduce STOC, a generative replay algorithm that exploits the transformer-specific mechanism of diversity-aware attention:
- Token Contribution Scoring: For each CPT sample, STOC computes aggregated attention scores per token, identifying tokens that tightly constrain the prediction distribution (i.e., those with high knowledge specificity).
- Replay Generation: High-contribution tokens seed the generation of replay prompts, which the pretraining LM then uses to generate pseudo-replay data enriched for retained factual information.
- Data Quality Filtering: Optional deduplication and relevance-based filtering refine generated replay instances.
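The first two steps above can be sketched as follows. This summary does not specify how STOC aggregates attention or how it turns selected tokens into prompts, so the sketch makes two assumptions: a token's contribution is the total attention it receives across all heads and query positions, and the prompt is the selected tokens joined in their original order. All function names are hypothetical, and the generation step (feeding the prompt to the pretraining LM) is omitted.

```python
def token_contributions(tokens, attention):
    """Sum the attention each token receives.

    attention[h][q][k] is the weight query position q places on key
    position k in head h. Summing over h and q is an assumption, not
    the paper's stated aggregation rule.
    """
    scores = [0.0] * len(tokens)
    for head in attention:
        for query_row in head:
            for k, weight in enumerate(query_row):
                scores[k] += weight
    return scores


def select_seed_tokens(tokens, attention, top_k=2):
    """Keep the top_k highest-scoring tokens, in their original order."""
    scores = token_contributions(tokens, attention)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:top_k])]


def build_replay_prompt(seed_tokens):
    """Join seed tokens into a prompt for the pretraining LM (hypothetical format)."""
    return " ".join(seed_tokens)


# One causal attention head over a four-token CPT sample (invented numbers).
tokens = ["Alice", "was", "born", "in"]
head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0],
    [0.5, 0.1, 0.3, 0.1],
]
seeds = select_seed_tokens(tokens, [head], top_k=2)
prompt = build_replay_prompt(seeds)
```

In this toy sample the subject token receives most of the attention mass, so it ends up in the seed set. A real implementation would read attention weights from the model and would likely need to normalize for position, since earlier tokens can be attended to by more query positions under causal masking.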
Evaluation shows that STOC consistently achieves higher knowledge retention and better performance on new knowledge than existing generative replay methods (e.g., LAMOL), across synthetic biography tasks and real-world benchmarks such as ZSRE, Wiki_Bio, Wiki_Recent, and domain-specific corpora (including legal datasets). STOC's improvements remain stable as CPT is scaled up and hold on large, heterogeneous continual pretraining datasets.
Implications, Limitations, and Future Directions
From a practical standpoint, these findings advance the state of continual learning for LMs by providing algorithmic tools and theoretical guarantees for balancing the acquisition of new information with the preservation of prior knowledge—essential in safety-critical or domain-adaptive LM deployments. STOC’s transformer-aware design, rooted in mechanistic attention analysis, is broadly applicable and composable with other CL techniques such as layer freezing.
Theoretically, the explicit link between token-level statistics and attention allocation elucidates the distributed storage of factual knowledge, offering new paradigms for probing, interpreting, and editing LMs. The scaling of pretraining and CPT interventions (data augmentation, replay fraction) can now be better predicted and optimized in deployment.
Limitations include the reliance on a synthetic data framework for tight correspondence between theory and experiment, the reduced model depth required for a tractable analysis, and the absence of a formal extension to the multi-layer attention mechanisms and nonlinear activations present in large LMs. Generalizing to more expressive factual formats and to further architectural variants are promising directions for future research.
Conclusion
This work establishes a theory-driven foundation for continual factual knowledge acquisition in LMs. It demonstrates, both in theory and in practice, that standard regularization cannot avert catastrophic forgetting while replay-based strategies can, and it proposes the STOC algorithm for effective generative replay. Empirical studies show that STOC performs well across synthetic and real benchmarks. These results have substantive implications for the design and maintenance of continually evolving LMs and open new avenues for scalable, theory-informed continual learning in modern NLP systems.
For in-depth technical details and code, see (2605.10640).