How Do Large Language Models Acquire Factual Knowledge During Pretraining? (2406.11813v3)

Published 17 Jun 2024 in cs.CL

Abstract: Despite the recent observation that LLMs can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus.

LLMs and Factual Knowledge Acquisition

The paper "How Do LLMs Acquire Factual Knowledge During Pretraining?" explores the underexplored area of how LLMs assimilate factual information during their pretraining phase. It challenges existing assumptions and proposes novel insights drawn from empirical analysis.

Key Insights and Observations

  1. Data Scaling and Knowledge Acquisition: Contrary to expectations, the research finds that increasing the volume of pretraining data does not substantially enhance a model's ability to acquire and retain factual knowledge. This counters the common belief, tied to scaling laws, that performance improvements follow directly from larger datasets and models.
  2. Power-Law Forgetting: The paper identifies a power-law relationship governing the forgetting dynamics of memorized and generalized knowledge. Notably, LLMs tend to forget facts faster when trained with duplicated data. This insight emphasizes the nuanced balance between data repetitiveness and model memory retention.
  3. Batch Size Effects: Larger batch sizes are shown to improve the resilience of LLMs to knowledge forgetting. This finding aligns with the common practice of training with large batches and suggests tangible benefits for maintaining learned knowledge over time.
  4. Micro-Acquisition Dynamics: The acquisition process is characterized by incremental increases in the probability of a fact at each training step in which it appears. Unless the fact is reinforced through repeated exposure, the model progressively forgets it, which explains the slow accumulation and subsequent dilution of factual knowledge within LLM parameters (a toy illustration of this dynamic follows this list).
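
A toy simulation can make the acquire-then-forget picture concrete. The sketch below is not the paper's measurement pipeline or its fitted forgetting curve; it simply assumes that each exposure to a fact adds a fixed gain to that fact's score and that each gain then decays as a power law in the number of subsequent training steps, so frequently repeated facts hold their ground while rarely seen facts fade.

```python
import numpy as np

def fact_score(exposure_steps, total_steps, gain=1.0, decay_exponent=0.5):
    """Toy model: each exposure adds `gain` to the fact's score, and that
    contribution decays as (steps since exposure + 1) ** -decay_exponent.
    The functional form and parameter values are illustrative assumptions,
    not the paper's fitted forgetting model."""
    scores = np.zeros(total_steps)
    for step in range(total_steps):
        for exposure in exposure_steps:
            if exposure <= step:
                scores[step] += gain * (step - exposure + 1) ** (-decay_exponent)
    return scores

total_steps = 1_000
# A "popular" fact seen every 100 steps vs. a long-tail fact seen once.
frequent_fact = fact_score(exposure_steps=range(0, total_steps, 100), total_steps=total_steps)
rare_fact = fact_score(exposure_steps=[50], total_steps=total_steps)

print(f"frequent fact score at final step: {frequent_fact[-1]:.3f}")
print(f"rare fact score at final step:     {rare_fact[-1]:.3f}")
```

Under these assumptions, the frequently exposed fact ends training with a much higher score than the fact seen only once, mirroring the paper's explanation for why long-tail knowledge is acquired but then diluted by forgetting.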

Implications and Speculative Future Directions

The implications of these findings are multifaceted, impacting both theoretical understanding and practical strategies in LLM training:

  • Fact Popularity and Knowledge Retention: How often a fact appears in the pretraining data plays a critical role: frequently encountered facts are reinforced before they can be forgotten, which helps explain the observed weakness of LLMs on long-tail knowledge. Strategies focused on data diversity could therefore be key to maximizing knowledge retention across a broader factual spectrum.
  • Importance of Data Cleaning: The observed benefits of deduplication underscore the importance of data pre-processing. Reducing redundancy in the corpus helps models generalize facts rather than merely memorize them, enhancing their practical utility in real-world applications (a minimal deduplication sketch follows this list).
  • Model Design and Training Protocols: Insights into the differing impacts of model scaling and batch size adjustments offer actionable guidelines for designing efficient pretraining regimes. These findings may lead to more sophisticated training schedules that capitalize on batch size variations to optimize knowledge acquisition and retention.
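
To make the deduplication point concrete, the snippet below shows one minimal, generic way to drop exact-duplicate documents from a corpus by hashing normalized text. This is an illustrative sketch only: the mini-corpus, normalization, and function names are hypothetical, and real pretraining pipelines typically rely on more elaborate near-duplicate detection such as MinHash rather than exact matching.

```python
import hashlib

def deduplicate(documents):
    """Keep only the first occurrence of each exactly repeated document.
    The normalization (lowercasing, whitespace collapsing) is a simple
    illustrative choice, not the preprocessing used in the paper."""
    seen_digests = set()
    unique_docs = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            unique_docs.append(doc)
    return unique_docs

# Hypothetical mini-corpus for illustration.
corpus = [
    "Paris is the capital of France.",
    "paris is the capital of   France.",  # duplicate after normalization
    "The Eiffel Tower is in Paris.",
]
print(deduplicate(corpus))  # keeps two unique documents
```

Even this crude filter removes the repeated document in the example, and the paper's interpretation suggests that trimming such repeats slows forgetting and improves generalization of the remaining facts.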

Conclusions

This paper enriches the understanding of factual knowledge dynamics in LLM pretraining. The methodologies and results presented serve as a foundation for future research, inviting further exploration into optimizing LLM architectures and training strategies to enhance knowledge extraction and application. This endeavor represents a step towards refining LLMs for robust, versatile, and reliable deployment across diverse informational tasks.

Authors (7)
  1. Hoyeon Chang (5 papers)
  2. Jinho Park (13 papers)
  3. Seonghyeon Ye (25 papers)
  4. Sohee Yang (23 papers)
  5. Youngkyung Seo (2 papers)
  6. Du-Seong Chang (17 papers)
  7. Minjoon Seo (82 papers)
Citations (16)