
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training (2502.11196v2)

Published 16 Feb 2025 in cs.LG, cs.AI, cs.CL, cs.CV, and cs.HC

Abstract: Despite exceptional capabilities in knowledge-intensive tasks, LLMs face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.

Summary

  • The paper reveals that new knowledge is acquired more efficiently when it relates to existing knowledge, as shown by higher Hit@10 scores for circuits tied to relevant new knowledge than for completely new knowledge.
  • It details a biphasic evolution in knowledge circuits, with a rapid formation phase followed by an optimization phase that stabilizes the circuit topology.
  • The study identifies a hierarchical deep-to-shallow pattern in LLM layers, where mid-to-deeper layers initially extract and lower layers later refine knowledge representations.

This paper investigates how LLMs acquire new knowledge during continual pre-training, focusing on the evolution of "knowledge circuits." Knowledge circuits are defined as computational subgraphs within the LLM responsible for storing and processing specific pieces of knowledge. The research analyzes these circuits from performance, topology, and component perspectives across different model architectures (GPT-2, Llama, Phi) using synthetically generated factual knowledge.

Key Findings:

The paper reveals three primary insights into new knowledge acquisition in LLMs:

  1. Knowledge Relevance Principle: The efficiency of acquiring new knowledge is significantly influenced by its relationship to pre-existing knowledge within the model. LLMs integrate "relevant new knowledge" (extensions of existing concepts) more readily than "completely new knowledge" (information entirely novel to the model). Performance on knowledge acquisition tasks, measured by Hit@10, consistently shows that circuits for relevant knowledge ($K_\text{rel}$) outperform those for completely new knowledge ($K_\text{compl}$).
  2. Biphasic Circuit Evolution: The development of knowledge circuits during continual pre-training occurs in two distinct phases:
    • Formation Phase: Characterized by rapid structural changes as the model initially builds the necessary pathways for new knowledge. This phase sees a quick decrease in "knowledge circuit entropy," indicating a concentration of importance on a few critical edges.
    • Optimization Phase: Marked by a stabilization of the circuit's topology. Further performance gains in this phase come from refining the computations within the established components rather than significant structural alterations. The rate of entropy decrease slows down, and structural similarity to the final circuit configuration increases more slowly.
  3. Deep-to-Shallow Pattern: The evolution of components within knowledge circuits follows a specific hierarchical pattern:
    • Initially (during the formation phase), mid-to-deeper layers of the LLM develop the primary knowledge extraction functions. This is evidenced by an increase in "mover heads" (attention heads that transfer information about the subject) and activated edges in these layers, alongside a decrease in "relation heads" (attention heads focusing on relation tokens).
    • Subsequently (during the optimization phase), lower layers of the model focus on enriching the representations of knowledge. The topological structure stabilizes, and performance improvements are driven by these enriched representations. This is further supported by observing the "early decoding" phenomenon (where the target attribute can be decoded from intermediate layers) becoming prominent around the phase shift point.

Methodology:

  • Dataset Construction: Factual knowledge was synthesized as triples (subject, relation, attribute), e.g., (Name, birth date, YYYY-MM-DD). This allowed for controlled experiments with knowledge guaranteed to be new to the models. Datasets were constructed to differentiate between:
    • Relevant New Knowledge ($K_\text{rel}$): New attributes for known subjects (e.g., fictional details for real celebrity names).
    • Completely New Knowledge ($K_\text{compl}$): Entirely fictional entities and their attributes.
    • Knowledge frequency was modeled with an exponential distribution to simulate real-world long-tail distributions (a toy construction sketch appears after this list).
  • Model Training: Decoder-only LLMs (GPT-2 Small, GPT-2 Medium, TinyLlama, Phi-1.5) were continually pre-trained on this synthetic corpus using a standard next-token prediction objective.
  • Circuit Discovery: The Edge Attribution Patching with Integrated Gradients (EAP-IG) method was used to identify knowledge circuits. EAP-IG assigns an importance score to each edge in the model's computation graph. Circuits were formed by selecting the top-$N$ edges such that the resulting subgraph recovers over 70% of the whole model's performance on specific factual recall tasks. The importance score $S(e)$ for an edge $e=(u,v)$ is given by:

    $$S(e)=\left(z_u^{\prime}-z_u\right)\frac{1}{m}\sum_{k=1}^m \frac{\partial L\left(z^{\prime}+\frac{k}{m}\left(z-z^{\prime}\right)\right)}{\partial z_v}$$

    where $z_u$ and $z'_u$ are the clean and corrupted activations at node $u$, $L$ is the loss function, and $m$ is the number of integration steps; a toy implementation sketch of this score is given after this list.
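
The EAP-IG edge score above can be illustrated with a small, self-contained PyTorch toy. The quadratic `loss_fn`, the random activations, and the single-edge setup are assumptions made for illustration; they are not the authors' implementation, which scores every edge of a real transformer's computation graph.

```python
# Toy illustration of the EAP-IG edge score S(e); not the paper's implementation.
import torch

def eap_ig_edge_score(z_clean, z_corrupt, loss_fn, m=5):
    """Approximate S(e) for a single edge whose upstream activation is the whole tensor.

    z_clean / z_corrupt: clean and corrupted activations feeding the downstream node v.
    loss_fn: maps that activation tensor to a scalar loss L.
    """
    grads = []
    for k in range(1, m + 1):
        # Interpolate from the corrupted activation towards the clean one: z' + (k/m)(z - z').
        z_interp = (z_corrupt + (k / m) * (z_clean - z_corrupt)).detach().requires_grad_(True)
        loss_fn(z_interp).backward()
        grads.append(z_interp.grad.clone())
    avg_grad = torch.stack(grads).mean(dim=0)  # (1/m) * sum_k dL/dz_v
    # Multiply by the activation difference (z'_u - z_u) and reduce to a scalar score.
    return torch.sum((z_corrupt - z_clean) * avg_grad).item()

# Usage with random activations and a quadratic toy loss.
torch.manual_seed(0)
z, z_prime = torch.randn(8), torch.randn(8)
print(eap_ig_edge_score(z, z_prime, lambda x: (x ** 2).sum(), m=5))
```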
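Likewise, here is a minimal sketch of the synthetic triple corpus and long-tail frequency sampling described in the Dataset Construction item; the subject names, the single relation, and the frequency scaling are illustrative assumptions rather than the paper's actual data pipeline.

```python
# Hypothetical sketch of a synthetic (subject, relation, attribute) corpus with a
# long-tailed frequency distribution; all names and constants are illustrative.
import random

SUBJECTS_KNOWN = ["Ada Lovelace", "Alan Turing"]   # stand-ins for "relevant" new knowledge
SUBJECTS_NOVEL = ["Zorvan Quell", "Mira Taleth"]   # stand-ins for "completely new" knowledge
RELATION = "birth date"

def random_date(rng):
    return f"{rng.randint(1900, 1999)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"

def build_corpus(subjects, lam=1.0, max_repeat=50, seed=0):
    rng = random.Random(seed)
    corpus = []
    for subject in subjects:
        attribute = random_date(rng)
        # Exponential draw -> long-tailed repetition count for each fact.
        freq = max(1, min(max_repeat, round(rng.expovariate(lam) * 10)))
        corpus += [f"{subject}'s {RELATION} is {attribute}."] * freq
    rng.shuffle(corpus)
    return corpus

print(build_corpus(SUBJECTS_KNOWN + SUBJECTS_NOVEL)[:3])
```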

Analysis Details:

  • Performance Analysis: Measured using Hit@10, the proportion of test examples for which the target attribute's first token appears among the top 10 predicted tokens (a short computation sketch follows this list):

    $$\text{Hit@10}=\frac{1}{|D_\text{test}|}\sum_{(s,r,a)\in D_\text{test}}\mathbb{I}\left(\text{rank}_a \leq 10\right)$$
  • Topology Analysis:
    • Structural Consistency: Jaccard Similarity between edge/node sets of intermediate circuits and the final circuit.
    • Topological Centralization: Measured by "knowledge circuit entropy" $H(\mathcal{C})$:

      $$P(e)=\frac{S(e)}{\sum_{e^{\prime}\in E_\mathcal{C}} S(e^{\prime})}, \quad \forall e \in E_\mathcal{C}$$

      $$H(\mathcal{C})=-\sum_{e\in E_{\mathcal{C}}} P(e)\log P(e)$$

      A lower entropy indicates a more centralized topology, with importance concentrated on fewer edges (a short sketch of both topology metrics follows this list).

    • The paper also analyzed the performance of circuits whose topologies were fixed to specific checkpoints (Init, Before phase shift, After phase shift, Last) to demonstrate the impact of topological evolution.

  • Components Analysis:
    • Specialized Nodes: Tracked the proportion of specialized attention heads (mover heads, relation heads, mixture heads) within the circuits. Mover heads attend to subject tokens to extract attributes. Relation heads attend to relation tokens. Mixture heads combine both.
    • Activated Edges: Analyzed the layer-wise distribution of edge activation ratios (proportion of edges from a layer included in the circuit).
    • Changes in Vocabulary Space: Monitored the rank and probability of the target attribute token by unembedding outputs from intermediate layers (logit lens).
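
A minimal sketch of the Hit@10 metric from the Performance Analysis item, assuming the next-token logits at the attribute position and the first token id of each gold attribute are already available (both are assumptions about how the evaluation is set up):

```python
# Minimal Hit@10 sketch; `logits` and `target_ids` are assumed precomputed.
import torch

def hit_at_10(logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    """logits: [N, vocab] next-token logits at the position predicting the attribute;
    target_ids: [N] first token id of each gold attribute."""
    top10 = logits.topk(10, dim=-1).indices                 # [N, 10]
    hits = (top10 == target_ids.unsqueeze(-1)).any(dim=-1)  # [N] bool
    return hits.float().mean().item()

# Example with random logits over a GPT-2-sized vocabulary.
print(hit_at_10(torch.randn(4, 50257), torch.tensor([0, 1, 2, 3])))
```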
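The two topology measures (Jaccard similarity of edge sets and knowledge circuit entropy) can be sketched as follows; representing a circuit as a dictionary of non-negative edge scores is an assumption for illustration:

```python
# Sketch of the topology metrics; edge importance scores are assumed non-negative.
import math

def jaccard_similarity(edges_a, edges_b):
    """Structural consistency between two circuits' edge sets."""
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b)

def circuit_entropy(edge_scores):
    """H(C) over the normalized edge-importance distribution P(e)."""
    total = sum(edge_scores.values())
    probs = [s / total for s in edge_scores.values() if s > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical intermediate and final circuits (edge -> importance score).
edges_t = {("h0.3", "mlp1"): 0.6, ("mlp1", "h5.2"): 0.3, ("h5.2", "logits"): 0.1}
edges_final = {("h0.3", "mlp1"): 0.2, ("h2.1", "mlp3"): 0.5, ("h5.2", "logits"): 0.3}
print(jaccard_similarity(edges_t, edges_final), circuit_entropy(edges_t))
```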
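For the vocabulary-space (logit-lens / early-decoding) analysis, a minimal sketch with an off-the-shelf Hugging Face GPT-2 checkpoint is shown below; the prompt and target token are stand-ins, whereas the paper probes its own continually pre-trained models on prompts from the synthetic factual corpus:

```python
# Minimal logit-lens sketch for GPT-2; tracks the target token's rank per layer.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The capital of France is"     # stand-in factual prompt
target_id = tok.encode(" Paris")[0]     # first token of the target attribute

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Project each layer's last-position residual stream through the final LayerNorm and unembedding.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    rank = (logits[0] > logits[0, target_id]).sum().item() + 1
    print(f"layer {layer:2d}: target-token rank {rank}")
```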

Practical Implications:

The findings suggest potential avenues for improving continual pre-training:

  • Data Curriculums: The "Knowledge Relevance Principle" implies that structuring new data to relate to existing knowledge (e.g., by mimicking original corpus structure) could enhance learning efficiency.
  • Adaptive Training Strategies: The "Biphasic Circuit Evolution" suggests that the state of knowledge circuits could serve as an indicator for different training phases, potentially allowing for dynamic adjustments to training methods or data presentation based on whether the model is in a formation or optimization phase.
  • Long-Tail Knowledge Retention: The observation that weak performance on low-frequency knowledge may stem from insufficient representation rather than from circuit capacity limitations suggests that strategies such as knowledge augmentation or reactivation could improve retention.
  • Understanding Forgetting: An appendix section on forgetting analysis suggests that even after behavioral forgetting, LLMs might retain "knowledge circuit elasticity," meaning components can be reactivated, especially with data replay.

Limitations:

  • The paper focused on decoder-only Transformer LMs and did not include encoder-decoder or encoder-only models.
  • Analysis was limited to models up to 1.3B parameters due to computational constraints and circuit discovery method limitations, thus not covering larger models that might use techniques like Grouped Query Attention (GQA).
  • The impact of novel training techniques beyond standard next-token prediction for continual learning was not analyzed.

Overall, the paper provides a mechanistic interpretability perspective on how LLMs internalize new information, offering valuable insights into the dynamic structural changes occurring within the model during continual pre-training. This understanding could lead to more effective and efficient strategies for keeping LLMs updated with new knowledge.
