
Tracing Multilingual Factual Knowledge Acquisition in Pretraining (2505.14824v1)

Published 20 May 2025 in cs.CL

Abstract: LLMs are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts -- an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at https://github.com/cisnlp/multilingual-fact-tracing.



Summary

An Analytical Perspective on Multilingual Factual Knowledge Acquisition in LLMs

The paper "Tracing Multilingual Factual Knowledge Acquisition in Pretraining" investigates the development of multilingual factual recall and crosslingual consistency in LLMs during pretraining, using OLMo-7B as a case paper. It explores two primary pathways through which these models acquire multilingual knowledge: frequency-driven learning and crosslingual transfer. The paper presents a comprehensive investigation that reveals critical insights into the mechanisms and dynamics of factual knowledge absorption across languages.

Summary of Findings

The paper confirms that multilingual factual recall and crosslingual consistency evolve significantly throughout the pretraining process. It identifies the following key findings:

  1. Rapid Early Acquisition: The ability to recall factual knowledge across languages develops rapidly in the early stages of pretraining. For languages whose scripts or linguistic features resemble English, performance then reaches a relative plateau, improving only marginally with extended training.
  2. Frequency-Driven Learning: A strong positive correlation exists between fact frequency and recall accuracy. Higher-frequency facts in the pretraining corpus are recalled more reliably, regardless of language, reflecting a predominance of frequency-driven acquisition in multilingual contexts (a minimal sketch of this correlation follows this list).
  3. Crosslingual Transfer: Although frequency predominantly guides factual recall, crosslingual transfer from English appears to benefit certain low-frequency facts, especially those involving named entities. These instances are observed early in pretraining but are limited in scope and relation types.
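
To make the frequency analysis concrete, the sketch below computes a rank correlation between per-fact corpus frequency and recall outcome. The frequencies and outcomes are invented for illustration; the paper's own frequency estimation over the pretraining corpus is more involved.

```python
# Illustrative sketch: rank correlation between per-fact corpus frequency
# and recall outcome at a fixed checkpoint. All numbers are invented.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-fact corpus frequencies and recall outcomes
# (1 = the model produced the gold entity, 0 = it did not).
frequencies = np.array([3, 12, 55, 190, 800, 2400, 9100, 30000])
recalled = np.array([0, 0, 1, 0, 1, 1, 1, 1])

# Spearman correlation is rank-based, so it captures the monotone
# frequency-accuracy trend without assuming a particular functional form.
rho, pval = spearmanr(frequencies, recalled)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```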

Implications for Future Research

The paper’s exploration of the dynamics of factual knowledge acquisition in multilingual LLMs has significant implications for both theoretical investigation and practical application:

  • Pretraining Strategies: Insights regarding early-stage learning suggest potential optimizations to pretraining strategies, such as investing in diverse multilingual data earlier or adjusting training protocols based on the desired multilingual capabilities.
  • Model Design: Considering the impact of fact frequency and crosslingual transfer, researchers might explore embedding techniques or architectural adjustments that enhance systematic recall of factual knowledge while minimizing inconsistencies across languages.
  • Cross-Script Transfer: The limited nature of crosslingual transfer primarily in named entities highlights the need for novel approaches to improve factual recall for relations involving broader linguistic phenomena, especially in languages with different scripts.

Speculations on Future Directions in AI

As the field advances, particularly in LLM capabilities, future research may focus on expanding multilingual coverage with less reliance on English-centric paradigms. This includes developing mechanisms that more robustly facilitate crosslingual transfer across diverse scripts and language families, perhaps using advanced attention mechanisms or hybrid linguistic models that bridge typological gaps.

Additionally, the findings underscore the importance of leveraging cognitive insights for model training, such as mimicking bilingual humans' flexible recall abilities, which could guide the development of more adaptive and comprehensive models.

In conclusion, the paper provides a critical assessment of multilingual knowledge acquisition in LLMs, offering a foundation for targeted improvements in model training and architecture aimed at enhancing crosslingual consistency and performance across languages. The insights gained here lay the groundwork for more innovative and inclusive approaches in building AI systems capable of truly universal and effective communication.
