- The paper develops LMEnt, a comprehensive suite that annotates and indexes entity mentions from pretraining data, providing fine-grained traceability to model representations.
- It demonstrates that entity-based retrieval outperforms string-based methods, achieving up to 80.4% win rates and maintaining over 97% precision at deeper retrieval depths.
- Scaling studies reveal that larger LMEnt models improve recall for frequent entity pairs, emphasizing the role of co-occurrence frequency in knowledge acquisition dynamics.
LMEnt: A Suite for Analyzing Knowledge in LLMs from Pretraining Data to Representations
Introduction and Motivation
The LMEnt suite addresses a critical gap in the study of knowledge acquisition in LMs: the lack of fine-grained, entity-level traceability from pretraining data to model representations. Existing approaches for analyzing knowledge in LMs typically rely on post-hoc string-based retrieval, which cannot robustly map semantically equivalent information due to alias ambiguity and variability in phrasing. LMEnt introduces a comprehensive framework for annotating, indexing, and analyzing entity mentions in pretraining corpora, enabling precise tracking of knowledge acquisition and facilitating causal interventions.
Figure 1: LMEnt suite overview: entity annotation, entity-based retrieval index, and 12 models with traceable entity exposure across training steps.
Entity Annotation Pipeline
LMEnt leverages English Wikipedia as a knowledge-rich pretraining corpus, annotating each document with fine-grained entity mentions using three complementary sources: Wikipedia hyperlinks, entity linking (ReFinED), and coreference resolution (Maverick). This multi-source annotation enables disambiguation between entities with similar surface forms and captures both explicit and implicit references, including pronouns and descriptive phrases.
Figure 2: Disambiguation of entity mentions in the "Josh Allen" document, demonstrating explicit and implicit linking and coreference clustering.
The annotation pipeline is designed for scalability, utilizing 8 H100 GPUs and processing 3.6B tokens into 10.5M chunks, yielding 400M entity mentions across 7.3M entities. Each mention is mapped to a Wikidata QID and assigned confidence scores from the respective annotation sources, supporting flexible retrieval and filtering.
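To make the pipeline's output concrete, here is a minimal sketch of what one annotated chunk record might look like, with each mention carrying a Wikidata QID, a source tag, and a confidence score. The field names, placeholder QIDs, and helper function are illustrative assumptions, not the released LMEnt schema.

```python
# Hypothetical annotated-chunk record; field names and QID values are
# placeholders, not the released LMEnt schema.
chunk = {
    "chunk_id": "enwiki-00042-0007",
    "text": "Josh Allen was drafted by the Bills in 2018. He ...",
    "mentions": [
        # Each mention maps a character span to a Wikidata QID and
        # records which annotation source produced it, with a confidence.
        {"span": (0, 10),  "qid": "Q_PLAYER", "source": "hyperlink", "confidence": 1.00},
        {"span": (26, 31), "qid": "Q_TEAM",   "source": "refined",   "confidence": 0.93},
        {"span": (45, 47), "qid": "Q_PLAYER", "source": "coref",     "confidence": 0.88},
    ],
}

def mentions_of(chunk, qid, min_conf=0.9):
    """Filter a chunk's mentions by entity QID and per-source confidence."""
    return [m for m in chunk["mentions"]
            if m["qid"] == qid and m["confidence"] >= min_conf]
```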
Entity-Based Retrieval and Indexing
LMEnt constructs an Elasticsearch index of all pretraining chunks, each annotated with entity mentions and their QIDs. Retrieval is performed by matching on QID and source-specific confidence thresholds, enabling high-precision identification of all training steps where a given entity was observed. This approach overcomes the limitations of string-based retrieval, which suffers from low recall and high noise due to alias expansion and ambiguous surface forms.
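A retrieval call against such an index could look like the sketch below, using the official Elasticsearch Python client with a nested `mentions` field. The index name, field layout, and confidence threshold are assumptions for illustration, not the suite's actual configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve_chunks(qid, min_conf=0.9, size=100):
    """Return all indexed chunks mentioning the entity `qid` with
    source confidence at or above `min_conf` (assumed schema)."""
    query = {
        "nested": {
            "path": "mentions",  # assumes "mentions" is a nested field
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"mentions.qid": qid}},
                        {"range": {"mentions.confidence": {"gte": min_conf}}},
                    ]
                }
            },
        }
    }
    resp = es.search(index="lment-chunks", query=query, size=size)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```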
Pretrained Model Suite
The LMEnt suite includes 12 transformer models (OLMo-2 architecture) with 170M, 600M, and 1B parameters, each trained for 1, 2, 4, and 6 epochs on the annotated Wikipedia corpus. Intermediate checkpoints (every 1,000 steps) are released, providing granular visibility into knowledge acquisition dynamics. The 170M model is compute-optimal for the token budget, and all models are trained with a variable sequence-length curriculum to avoid spurious inter-document correlations.
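The released grid can be enumerated directly. The `LMEnt-{size}-{epochs}E` naming below follows the LMEnt-1B-6E convention used later in this summary, and the checkpoint helper simply reflects the stated 1,000-step interval; both are reconstructions, not the official release script.

```python
# Enumerate the 12-model grid (3 parameter sizes x 4 epoch budgets).
SIZES = ["170M", "600M", "1B"]
EPOCHS = [1, 2, 4, 6]

MODELS = [f"LMEnt-{size}-{epochs}E" for size in SIZES for epochs in EPOCHS]
assert len(MODELS) == 12  # e.g. "LMEnt-1B-6E"

def checkpoint_steps(total_steps, interval=1_000):
    """Intermediate checkpoints are released every 1,000 training steps."""
    return list(range(interval, total_steps + 1, interval))
```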
Empirical Evaluation: Knowledge Recall and Retrieval
Knowledge Recall Benchmarks
LMEnt models are evaluated on PopQA and PAQ, two entity-centric QA benchmarks. Despite being trained on only 0.03%–4.7% of the tokens used by comparable models, LMEnt achieves competitive performance: 7.4% accuracy across all PopQA entities and 66% on popular entities, closely matching Pythia-1.4B and OLMo-1B, and trailing OLMo-2-1B and SmolLM-1.7B primarily due to recall failures on rare facts.
Figure 3: LMEnt models match or exceed the compute efficiency of open-source baselines on popular PopQA entities.
Scaling model size improves recall for facts where subject and answer entities co-occur frequently, but has limited effect on tail facts.
Figure 4: Larger LMEnt models better learn associations for frequently co-occurring entity pairs.
Entity-Based vs. String-Based Retrieval
LMEnt's entity-based retrieval outperforms string-based methods (case-sensitive/insensitive, canonical/expanded aliases), achieving pairwise win rates of 66.7%–80.4%, with ablation studies showing that hyperlinks and entity linking are the most critical components.
Figure 5: LMEnt retrieval win rates against string-based methods and ablations; entity linking and hyperlinks are essential.
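As a reference for how such pairwise comparisons are typically scored, the sketch below computes a win rate from per-query quality scores of two retrievers. The tie-splitting convention is an assumption; the paper's exact protocol is not reproduced here.

```python
def win_rate(scores_a, scores_b):
    """Fraction of queries where retriever A beats retriever B.

    scores_a / scores_b: per-query quality scores (e.g. judged precision
    of retrieved chunks), aligned by query index.
    """
    assert len(scores_a) == len(scores_b)
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    # Split ties evenly between the two systems (one common convention).
    return (wins + 0.5 * ties) / len(scores_a)
```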
LMEnt retrieves more relevant chunks for torso and tail entities, which constitute 99.7% of Wikipedia entities, while string-based expanded variants suffer from excessive noise.
Figure 6: LMEnt achieves superior chunk coverage for rare entities compared to string-based baselines.
Retrieval precision remains above 97% for LMEnt even at greater depths, while string-based methods degrade to 84% and 27%.
Figure 7: LMEnt maintains high precision at all retrieval depths, unlike string-based approaches.
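The precision-at-depth metric in Figure 7 reduces to the standard precision@k computation sketched below, assuming binary relevance judgments (e.g. from the LLM judge described later in this summary).

```python
def precision_at_k(ranked_chunks, is_relevant, k):
    """Fraction of the top-k retrieved chunks judged relevant.

    ranked_chunks: chunks in retrieval order; is_relevant: a callable
    returning True/False per chunk (assumed judgment interface).
    """
    top_k = ranked_chunks[:k]
    return sum(is_relevant(c) for c in top_k) / max(len(top_k), 1)
```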
Entity co-occurrence in training data is a stronger predictor of model performance than pageview popularity, suggesting that knowledge acquisition is tightly coupled to the frequency of joint exposure.
Figure 8: Model accuracy correlates best with subject-answer chunk co-occurrence, not entity popularity.
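A simple way to probe this claim is to rank-correlate each candidate predictor with per-fact accuracy, as sketched below. The use of Spearman correlation and the variable names are illustrative assumptions, not the paper's exact analysis.

```python
# Compare two predictors of per-fact model accuracy: subject-answer
# chunk co-occurrence counts vs. pageview popularity (illustrative).
from scipy.stats import spearmanr

def compare_predictors(cooccurrence_counts, pageviews, accuracies):
    """Return the Spearman rank correlation of each predictor with accuracy."""
    rho_cooc, _ = spearmanr(cooccurrence_counts, accuracies)
    rho_pop, _ = spearmanr(pageviews, accuracies)
    return {"co-occurrence": rho_cooc, "popularity": rho_pop}
```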
Knowledge Acquisition Dynamics
LMEnt enables analysis of knowledge learning and forgetting across training checkpoints. Fact frequency (the number of chunks in which a subject-answer pair co-occurs) correlates with both learning and forgetting: notably, both rates increase with frequency, indicating complex plasticity in knowledge representations, though the underlying mechanisms remain unclear.
Figure 9: Both learning and forgetting rates increase with fact frequency across LMEnt-1B-6E checkpoints.
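Given per-checkpoint correctness for each fact, learning and forgetting rates like those in Figure 9 can be computed as transition frequencies between consecutive checkpoints. The boolean correctness matrix below is an assumed input format, not the paper's published code.

```python
import numpy as np

def learn_forget_rates(correct):
    """Per-fact learning/forgetting rates across consecutive checkpoints.

    correct: (n_facts, n_checkpoints) boolean array; True means the
    model answers the fact correctly at that checkpoint.
    """
    prev, curr = correct[:, :-1], correct[:, 1:]
    n_transitions = curr.shape[1]
    learn = (~prev & curr).sum(axis=1) / n_transitions   # wrong -> right
    forget = (prev & ~curr).sum(axis=1) / n_transitions  # right -> wrong
    return learn, forget
```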
Implementation and Practical Considerations
- Annotation Pipeline: Requires substantial GPU resources for coreference and entity linking; scalable to larger corpora with further optimization.
- Indexing: Elasticsearch index supports efficient QID-based retrieval and flexible filtering by source and confidence.
- Model Training: OLMo-2 backbone, variable sequence length curriculum, and compute-optimal sizing ensure efficient training and controlled analysis.
- Evaluation: LLM-as-a-judge (Gemini 2.5 Flash) is statistically validated for chunk precision assessment, enabling scalable evaluation of retrieval quality; a minimal judging sketch follows this list.
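A minimal judging call might look like the sketch below, using the `google-generativeai` Python client. The prompt wording, yes/no protocol, and model identifier string are assumptions, not the paper's validated setup.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # reader supplies their own key
judge = genai.GenerativeModel("gemini-2.5-flash")  # assumed model ID

def chunk_is_relevant(chunk_text: str, entity_name: str) -> bool:
    """Ask the judge whether a retrieved chunk actually refers to the entity."""
    prompt = (
        f"Does the following passage refer to the entity '{entity_name}'? "
        f"Answer strictly 'yes' or 'no'.\n\nPassage:\n{chunk_text}"
    )
    answer = judge.generate_content(prompt).text.strip().lower()
    return answer.startswith("yes")
```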
Implications and Future Directions
LMEnt provides a controlled, extensible testbed for studying knowledge representations, plasticity, editing, and attribution in LMs. The suite facilitates mechanistic interpretability by enabling causal interventions and precise tracking of entity exposure. Extensions to knowledge-poor corpora, larger model architectures (e.g., mixture-of-experts), and mid/post-training phases are straightforward, with entity linking alone providing robust annotation in the absence of hyperlinks.
Potential applications include:
- Knowledge Editing: Controlled injection and removal of facts during pretraining.
- Factuality Enhancement: Data ordering and explicit mention replacement to improve recall.
- Scaling Studies: Analysis of knowledge acquisition across model and data scale.
- Mechanistic Interpretability: Tracing latent circuits and representations linked to entity exposure.
Conclusion
LMEnt establishes a rigorous framework for analyzing the interplay between pretraining data and knowledge representations in LLMs. By combining fine-grained entity annotation, high-precision retrieval, and a suite of pretrained models with traceable entity exposure, LMEnt enables detailed study of knowledge acquisition dynamics and supports a wide range of research in LM interpretability, editing, and factuality. The open-source release of data, models, and checkpoints ensures reproducibility and extensibility for future work in the field.