In-Context Pretraining: Language Modeling Beyond Document Boundaries
The paper, In-Context Pretraining: Language Modeling Beyond Document Boundaries, introduces a method for pretraining LLMs that enhances their ability to understand and reason across document boundaries. Standard LLM pretraining concatenates randomly selected short documents to form input contexts; because the preceding documents carry no predictive signal for the ones that follow, the model spends compute attending across unrelated text without gaining a useful training signal. To address this limitation, the authors propose In-Context Pretraining, which fills each context with sequences of related documents, providing richer context and improving overall LLM performance.
Methodology
The proposed In-Context Pretraining approach hinges on ordering the training data so that consecutive documents within each input context are semantically related. The method has two primary components:
- Efficient Nearest Neighbor Search: To identify semantically related documents, each document is embedded with the Contriever model, and an approximate nearest neighbor (ANN) search links it to its most similar documents, producing a document graph whose edges encode semantic similarity (see the first sketch after this list).
- Document Graph Traversal: Documents are then ordered by traversing this graph, which the authors formulate as a maximum traveling salesman problem: visit every document exactly once along a path that maximizes the total similarity between consecutive documents, yielding semantically coherent input context windows (see the second sketch after this list).
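The first component can be illustrated with a minimal sketch. This is not the paper's exact pipeline: it assumes the facebook/contriever checkpoint from Hugging Face, FAISS for the similarity index, and an illustrative neighbor count k; at the scale of a full pretraining corpus, the paper relies on approximate (rather than exact) indexing.

```python
# Sketch: build a k-NN document graph from Contriever embeddings with FAISS.
# The model choice follows the paper; k and the index type are illustrative.
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def embed(docs: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states to get one vector per document."""
    batch = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (B, H)

def build_knn_graph(docs: list[str], k: int = 8):
    """Return, for each document, its k nearest neighbors and similarities."""
    vecs = embed(docs).numpy()
    faiss.normalize_L2(vecs)                  # cosine similarity via inner product
    index = faiss.IndexFlatIP(vecs.shape[1])  # exact search; a corpus-scale run
                                              # would use an approximate index
    index.add(vecs)
    sims, nbrs = index.search(vecs, k + 1)    # first hit is the document itself
    return sims[:, 1:], nbrs[:, 1:]
```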
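The traversal itself can be approximated greedily. The sketch below is a simplification of the paper's algorithm (which, for example, chooses restart nodes by minimum degree); it consumes the sims/nbrs arrays from the sketch above and at each step moves to the most similar unvisited neighbor.

```python
# Sketch: greedy path through the k-NN graph, visiting every document once
# while preferring the most similar unvisited neighbor. This is a simplified
# approximation of the max-TSP traversal described in the paper.
import numpy as np

def greedy_traversal(sims: np.ndarray, nbrs: np.ndarray) -> list[int]:
    n = len(nbrs)
    visited, order = set(), []
    cur = 0  # the paper picks start/restart nodes by minimum degree; we use 0
    while len(order) < n:
        visited.add(cur)
        order.append(cur)
        # Pick the most similar unvisited neighbor of the current document.
        nxt = None
        for sim, nbr in sorted(zip(sims[cur], nbrs[cur]), reverse=True):
            if nbr not in visited:
                nxt = int(nbr)
                break
        if nxt is None:
            # Dead end: jump to any unvisited document and continue the path.
            remaining = [i for i in range(n) if i not in visited]
            if not remaining:
                break
            nxt = remaining[0]
        cur = nxt
    return order
```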
Experimental Setup and Results
The authors pretrain LLMs ranging from 0.3 to 7 billion parameters on 300 billion tokens drawn from CommonCrawl. They evaluate the proposed method across tasks that measure different aspects of language modeling and contextual reasoning: standard language modeling, in-context learning, reading comprehension, retrieval augmentation, and handling of knowledge conflicts.
Key Findings:
- Language modeling: In-Context Pretraining consistently achieved lower perplexity on the Wikipedia, Arxiv, and Books evaluation sets (see Figure 1 of the paper), outperforming both standard pretraining and the k-NN baseline; a minimal sketch of the perplexity metric follows this list.
- In-Context Learning: Evaluations on seven text classification datasets showed an average improvement of 8%. This result underscores the model’s superior ability to leverage demonstration examples.
- Reading Comprehension: The method achieved an average gain of 15% across tasks such as RACE, SQuAD, and HotpotQA, demonstrating stronger complex contextual reasoning.
- Retrieval Augmentation: Performance on open-domain QA tasks improved by 9% when the model was augmented with retrieved external knowledge, indicating better grounding and reasoning over extended contexts.
- Factuality and Knowledge Conflicts: The proposed method outperformed baselines on knowledge-conflict datasets such as NQ-Swap and MemoTrap, highlighting improved faithfulness of generations to the context supplied in the prompt.
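For readers unfamiliar with the metric behind the language-modeling results: perplexity is the exponential of the mean token-level negative log-likelihood. A minimal sketch with Hugging Face transformers, using gpt2 purely as a placeholder model (the paper's checkpoints are not assumed here):

```python
# Sketch: perplexity = exp(mean token-level negative log-likelihood).
# The model name is a placeholder, not one of the paper's checkpoints.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean
        # cross-entropy over predicted next tokens.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())
```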
Implications and Future Directions
The implications of these results are substantial for both theoretical advancements and practical applications in artificial intelligence. The demonstrated improvements in understanding and reasoning across longer and more varied contexts suggest that LLMs trained with In-Context Pretraining could be markedly better at tasks requiring deep contextual comprehension, more accurate retrieval augmentation, and robust handling of factual consistency.
Future developments could explore the cross-linguistic applications of this algorithm by grouping related documents in multilingual corpora. Moreover, investigating the inherent connections within specific domains, such as code repositories or medical texts, could extend the relevance and applicability of this approach. Integrating this pretraining approach with multitask finetuning strategies could further enhance its effectiveness, particularly for instruction-based models.
In-Context Pretraining offers a promising and scalable direction that merges well with existing pretraining pipelines by altering the preprocessing steps. This straightforward yet impactful innovation paves the way for constructing more coherent and contextually aware LLMs that set the stage for advancements in understanding, generating, and reasoning over text within and beyond document boundaries.
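Concretely, the only pipeline change sits in preprocessing: once the traversal order is fixed, documents are concatenated into fixed-length training windows exactly as in standard pretraining. A hedged sketch, where the window size and end-of-sequence token id are illustrative assumptions:

```python
# Sketch: pack documents, in traversal order, into fixed-length training
# windows. This mirrors standard concatenation; only the ordering changes.
def pack_contexts(docs: list[str], order: list[int], tokenizer,
                  window: int = 8192, eos_id: int = 2) -> list[list[int]]:
    buf, windows = [], []
    for i in order:
        # eos_id separates documents; both it and `window` are assumptions.
        buf.extend(tokenizer.encode(docs[i]) + [eos_id])
        while len(buf) >= window:
            windows.append(buf[:window])
            buf = buf[window:]
    return windows
```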