In-context Pretraining: Language Modeling Beyond Document Boundaries (2310.10638v6)

Published 16 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where LLMs are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).

In-Context Pretraining: Language Modeling Beyond Document Boundaries

The paper, In-Context Pretraining: Language Modeling Beyond Document Boundaries, introduces a method for pretraining LLMs that strengthens their ability to understand and reason across document boundaries. Traditional pretraining pipelines concatenate randomly selected short documents to form input contexts; because these documents are unrelated, the preceding documents offer no predictive cue for the next one, so the long-context computation carries little useful pretraining signal. To address this limitation, the authors propose In-Context Pretraining, which builds each input context from a sequence of related documents, providing richer context and improving overall LLM performance.

Methodology

The In-Context Pretraining approach hinges on placing semantically related documents in the same input context. The method entails two primary components:

  1. Efficient Nearest Neighbor Search: Each document is embedded with the Contriever model, and approximate nearest neighbor (ANN) search over these embeddings links every document to its most similar neighbors, producing a document graph grouped by semantic similarity.
  2. Document Graph Traversal: A graph traversal algorithm, formulated as a maximum traveling salesman problem over the similarity-weighted document graph, orders the documents so that each input context window is semantically coherent while every document is visited exactly once and no data is repeated. A minimal sketch of this pipeline follows the list.
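
To make the method concrete, here is a minimal sketch of the document-sorting step under stated assumptions: document embeddings are presumed to be precomputed with a Contriever-style encoder, FAISS is used with an exact inner-product index (the paper works at billion-document scale with approximate search), and the maximum traveling salesman ordering is approximated by a simple greedy nearest-neighbor traversal. Function and variable names are illustrative, not taken from the authors' code.

```python
# Minimal sketch (not the authors' implementation): link each document to its
# most similar neighbors, then greedily order documents so that adjacent
# documents in the ordering are semantically related.
import numpy as np
import faiss  # pip install faiss-cpu


def order_documents(embeddings: np.ndarray, k: int = 10) -> list[int]:
    """Return a document ordering in which adjacent documents are similar."""
    emb = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(emb)                    # cosine similarity via inner product
    index = faiss.IndexFlatIP(emb.shape[1])    # exact search here; ANN at scale
    index.add(emb)
    sims, nbrs = index.search(emb, k + 1)      # column 0 is the document itself

    n = emb.shape[0]
    visited = np.zeros(n, dtype=bool)
    # Heuristic start: the document whose neighbors are least similar overall.
    current = int(np.argmin(sims[:, 1:].sum(axis=1)))
    order = []
    while len(order) < n:
        order.append(current)
        visited[current] = True
        # Greedy step: most similar unvisited neighbor, else jump to any unvisited doc.
        nxt = next((int(j) for j in nbrs[current, 1:] if not visited[j]), None)
        if nxt is None:
            remaining = np.flatnonzero(~visited)
            if remaining.size == 0:
                break
            nxt = int(remaining[0])
        current = nxt
    return order
```

The resulting order is then handed to the unchanged pretraining pipeline: documents are tokenized in this order and packed into fixed-length context windows, so each window tends to contain related documents.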

Experimental Setup and Results

The authors pretrain language models ranging from 0.3 to 7 billion parameters on 300 billion tokens from the CommonCrawl dataset. They evaluate the proposed method across tasks that measure different aspects of language modeling and contextual reasoning: standard language modeling, in-context learning, reading comprehension, retrieval augmentation, and handling of knowledge conflicts.
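
As a purely illustrative example of the standard language modeling evaluation, the snippet below computes perplexity over one packed context with Hugging Face Transformers; "gpt2" is a stand-in for the pretrained models, and the documents shown are placeholders rather than the authors' evaluation setup.

```python
# Illustrative perplexity computation over a packed context of related documents.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Documents sharing one context window (ordered by relatedness under
# In-Context Pretraining; ordered randomly under standard pretraining).
docs = [
    "The Hubble Space Telescope was launched into low Earth orbit in 1990.",
    "Hubble's successor, the James Webb Space Telescope, launched in 2021.",
]
input_ids = tokenizer("\n\n".join(docs), return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels makes the model return mean next-token cross-entropy.
    loss = model(input_ids, labels=input_ids).loss
print(f"perplexity = {torch.exp(loss).item():.2f}")
```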

Key Findings:

  • Language Modeling: In-Context Pretraining consistently achieved lower perplexity on the Wikipedia, arXiv, and Books domains (Figure 1 of the paper), outperforming both standard pretraining and the k-NN baseline.
  • In-Context Learning: Evaluations on seven text classification datasets showed an average improvement of 8%, underscoring the model's stronger ability to leverage demonstration examples (an illustrative k-shot prompt format is sketched after this list).
  • Reading Comprehension: The methodology achieved a 15% average gain across tasks like RACE, SQuAD, and HotpotQA, showcasing enhanced complex contextual reasoning.
  • Retrieval Augmentation: The model's performance on open-domain QA tasks improved by 9% when augmented with retrieved external knowledge, demonstrating better grounding and reasoning over extended contexts.
  • Factuality and Knowledge Conflicts: The proposed method outperformed baselines on knowledge conflict datasets like NQ-Swap and MemoTrap, highlighting improved generation fidelity to prior contexts.
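
To ground the in-context learning result above, the sketch below shows a generic k-shot classification prompt of the kind used in such evaluations; the template, labels, and examples are hypothetical placeholders, not the authors' exact evaluation format.

```python
# Hypothetical k-shot prompt for a sentiment-classification evaluation.
# The template and the examples are placeholders, not taken from the paper.
def build_kshot_prompt(demos: list[tuple[str, str]], test_input: str) -> str:
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(blocks)


prompt = build_kshot_prompt(
    demos=[
        ("A moving, beautifully acted film.", "positive"),
        ("Flat characters and a predictable plot.", "negative"),
    ],
    test_input="An inventive script carried by a strong lead performance.",
)
print(prompt)  # the model's continuation after "Sentiment:" is scored per label
```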

Implications and Future Directions

The implications of these results are substantial for both theoretical advancements and practical applications in artificial intelligence. The demonstrated improvements in understanding and reasoning across longer and more varied contexts suggest that LLMs trained with In-Context Pretraining could be substantially better at tasks requiring deep contextual comprehension, more accurate retrieval-augmentation, and robust handling of factual consistency.

Future developments could explore the cross-linguistic applications of this algorithm by grouping related documents in multilingual corpora. Moreover, investigating the inherent connections within specific domains, such as code repositories or medical texts, could extend the relevance and applicability of this approach. Integrating this pretraining approach with multitask finetuning strategies could further enhance its effectiveness, particularly for instruction-based models.

In-Context Pretraining offers a promising and scalable direction that integrates cleanly with existing pretraining pipelines, since it only changes the document-ordering step in preprocessing. This simple change paves the way for more coherent and contextually aware language models, setting the stage for advances in understanding, generating, and reasoning over text within and beyond document boundaries.

Authors (12)
  1. Weijia Shi (55 papers)
  2. Sewon Min (45 papers)
  3. Maria Lomeli (20 papers)
  4. Chunting Zhou (36 papers)
  5. Margaret Li (16 papers)
  6. Rich James (4 papers)
  7. Xi Victoria Lin (39 papers)
  8. Noah A. Smith (224 papers)
  9. Luke Zettlemoyer (225 papers)
  10. Scott Yih (6 papers)
  11. Mike Lewis (78 papers)
  12. Gergely Szilvasy (6 papers)