In-context Pretraining: Language Modeling Beyond Document Boundaries (2310.10638v6)

Published 16 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where LLMs are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).

In-Context Pretraining: Language Modeling Beyond Document Boundaries

The paper, In-Context Pretraining: Language Modeling Beyond Document Boundaries, introduces a method for pretraining LLMs that strengthens their ability to understand and reason across document boundaries. Traditional pretraining pipelines concatenate randomly selected short documents to form input contexts; because these documents are unrelated, the preceding documents offer no predictive cue for the next one, so the long-context computation carries little useful pretraining signal. To address this limitation, the authors propose In-Context Pretraining, which builds each input context from a sequence of related documents, providing richer context and improving overall LLM performance.

Methodology

The In-Context Pretraining approach hinges on placing semantically related documents in the same input context. The method entails two primary components:

  1. Efficient Nearest Neighbor Search: Each document is embedded with the Contriever model, and approximate nearest neighbor (ANN) search over these embeddings links every document to its most similar neighbors, producing a document graph grouped by semantic similarity.
  2. Document Graph Traversal: A graph traversal algorithm, formulated as a maximum traveling salesman problem over the similarity-weighted document graph, orders the documents so that each input context window is semantically coherent while every document is visited exactly once and no data is repeated. A minimal sketch of this pipeline follows the list.
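
To make the method concrete, here is a minimal sketch of the document-sorting step under stated assumptions: document embeddings are presumed to be precomputed with a Contriever-style encoder, FAISS is used with an exact inner-product index (the paper works at billion-document scale with approximate search), and the maximum traveling salesman ordering is approximated by a simple greedy nearest-neighbor traversal. Function and variable names are illustrative, not taken from the authors' code.

```python
# Minimal sketch (not the authors' implementation): link each document to its
# most similar neighbors, then greedily order documents so that adjacent
# documents in the ordering are semantically related.
import numpy as np
import faiss  # pip install faiss-cpu


def order_documents(embeddings: np.ndarray, k: int = 10) -> list[int]:
    """Return a document ordering in which adjacent documents are similar."""
    emb = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(emb)                    # cosine similarity via inner product
    index = faiss.IndexFlatIP(emb.shape[1])    # exact search here; ANN at scale
    index.add(emb)
    sims, nbrs = index.search(emb, k + 1)      # column 0 is the document itself

    n = emb.shape[0]
    visited = np.zeros(n, dtype=bool)
    # Heuristic start: the document whose neighbors are least similar overall.
    current = int(np.argmin(sims[:, 1:].sum(axis=1)))
    order = []
    while len(order) < n:
        order.append(current)
        visited[current] = True
        # Greedy step: most similar unvisited neighbor, else jump to any unvisited doc.
        nxt = next((int(j) for j in nbrs[current, 1:] if not visited[j]), None)
        if nxt is None:
            remaining = np.flatnonzero(~visited)
            if remaining.size == 0:
                break
            nxt = int(remaining[0])
        current = nxt
    return order
```

The resulting order is then handed to the unchanged pretraining pipeline: documents are tokenized in this order and packed into fixed-length context windows, so each window tends to contain related documents.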

Experimental Setup and Results

The authors pretrain language models ranging from 0.3 to 7 billion parameters on 300 billion tokens from the CommonCrawl dataset. They evaluate the proposed method across tasks that measure different aspects of language modeling and contextual reasoning: standard language modeling, in-context learning, reading comprehension, retrieval augmentation, and handling of knowledge conflicts.
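
As a purely illustrative example of the standard language modeling evaluation, the snippet below computes perplexity over one packed context with Hugging Face Transformers; "gpt2" is a stand-in for the pretrained models, and the documents shown are placeholders rather than the authors' evaluation setup.

```python
# Illustrative perplexity computation over a packed context of related documents.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Documents sharing one context window (ordered by relatedness under
# In-Context Pretraining; ordered randomly under standard pretraining).
docs = [
    "The Hubble Space Telescope was launched into low Earth orbit in 1990.",
    "Hubble's successor, the James Webb Space Telescope, launched in 2021.",
]
input_ids = tokenizer("\n\n".join(docs), return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels makes the model return mean next-token cross-entropy.
    loss = model(input_ids, labels=input_ids).loss
print(f"perplexity = {torch.exp(loss).item():.2f}")
```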

Key Findings:

  • Language Modeling: In-Context Pretraining consistently achieved lower perplexity on the Wikipedia, arXiv, and Books domains (Figure 1 of the paper), outperforming both standard pretraining and the k-NN baseline.
  • In-Context Learning: Evaluations on seven text classification datasets showed an average improvement of 8%, underscoring the model's stronger ability to leverage demonstration examples (an illustrative k-shot prompt format is sketched after this list).
  • Reading Comprehension: The methodology achieved a 15% average gain across tasks like RACE, SQuAD, and HotpotQA, showcasing enhanced complex contextual reasoning.
  • Retrieval Augmentation: The model's performance on open-domain QA tasks improved by 9% when augmented with retrieved external knowledge, demonstrating better grounding and reasoning over extended contexts.
  • Factuality and Knowledge Conflicts: The proposed method outperformed baselines on knowledge conflict datasets like NQ-Swap and MemoTrap, highlighting improved generation fidelity to prior contexts.
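
To ground the in-context learning result above, the sketch below shows a generic k-shot classification prompt of the kind used in such evaluations; the template, labels, and examples are hypothetical placeholders, not the authors' exact evaluation format.

```python
# Hypothetical k-shot prompt for a sentiment-classification evaluation.
# The template and the examples are placeholders, not taken from the paper.
def build_kshot_prompt(demos: list[tuple[str, str]], test_input: str) -> str:
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(blocks)


prompt = build_kshot_prompt(
    demos=[
        ("A moving, beautifully acted film.", "positive"),
        ("Flat characters and a predictable plot.", "negative"),
    ],
    test_input="An inventive script carried by a strong lead performance.",
)
print(prompt)  # the model's continuation after "Sentiment:" is scored per label
```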

Implications and Future Directions

The implications of these results are substantial for both theoretical advancements and practical applications in artificial intelligence. The demonstrated improvements in understanding and reasoning across longer and more varied contexts suggest that LLMs trained with In-Context Pretraining could be substantially better at tasks requiring deep contextual comprehension, more accurate retrieval-augmentation, and robust handling of factual consistency.

Future developments could explore the cross-linguistic applications of this algorithm by grouping related documents in multilingual corpora. Moreover, investigating the inherent connections within specific domains, such as code repositories or medical texts, could extend the relevance and applicability of this approach. Integrating this pretraining approach with multitask finetuning strategies could further enhance its effectiveness, particularly for instruction-based models.

In-Context Pretraining offers a promising and scalable direction that integrates cleanly with existing pretraining pipelines, since it only changes the document-ordering step in preprocessing. This simple change paves the way for more coherent and contextually aware language models, setting the stage for advances in understanding, generating, and reasoning over text within and beyond document boundaries.

Authors (12)
  1. Weijia Shi (55 papers)
  2. Sewon Min (45 papers)
  3. Maria Lomeli (20 papers)
  4. Chunting Zhou (36 papers)
  5. Margaret Li (16 papers)
  6. Rich James (4 papers)
  7. Xi Victoria Lin (39 papers)
  8. Noah A. Smith (224 papers)
  9. Luke Zettlemoyer (225 papers)
  10. Scott Yih (6 papers)
  11. Mike Lewis (78 papers)
  12. Gergely Szilvasy (6 papers)