Analysing The Impact of Sequence Composition on Language Model Pre-Training (2402.13991v1)

Published 21 Feb 2024 in cs.CL

Abstract: Most LLM pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context; this strategy is widely adopted due to its simplicity and efficiency. However, to this day, the influence of the pre-training sequence composition strategy on the generalisation properties of the model remains under-explored. In this work, we find that applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training, which negatively impacts the performance of the models on language modelling and downstream tasks. In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document, eliminating potential distracting information from previous documents and significantly improving performance. Furthermore, we find that concatenating related documents can reduce some potential distractions during pre-training, and our proposed efficient retrieval-based sequence construction method, BM25Chunk, can improve in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of LLMs without sacrificing efficiency.

Analyzing the Impact of Sequence Composition on LLM Pre-Training

This paper addresses the often-overlooked issue of sequence composition during LLM pre-training, focusing on how causal masking and document packing strategies affect model performance. While many LLMs concatenate documents into fixed-length sequences with causal masking for efficiency, the influence of this approach on generalization remains underexplored. The authors show that this widely adopted strategy can inadvertently introduce distracting information from unrelated documents, negatively impacting language modeling and downstream task performance.
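
To make the critique concrete, the sketch below (in Python, with illustrative names and token IDs rather than the paper's code) shows how a typical packing pipeline concatenates tokenized documents and slices the stream into fixed-length training sequences, so a single sequence can span several unrelated documents.

```python
# Minimal sketch of the standard packing strategy discussed above:
# tokenized documents are concatenated (separated by an EOS token) and
# cut into fixed-length chunks, so one chunk may mix unrelated documents.
from typing import Iterable


def pack_documents(token_docs: Iterable[list[int]], seq_len: int, eos_id: int) -> list[list[int]]:
    """Concatenate tokenized documents and slice into fixed-length sequences."""
    stream: list[int] = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // seq_len  # drop the trailing partial chunk
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
    print(pack_documents(docs, seq_len=4, eos_id=0))
    # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]] -- the second chunk mixes two documents
```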

The authors propose intra-document causal masking to mitigate this distraction: the likelihood of each token is conditioned only on previous tokens within the same document. This contrasts with standard causal masking, which conditions each token on all preceding tokens in the sequence, irrespective of document boundaries. The results show that intra-document causal masking significantly improves modeling performance while increasing training runtime by only about 4%.
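
A minimal sketch of how such a mask could be built is given below, assuming document boundaries are marked by an EOS token; the paper's actual implementation (for example, how the mask is fused into the attention kernel) may differ.

```python
# Hedged sketch of intra-document causal masking: on top of the usual
# lower-triangular causal mask, attention across document boundaries
# (inferred from EOS positions) is also blocked.
import torch


def intra_document_causal_mask(tokens: torch.Tensor, eos_id: int) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask where True means attention is allowed."""
    seq_len = tokens.size(0)
    # Give each position a document id that increments after every EOS token,
    # so the EOS itself stays with the document it terminates.
    boundaries = (tokens == eos_id).long()
    doc_ids = torch.cumsum(boundaries, dim=0)
    doc_ids = torch.cat([torch.zeros(1, dtype=torch.long, device=tokens.device), doc_ids[:-1]])
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc


if __name__ == "__main__":
    toks = torch.tensor([5, 6, 0, 8, 9])  # two documents separated by EOS (id 0)
    print(intra_document_causal_mask(toks, eos_id=0).int())
```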

The paper further explores how relevant the surrounding document context in a pre-training sequence is, comparing three packing strategies: Mix (random document sampling), Uni (documents drawn from a single source), and BM25 (retrieval-based packing of related documents). Notably, the BM25 method, which uses efficient retrieval to build more contextually coherent sequences, showed marked improvements across model capabilities: in-context learning (up to 11.6%), knowledge memorization (9.8%), and context utilization (7.2%).
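
As a rough illustration of the retrieval-based idea (not the paper's BM25Chunk implementation), the sketch below greedily groups documents by BM25 similarity before packing; it relies on the third-party rank_bm25 package, and the function name and token budget are hypothetical.

```python
# Rough sketch of retrieval-based packing in the spirit of BM25Chunk:
# start from a seed document and greedily add the most BM25-similar
# unused documents until a token budget is reached.
import numpy as np
from rank_bm25 import BM25Okapi


def bm25_pack(docs: list[str], budget_tokens: int) -> list[list[int]]:
    """Group document indices into packs of related documents."""
    tokenized = [d.split() for d in docs]  # naive whitespace tokenization for the sketch
    bm25 = BM25Okapi(tokenized)
    unused = set(range(len(docs)))
    packs: list[list[int]] = []
    while unused:
        seed = unused.pop()
        pack, used_tokens = [seed], len(tokenized[seed])
        scores = bm25.get_scores(tokenized[seed])  # similarity of every doc to the seed
        for cand in np.argsort(scores)[::-1]:  # most similar first
            cand = int(cand)
            if cand in unused and used_tokens + len(tokenized[cand]) <= budget_tokens:
                pack.append(cand)
                unused.discard(cand)
                used_tokens += len(tokenized[cand])
        packs.append(pack)
    return packs
```

Each pack can then be concatenated and sliced into fixed-length sequences in the same way as the standard pipeline, so related content is more likely to share a context window.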

Quantitative analysis showed that when causal masking ignores document boundaries, irrelevant information from earlier documents is more likely to interfere with learning and degrade downstream task performance. This supports the broader point that increasing document relatedness within sequences helps models focus on pertinent context and mitigates such distractions.

The implications of this research are both practical and theoretical. From a practical perspective, these insights can inform the design of more efficient pre-training pipelines, potentially impacting how LLMs are developed and optimized. Theoretically, the paper raises questions about the relationship between sequence composition, context robustness, and model generalization capabilities, suggesting avenues for further investigation into the nuanced interplay between dataset structure and learning outcomes.

In summary, this paper elucidates the impact of sequence composition strategies on LLM pre-training. By proposing and validating intra-document causal masking and retrieval-based sequence construction, it offers practical, low-overhead alternatives to standard packing practices and prompts a reevaluation of how pre-training sequences are constructed. The paper encourages future research into optimizing context relevance in training data to enhance model understanding and performance across diverse tasks.

Authors (8)
  1. Yu Zhao (207 papers)
  2. Yuanbin Qu (1 paper)
  3. Konrad Staniszewski (6 papers)
  4. Szymon Tworkowski (7 papers)
  5. Wei Liu (1135 papers)
  6. Yuxiang Wu (27 papers)
  7. Pasquale Minervini (88 papers)
  8. Piotr Miłoś (52 papers)