Analysing The Impact of Sequence Composition on Language Model Pre-Training (2402.13991v1)

Published 21 Feb 2024 in cs.CL

Abstract: Most LLM pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context; this strategy is widely adopted due to its simplicity and efficiency. However, to this day, the influence of the pre-training sequence composition strategy on the generalisation properties of the model remains under-explored. In this work, we find that applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training, which negatively impacts the performance of the models on language modelling and downstream tasks. In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document, eliminating potential distracting information from previous documents and significantly improving performance. Furthermore, we find that concatenating related documents can reduce some potential distractions during pre-training, and our proposed efficient retrieval-based sequence construction method, BM25Chunk, can improve in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of LLMs without sacrificing efficiency.

Analyzing the Impact of Sequence Composition on LLM Pre-Training

This paper addresses the often-overlooked issue of sequence composition during LLM pre-training, focusing on how causal masking and document packing strategies affect model performance. While many LLMs concatenate documents into fixed-length sequences with causal masking for efficiency, the influence of this approach on generalization remains underexplored. The authors show that this widely adopted strategy can inadvertently introduce distracting information from unrelated documents, negatively impacting language modeling and downstream task performance.
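
To make the critique concrete, the sketch below (in Python, with illustrative names and token IDs rather than the paper's code) shows how a typical packing pipeline concatenates tokenized documents and slices the stream into fixed-length training sequences, so a single sequence can span several unrelated documents.

```python
# Minimal sketch of the standard packing strategy discussed above:
# tokenized documents are concatenated (separated by an EOS token) and
# cut into fixed-length chunks, so one chunk may mix unrelated documents.
from typing import Iterable


def pack_documents(token_docs: Iterable[list[int]], seq_len: int, eos_id: int) -> list[list[int]]:
    """Concatenate tokenized documents and slice into fixed-length sequences."""
    stream: list[int] = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // seq_len  # drop the trailing partial chunk
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
    print(pack_documents(docs, seq_len=4, eos_id=0))
    # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]] -- the second chunk mixes two documents
```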

The authors propose intra-document causal masking to mitigate this distraction: the likelihood of each token is conditioned only on previous tokens within the same document. This contrasts with standard causal masking, which conditions each token on all preceding tokens in the sequence, irrespective of document boundaries. The results show that intra-document causal masking significantly improves modeling performance while increasing training runtime by only about 4%.
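
A minimal sketch of how such a mask could be built is given below, assuming document boundaries are marked by an EOS token; the paper's actual implementation (for example, how the mask is fused into the attention kernel) may differ.

```python
# Hedged sketch of intra-document causal masking: on top of the usual
# lower-triangular causal mask, attention across document boundaries
# (inferred from EOS positions) is also blocked.
import torch


def intra_document_causal_mask(tokens: torch.Tensor, eos_id: int) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask where True means attention is allowed."""
    seq_len = tokens.size(0)
    # Give each position a document id that increments after every EOS token,
    # so the EOS itself stays with the document it terminates.
    boundaries = (tokens == eos_id).long()
    doc_ids = torch.cumsum(boundaries, dim=0)
    doc_ids = torch.cat([torch.zeros(1, dtype=torch.long, device=tokens.device), doc_ids[:-1]])
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc


if __name__ == "__main__":
    toks = torch.tensor([5, 6, 0, 8, 9])  # two documents separated by EOS (id 0)
    print(intra_document_causal_mask(toks, eos_id=0).int())
```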

The paper further explores how relevant the surrounding document context in a pre-training sequence is, comparing three packing strategies: Mix (random document sampling), Uni (documents drawn from a single source), and BM25 (retrieval-based packing of related documents). Notably, the BM25 method, which uses efficient retrieval to build more contextually coherent sequences, showed marked improvements across model capabilities: in-context learning (up to 11.6%), knowledge memorization (9.8%), and context utilization (7.2%).
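
As a rough illustration of the retrieval-based idea (not the paper's BM25Chunk implementation), the sketch below greedily groups documents by BM25 similarity before packing; it relies on the third-party rank_bm25 package, and the function name and token budget are hypothetical.

```python
# Rough sketch of retrieval-based packing in the spirit of BM25Chunk:
# start from a seed document and greedily add the most BM25-similar
# unused documents until a token budget is reached.
import numpy as np
from rank_bm25 import BM25Okapi


def bm25_pack(docs: list[str], budget_tokens: int) -> list[list[int]]:
    """Group document indices into packs of related documents."""
    tokenized = [d.split() for d in docs]  # naive whitespace tokenization for the sketch
    bm25 = BM25Okapi(tokenized)
    unused = set(range(len(docs)))
    packs: list[list[int]] = []
    while unused:
        seed = unused.pop()
        pack, used_tokens = [seed], len(tokenized[seed])
        scores = bm25.get_scores(tokenized[seed])  # similarity of every doc to the seed
        for cand in np.argsort(scores)[::-1]:  # most similar first
            cand = int(cand)
            if cand in unused and used_tokens + len(tokenized[cand]) <= budget_tokens:
                pack.append(cand)
                unused.discard(cand)
                used_tokens += len(tokenized[cand])
        packs.append(pack)
    return packs
```

Each pack can then be concatenated and sliced into fixed-length sequences in the same way as the standard pipeline, so related content is more likely to share a context window.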

Quantitative analysis showed that when causal masking ignores document boundaries, irrelevant information from earlier documents is more likely to interfere with learning and degrade downstream task performance. This supports the broader point that increasing document relatedness within sequences helps models focus on pertinent context and mitigates such distractions.

The implications of this research are both practical and theoretical. From a practical perspective, these insights can inform the design of more efficient pre-training pipelines, potentially impacting how LLMs are developed and optimized. Theoretically, the paper raises questions about the relationship between sequence composition, context robustness, and model generalization capabilities, suggesting avenues for further investigation into the nuanced interplay between dataset structure and learning outcomes.

In summary, this paper elucidates the impact of sequence composition strategies on LLM pre-training. By proposing and validating intra-document causal masking and retrieval-based sequence construction, it offers practical, low-overhead alternatives to standard packing practices and prompts a reevaluation of how pre-training sequences are constructed. The paper encourages future research into optimizing context relevance in training data to enhance model understanding and performance across diverse tasks.

Authors (8)
  1. Yu Zhao (207 papers)
  2. Yuanbin Qu (1 paper)
  3. Konrad Staniszewski (6 papers)
  4. Szymon Tworkowski (7 papers)
  5. Wei Liu (1135 papers)
  6. Yuxiang Wu (27 papers)
  7. Pasquale Minervini (88 papers)
  8. Piotr Miłoś (52 papers)