Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum (2405.13226v1)

Published 21 May 2024 in cs.CL and cs.LG

Abstract: LLMs are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal nor computationally efficient. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch size, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a penalty proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy 3x faster compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet less studied aspect of training LLMs: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.

Dataset Decomposition: Enhancing LLM Training through Variable Sequence Length Curriculum

This paper introduces a significant improvement in the efficiency and effectiveness of training LLMs by proposing a novel technique called Dataset Decomposition (DD). The motivation for this research stems from the established but suboptimal practice of preparing fixed-length token sequences for LLM training. The conventional approach, termed "concat-and-chunk," involves random concatenation of documents followed by chunking into specific sequence lengths. This can inadvertently lead to cross-document attention and increased computational costs owing to the quadratic complexity of attention mechanisms.
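
A rough sketch may help make this baseline concrete; the function below is illustrative and not taken from the paper's code (names and types are assumptions):

```python
from typing import Iterable

def concat_and_chunk(docs: Iterable[list[int]], seq_len: int) -> list[list[int]]:
    """Baseline data prep: concatenate tokenized documents into one token
    stream, then slice the stream into fixed-length training sequences.
    Sequences can therefore span document boundaries, which is what makes
    cross-document attention possible in the first place."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
    # Drop the trailing remainder that does not fill a complete sequence.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```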

The paper's central contribution is the introduction of Dataset Decomposition, combined with Variable Sequence Length (VSL) training. Dataset Decomposition involves reorganizing a dataset into a collection of buckets, each containing sequences of a fixed length—these sequences are derived from unique documents, thereby eliminating unnecessary cross-document attention. The method leverages this decomposition to conduct training using variable sequence lengths and batch sizes, selected through a curriculum.
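
A minimal sketch of such a decomposition is given below, assuming power-of-two bucket lengths and greedy splitting from the longest bucket down; the bucket range, the handling of short remainders, and all names are illustrative assumptions rather than the paper's released implementation:

```python
from collections import defaultdict

def decompose_dataset(docs, min_log2=8, max_log2=13):
    """Split each tokenized document into chunks whose lengths are powers of
    two and place each chunk into the bucket matching its length. Every
    resulting sequence comes from a single document, so no cross-document
    attention arises during training."""
    buckets = defaultdict(list)  # key i holds sequences of exactly 2**i tokens
    for doc in docs:
        pos, remaining = 0, len(doc)
        for i in range(max_log2, min_log2 - 1, -1):
            size = 1 << i
            while remaining >= size:
                buckets[i].append(doc[pos:pos + size])
                pos += size
                remaining -= size
        # Remainders shorter than 2**min_log2 tokens are simply dropped here.
    return buckets
```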

A key highlight is the empirical demonstration that the DD approach allows training an 8k context-length 1B model at the same cost as a 2k context-length model using the baseline method. Moreover, the proposed approach achieves target accuracy approximately three times faster than the baseline when evaluated on standard language tasks and long-context benchmarks. This acceleration in reaching accuracy targets underscores both data and training efficiency, suggesting potential reductions in computational resource consumption that are beneficial for scaling LLMs.
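
The intuition behind this cost claim can be seen with a back-of-the-envelope estimate: per-token attention work scales roughly linearly with the length of the sequence being attended over, so a mixture dominated by short, single-document sequences is much cheaper than uniformly long concatenated ones. The bucket fractions below are invented purely for illustration; the actual length distribution of the web corpus is reported in the paper:

```python
# Hypothetical fraction of training tokens falling into each bucket length.
bucket_token_fraction = {256: 0.15, 512: 0.20, 1024: 0.25, 2048: 0.20,
                         4096: 0.12, 8192: 0.08}

# Attention work per token is roughly proportional to the sequence length it attends over.
vsl_cost = sum(frac * length for length, frac in bucket_token_fraction.items())
baseline_cost = 8192  # every token in an 8k concat-and-chunk sequence pays the full price

print(f"relative attention cost (VSL / 8k concat-and-chunk): {vsl_cost / baseline_cost:.2f}")
```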

The paper also addresses the often-overlooked aspect of sequence length distribution. By utilizing sequence length as prior knowledge, the authors demonstrate that optimizing sequence mixtures and curricula leads to varying performance impacts on different natural language and long-context tasks.
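
One plausible way to realize such a curriculum is sketched below: the probability of drawing a given bucket shifts from short toward long sequence lengths as training progresses, while the batch size is adjusted so the number of tokens per optimization step stays constant. The linear schedule and weighting here are assumptions made for illustration, not the paper's exact curriculum:

```python
import random

def sample_bucket(step, total_steps, bucket_lengths, tokens_per_step=2**19):
    """Pick a sequence-length bucket for this training step, biased toward
    short buckets early in training and long buckets late, then size the
    batch so every step sees the same number of tokens."""
    progress = step / total_steps
    lengths = sorted(bucket_lengths)
    denom = max(1, len(lengths) - 1)
    ranks = {L: r / denom for r, L in enumerate(lengths)}  # 0 = shortest, 1 = longest
    weights = [max(1e-3, 1.0 - abs(ranks[L] - progress)) for L in lengths]
    seq_len = random.choices(lengths, weights=weights, k=1)[0]
    batch_size = tokens_per_step // seq_len
    return seq_len, batch_size
```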

The results show robust improvements in accuracy and training speed on a large-scale corpus of over 137 billion tokens. Applying the proposed DD and VSL strategies across multiple model sizes further confirms their scalability and effectiveness.

One of the paper's distinctive analytical aspects is the examination of sequence length bias. The investigation reveals that the alignment between pretraining sequence lengths and the evaluation tasks' requirements plays a crucial role in optimizing performance. This insight invites further exploration into refining data mixtures tailored to target tasks, underscoring an approach that balances efficiency against complexity.

While Dataset Decomposition marks a substantial advance in LLM training, the paper acknowledges that the technique's benefits are most pronounced when training with extended sequence lengths. Where sequences are short enough that attention is not a dominant cost, the direct computational savings from DD are correspondingly smaller.

In conclusion, the paper outlines a methodologically sound and practically significant approach to overcoming limitations in traditional LLM training pipelines. By eliminating unnecessary computation and accelerating training, Dataset Decomposition offers a pathway toward more efficient resource utilization in LLM development. Researchers may further explore the approach's implications for varied language tasks, expanding the scope of LLM applications. Building on this groundwork, future work could investigate broader applications of curriculum-based training and extend these principles to other machine learning modalities.

Authors (7)
  1. Hadi Pouransari (32 papers)
  2. Chun-Liang Li (60 papers)
  3. Jen-Hao Rick Chang (18 papers)
  4. Pavan Kumar Anasosalu Vasu (11 papers)
  5. Cem Koc (3 papers)
  6. Vaishaal Shankar (31 papers)
  7. Oncel Tuzel (62 papers)