Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum (2405.13226v1)

Published 21 May 2024 in cs.CL and cs.LG

Abstract: LLMs are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal nor computationally efficient. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch size, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a penalty proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy 3x faster compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet less studied aspect of training LLMs: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.

Dataset Decomposition: Enhancing LLM Training through Variable Sequence Length Curriculum

This paper introduces a significant improvement in the efficiency and effectiveness of training LLMs by proposing a novel technique called Dataset Decomposition (DD). The motivation for this research stems from the established but suboptimal practice of preparing fixed-length token sequences for LLM training. The conventional approach, termed "concat-and-chunk," involves random concatenation of documents followed by chunking into specific sequence lengths. This can inadvertently lead to cross-document attention and increased computational costs owing to the quadratic complexity of attention mechanisms.
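
A rough sketch may help make this baseline concrete; the function below is illustrative and not taken from the paper's code (names and types are assumptions):

```python
from typing import Iterable

def concat_and_chunk(docs: Iterable[list[int]], seq_len: int) -> list[list[int]]:
    """Baseline data prep: concatenate tokenized documents into one token
    stream, then slice the stream into fixed-length training sequences.
    Sequences can therefore span document boundaries, which is what makes
    cross-document attention possible in the first place."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
    # Drop the trailing remainder that does not fill a complete sequence.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```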

The paper's central contribution is the introduction of Dataset Decomposition, combined with Variable Sequence Length (VSL) training. Dataset Decomposition involves reorganizing a dataset into a collection of buckets, each containing sequences of a fixed length—these sequences are derived from unique documents, thereby eliminating unnecessary cross-document attention. The method leverages this decomposition to conduct training using variable sequence lengths and batch sizes, selected through a curriculum.
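
A minimal sketch of such a decomposition is given below, assuming power-of-two bucket lengths and greedy splitting from the longest bucket down; the bucket range, the handling of short remainders, and all names are illustrative assumptions rather than the paper's released implementation:

```python
from collections import defaultdict

def decompose_dataset(docs, min_log2=8, max_log2=13):
    """Split each tokenized document into chunks whose lengths are powers of
    two and place each chunk into the bucket matching its length. Every
    resulting sequence comes from a single document, so no cross-document
    attention arises during training."""
    buckets = defaultdict(list)  # key i holds sequences of exactly 2**i tokens
    for doc in docs:
        pos, remaining = 0, len(doc)
        for i in range(max_log2, min_log2 - 1, -1):
            size = 1 << i
            while remaining >= size:
                buckets[i].append(doc[pos:pos + size])
                pos += size
                remaining -= size
        # Remainders shorter than 2**min_log2 tokens are simply dropped here.
    return buckets
```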

A key highlight is the empirical demonstration that the DD approach allows training an 8k context-length 1B model at the same cost as a 2k context-length model using the baseline method. Moreover, the proposed approach achieves target accuracy approximately three times faster than the baseline when evaluated on standard language tasks and long-context benchmarks. This acceleration in reaching accuracy targets underscores both data and training efficiency, suggesting potential reductions in computational resource consumption that are beneficial for scaling LLMs.
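
The intuition behind this cost claim can be seen with a back-of-the-envelope estimate: per-token attention work scales roughly linearly with the length of the sequence being attended over, so a mixture dominated by short, single-document sequences is much cheaper than uniformly long concatenated ones. The bucket fractions below are invented purely for illustration; the actual length distribution of the web corpus is reported in the paper:

```python
# Hypothetical fraction of training tokens falling into each bucket length.
bucket_token_fraction = {256: 0.15, 512: 0.20, 1024: 0.25, 2048: 0.20,
                         4096: 0.12, 8192: 0.08}

# Attention work per token is roughly proportional to the sequence length it attends over.
vsl_cost = sum(frac * length for length, frac in bucket_token_fraction.items())
baseline_cost = 8192  # every token in an 8k concat-and-chunk sequence pays the full price

print(f"relative attention cost (VSL / 8k concat-and-chunk): {vsl_cost / baseline_cost:.2f}")
```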

The paper also addresses the often-overlooked aspect of sequence length distribution. By utilizing sequence length as prior knowledge, the authors demonstrate that optimizing sequence mixtures and curricula leads to varying performance impacts on different natural language and long-context tasks.
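
One plausible way to realize such a curriculum is sketched below: the probability of drawing a given bucket shifts from short toward long sequence lengths as training progresses, while the batch size is adjusted so the number of tokens per optimization step stays constant. The linear schedule and weighting here are assumptions made for illustration, not the paper's exact curriculum:

```python
import random

def sample_bucket(step, total_steps, bucket_lengths, tokens_per_step=2**19):
    """Pick a sequence-length bucket for this training step, biased toward
    short buckets early in training and long buckets late, then size the
    batch so every step sees the same number of tokens."""
    progress = step / total_steps
    lengths = sorted(bucket_lengths)
    denom = max(1, len(lengths) - 1)
    ranks = {L: r / denom for r, L in enumerate(lengths)}  # 0 = shortest, 1 = longest
    weights = [max(1e-3, 1.0 - abs(ranks[L] - progress)) for L in lengths]
    seq_len = random.choices(lengths, weights=weights, k=1)[0]
    batch_size = tokens_per_step // seq_len
    return seq_len, batch_size
```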

The results show robust improvements in accuracy and training speed on a large-scale corpus of over 137 billion tokens. Applying the proposed DD and VSL strategies across multiple model sizes further confirms their scalability and effectiveness.

One of the paper's distinctive analytical aspects is the examination of sequence length bias. The investigation reveals that the alignment between pretraining sequence lengths and the evaluation tasks' requirements plays a crucial role in optimizing performance. This insight invites further exploration into refining data mixtures tailored to target tasks, underscoring an approach that balances efficiency against complexity.

While Dataset Decomposition marks a substantial advance in LLM training, the paper acknowledges that the technique's benefits are most pronounced when training with extended sequence lengths. Where sequences are short enough that attention is not a dominant cost, the direct computational savings from DD are correspondingly smaller.

In conclusion, the paper outlines a methodologically sound and practically significant approach to overcoming limitations in traditional LLM training pipelines. By eliminating unnecessary computation and accelerating training, Dataset Decomposition offers a pathway toward more efficient resource utilization in LLM development. Researchers may further explore the approach's implications for varied language tasks, expanding the scope of LLM applications. Building on this groundwork, future work could investigate broader applications of curriculum-based training and extend these principles to other machine learning modalities.

Authors (7)
  1. Hadi Pouransari (32 papers)
  2. Chun-Liang Li (60 papers)
  3. Jen-Hao Rick Chang (18 papers)
  4. Pavan Kumar Anasosalu Vasu (11 papers)
  5. Cem Koc (3 papers)
  6. Vaishaal Shankar (31 papers)
  7. Oncel Tuzel (62 papers)