DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing (2212.03597v3)
Abstract: Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one root cause, but a less-emphasized fact is that data scale is growing at a similar rate as model scale, and the training cost is proportional to both. Compared to the rapidly evolving model architectures, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to realize, due to the lack of a convenient framework focused on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B LLM pretraining, our work achieves 12.5x less data/time/cost (\$3.7K if rented on Azure) while still maintaining 95% of model quality compared to the baseline with full data and cost (\$46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefits on additional tasks, including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.
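The two techniques named above can be illustrated with a short, self-contained sketch. The snippet below is written for this summary and is not the DeepSpeed Data Efficiency API: `linear_pacing` is a hypothetical stand-in for a curriculum learning schedule that grows a difficulty metric (e.g., training sequence length) from easy to hard over training steps, and `random_ltd_layer` sketches the core idea of random layerwise token dropping, where only a random subset of token positions is routed through a given transformer layer while the rest bypass it unchanged. All function names and parameters are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the DeepSpeed Data Efficiency API.
# `linear_pacing` and `random_ltd_layer` are hypothetical helper names.
import torch


def linear_pacing(step, total_steps, min_value, max_value):
    """Curriculum learning pacing: linearly grow a difficulty metric
    (e.g., training sequence length) from min_value to max_value."""
    frac = min(1.0, step / max(1, total_steps))
    return int(min_value + frac * (max_value - min_value))


def random_ltd_layer(hidden_states, layer, keep_ratio=0.5):
    """Random layerwise token dropping (random-LTD), sketched:
    route a random subset of token positions through `layer`;
    dropped positions bypass the layer unchanged.

    hidden_states: (batch, seq_len, hidden_dim)
    layer:         a batch-first transformer layer (any callable)
    keep_ratio:    fraction of tokens processed by this layer at this step
    """
    batch, seq_len, hidden_dim = hidden_states.shape
    num_keep = max(1, int(seq_len * keep_ratio))

    # Sample which positions of each sequence are routed through the layer.
    scores = torch.rand(batch, seq_len, device=hidden_states.device)
    keep_idx = scores.topk(num_keep, dim=1).indices                 # (batch, num_keep)
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, hidden_dim)

    # Process only the selected tokens, then scatter them back in place.
    selected = hidden_states.gather(1, gather_idx)                  # (batch, num_keep, hidden)
    processed = layer(selected)
    output = hidden_states.clone()
    output.scatter_(1, gather_idx, processed)
    return output
```

In a training loop, the pacing function would set how hard (or how long) the sampled examples are at each step, while the random-LTD keep ratio would itself follow a schedule that starts small and returns to 1.0 late in training so that full sequences eventually pass through every layer; such schedules are the kind of knobs the paper tunes to trade data/time/cost against model quality.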