$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources (2410.23261v1)
Abstract: Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can't pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.
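As a back-of-the-envelope check of the GPU-days comparison quoted in the abstract, here is a minimal sketch of the arithmetic. The GPU counts and day counts are taken directly from the text above; the variable names and the script itself are illustrative and not part of the paper's released codebase.

```python
# Illustrative arithmetic only: compare total GPU-days for the two
# Pythia-1B training configurations quoted in the abstract.
original = {"gpus": 64, "days": 3}     # original Pythia-1B training run
replication = {"gpus": 4, "days": 18}  # replication setting reported in the abstract

original_gpu_days = original["gpus"] * original["days"]         # 64 * 3 = 192
replication_gpu_days = replication["gpus"] * replication["days"]  # 4 * 18 = 72

# ~2.7x fewer GPU-days, which the abstract rounds to "3x fewer".
reduction = original_gpu_days / replication_gpu_days
print(f"{original_gpu_days} vs. {replication_gpu_days} GPU-days "
      f"({reduction:.1f}x fewer)")
```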
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. Architectural Support for Programming Languages and Operating Systems (ASPLOS).
- Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory. Transactions on Mathematical Software (TOMS).
- Stas Bekman. 2023–2024. Machine Learning Engineering Open Book. https://github.com/stas00/ml-engineering.
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. International Conference on Machine Learning (ICML).
- PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research (JMLR).
- Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. International Conference on Learning Representations (ICLR).
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Neural Information Processing Systems (NeurIPS).
- The Efficiency Misnomer. International Conference on Learning Representations (ICLR).
- Tim Dettmers. 2018. A Full Hardware Guide to Deep Learning. https://timdettmers.com/2018/12/16/deep-learning-hardware-guide.
- Tim Dettmers. 2023. Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning. https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning.
- 8-bit Optimizers via Block-wise Quantization. International Conference on Learning Representations (ICLR).
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).
- Hugging Face. 2024. Transformers (Documentation). https://huggingface.co/docs/transformers/v4.42.4/en/perf_train_gpu_one and https://huggingface.co/docs/transformers/v4.42.4/en/perf_train_gpu_many.
- William Falcon and the PyTorch Lightning team. 2019. PyTorch Lightning. https://github.com/Lightning-AI/lightning.
- Jonas Geiping and Tom Goldstein. 2023. Cramming: Training a Language Model on a Single GPU in One Day. International Conference on Machine Learning (ICML).
- Deep Learning Tuning Playbook. http://github.com/google-research/tuning_playbook.
- OLMo: Accelerating the Science of Language Models. Association for Computational Linguistics (ACL).
- Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Conference on Language Modeling (COLM).
- Efficient Parallelization Layouts for Large-Scale Distributed Model Training. Conference on Language Modeling (COLM).
- FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention. https://pytorch.org/blog/flexattention.
- Liger Kernel: Efficient Triton Kernels for LLM Training. Preprint, arXiv:2410.10989.
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Neural Information Processing Systems (NeurIPS).
- How to Train BERT with an Academic Budget. Empirical Methods in Natural Language Processing (EMNLP).
- Jean Kaddour. 2022. Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging. Has it Trained Yet? Workshop at NeurIPS.
- No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models. Neural Information Processing Systems (NeurIPS).
- Surveying (Dis)Parities and Concerns of Compute Hungry NLP Research. Preprint, arXiv:2306.16900.
- Memory Efficient Optimizers with 4-bit States. Neural Information Processing Systems (NeurIPS).
- 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed. International Conference on High Performance Computing, Data, and Analytics (HiPC).
- ReLoRA: High-Rank Training Through Low-Rank Updates. International Conference on Learning Representations (ICLR).
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint, arXiv:1907.11692.
- A ConvNet for the 2020s. Computer Vision and Pattern Recognition (CVPR).
- CAME: Confidence-guided Adaptive Memory Efficient Optimization. Association for Computational Linguistics (ACL).
- Mixed Precision Training. International Conference on Learning Representations (ICLR).
- FP8 Formats for Deep Learning. Preprint, arXiv:2209.05433.
- PipeDream: Generalized Pipeline Parallelism for DNN Training. Symposium on Operating Systems Principles (SOSP).
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
- Piotr Nawrot. 2023. nanoT5: Fast & Simple Pre-training and Fine-tuning of T5 Models with Limited Resources. Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP.
- NVIDIA. 2023. Optimizing Linear/Fully-Connected Layers. https://docs.nvidia.com/deeplearning/performance/pdf/Optimizing-Linear-Fully-Connected-Layers-User-Guide.pdf.
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. Neural Information Processing Systems (NeurIPS).
- MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining. Neural Information Processing Systems (NeurIPS).
- PyTorch. 2024. torchao: PyTorch Architecture Optimization. https://github.com/pytorch/ao.
- Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning (ICML).
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
- DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. International Conference on Knowledge Discovery & Data Mining (KDD).
- ZeRO-Offload: Democratizing Billion-Scale Model Training. USENIX Annual Technical Conference.
- Inheritune: Training Smaller Yet More Attentive Language Models. Preprint, arXiv:2404.08634.
- Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget. Preprint, arXiv:2407.15811.
- Mesh-TensorFlow: Deep Learning for Supercomputers. Neural Information Processing Systems (NeurIPS).
- Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. International Conference on Machine Learning (ICML).
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Preprint, arXiv:1909.08053.
- Dusan Stosic. 2020. Training Neural Networks with Tensor Cores. Tutorial on Accelerating Computer Vision with Mixed Precision at ECCV.
- 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed. International Conference on Machine Learning (ICML).
- The Mosaic ML Team. 2021. composer. https://github.com/mosaicml/composer.
- Image Captioners Are Scalable Vision Learners Too. Neural Information Processing Systems (NeurIPS).
- Dataset Distillation. Preprint, arXiv:1811.10959.
- HuggingFace’s Transformers: State-of-the-art Natural Language Processing. Preprint, arXiv:1910.03771.
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. International Conference on Very Large Data Bases (VLDB).