
$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources (2410.23261v1)

Published 30 Oct 2024 in cs.CL and cs.LG

Abstract: Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can't pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.


Summary

  • The paper presents an empirical study quantifying GPU resource trade-offs in AI pre-training using typical academic setups.
  • It benchmarks a range of training configurations, demonstrating that optimized setups can replicate models originally trained on large clusters using far fewer GPUs over longer durations.
  • The study provides actionable cost-benefit analyses and technical optimizations that make cutting-edge AI research accessible to academia.

Overview of "$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources"

The paper "$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources" addresses a pressing issue in AI research—specifically the accessibility and feasibility of pre-training large AI models using limited computational resources typically available in academic settings. The authors conduct an empirical analysis to shed light on the often-intimidating resource barriers academic researchers face when engaging in pre-training experiments, which are usually compute-intensive.

Main Contributions

  1. Compute Resources Survey: The authors first conduct a survey to assess the typical compute resources available within academic institutions. They discover that academic researchers frequently have access to between 1 and 8 GPUs, usually for durations that span days to weeks. This survey highlights the disparity in computational resources between academia and industry, setting the stage for the subsequent empirical analysis.
  2. Benchmarking Training Configurations: The paper introduces a benchmark designed to measure the time required to pre-train models on different academic GPU setups. The benchmark was applied to several models and GPU configurations, accumulating approximately 2,000 GPU-hours of experimental data (a rough throughput-extrapolation sketch in this spirit appears after this list).
  3. Empirical Findings: Their empirical results provide an optimistic view compared to common pessimistic assumptions in academia. For example, the paper illustrates that replicating the Pythia-1B model, which originally required 64 GPUs over three days, can be achieved with 4 GPUs over 18 days using specific optimizations. This finding is critical as it dispels the notion that pre-training is entirely out of reach for under-resourced academic institutions.
  4. Optimization Strategies: The paper evaluates various efficient training methods (e.g., activation checkpointing, model sharding) and their combinations to minimize training time without altering the model architecture or compromising the training recipe. Because these optimizations leave the model and its training regimen untouched, the fidelity of the pre-training process is preserved (a minimal configuration sketch follows this list).
  5. Cost-Benefit Analysis: The authors perform a pragmatic cost-benefit analysis to guide researchers on making informed decisions about hardware investments based on training time and cost. For example, they analyze situations where investing in high-end GPUs like 8 H100s can be justified by significant time savings, thus making them more cost-effective in the long run.
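
As a concrete illustration of the benchmarking idea in point 2, the sketch below times a handful of training steps and extrapolates to a full pre-training run. It is not the paper's benchmark code: the `make_batch` helper, the Hugging Face-style `model(...).loss` interface, and the assumption of near-linear multi-GPU scaling are simplifications introduced here for illustration.

```python
import time
import torch

def estimate_pretraining_days(model, optimizer, make_batch, total_tokens,
                              tokens_per_step, num_gpus=1,
                              warmup_steps=5, timed_steps=20):
    """Time a few optimizer steps and extrapolate total pre-training time.

    `make_batch` is a hypothetical callable returning (input_ids, labels)
    already placed on the GPU; `total_tokens` is the target token budget.
    """
    model.train()
    start = None
    for step in range(warmup_steps + timed_steps):
        if step == warmup_steps:              # discard warm-up iterations
            torch.cuda.synchronize()
            start = time.perf_counter()
        input_ids, labels = make_batch()
        loss = model(input_ids, labels=labels).loss  # HF-style causal LM forward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    tokens_per_sec = timed_steps * tokens_per_step / elapsed
    # Assume roughly linear data-parallel scaling; real runs pay communication
    # overhead, so this is an optimistic lower bound on wall-clock time.
    total_seconds = total_tokens / (tokens_per_sec * num_gpus)
    return total_seconds / 86_400  # days
```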

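To make the methods named in point 4 more tangible, the following sketch shows how a few of them can be combined in plain PyTorch: ZeRO-style sharding via FSDP, bfloat16 mixed precision, activation checkpointing, and compilation. This is a minimal illustration under stated assumptions rather than the paper's configuration; `build_model()` and `TransformerBlock` are hypothetical placeholders for the model being trained and its layer class.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

# Assumes a distributed launch (e.g., torchrun) with one process per GPU.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model().cuda()  # hypothetical constructor for the model to pre-train

# Shard parameters, gradients, and optimizer state across GPUs (ZeRO-style),
# and run compute/communication in bfloat16 mixed precision.
model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
)

# Recompute activations during the backward pass instead of storing them,
# trading extra compute for a smaller memory footprint. `TransformerBlock`
# stands in for whatever layer class the model actually uses.
apply_activation_checkpointing(
    model, check_fn=lambda module: isinstance(module, TransformerBlock)
)

# Optionally let the compiler fuse kernels for additional speed.
model = torch.compile(model)
```
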
Implications and Future Directions

The findings of this paper have several implications for both the academic and industrial research communities. Practically, it provides academic researchers with actionable insights and a clear framework to approach the problem of pre-training large models on constrained budgets. It makes the case for re-thinking resource investments in university settings, suggesting that with careful planning, significant AI research endeavors are feasible.
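
To give a flavor of the budget reasoning involved, the snippet below compares the GPU-day totals quoted in the abstract for Pythia-1B and converts them to a dollar figure; the hourly rate is a hypothetical placeholder, not a price from the paper.

```python
# GPU-day figures come from the paper's abstract; the hourly price is an
# illustrative assumption, not a number reported in the paper.
original_gpu_days   = 64 * 3   # original Pythia-1B recipe: 64 GPUs for 3 days  -> 192
replicated_gpu_days = 4 * 18   # the paper's replication: 4 GPUs for 18 days    -> 72

price_per_gpu_hour = 2.00      # hypothetical cloud rate in USD

for label, gpu_days in [("original run", original_gpu_days),
                        ("academic replication", replicated_gpu_days)]:
    cost = gpu_days * 24 * price_per_gpu_hour
    print(f"{label}: {gpu_days} GPU-days, ~${cost:,.0f} at the assumed rate")
```

At the assumed rate this works out to roughly $9,200 versus $3,500 of compute; the paper's actual analysis weighs such rental costs against the up-front price and time savings of owning faster hardware, such as a node of 8 H100s.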

Theoretically, the results encourage a shift in perspective within AI research, emphasizing that exploration of new models and architectures should not be monopolized by well-funded industry labs. Smaller, more principled investigations carried out by academia can contribute meaningfully to scientific progress, thereby ensuring a vibrant, diverse, and competitive research landscape.

In the future, similar studies could examine more recent hardware and training methodologies. As models grow more complex, understanding the interplay between available resources and the practicality of emerging techniques will be crucial. Furthermore, extending the codebase to support a wider range of models, and sharing it as an extensible toolset, invites reproducibility and further experimentation within the community.

In conclusion, this paper demystifies model pre-training under constrained resources, offering clarity and encouragement for ambitious, well-controlled AI research within the academic ecosystem. Its contributions lay out a pathway for researchers aiming to make the most of limited resources while still engaging in cutting-edge model development.