Computational Bottlenecks of Training Small-scale Large Language Models (2410.19456v2)

Published 25 Oct 2024 in cs.LG

Abstract: While LLMs dominate the AI landscape, Small-scale LLMs (SLMs) are gaining attention due to cost and efficiency demands from consumers. However, there is limited research on the training behavior and computational requirements of SLMs. In this study, we explore the computational bottlenecks of training SLMs (up to 2B parameters) by examining the effects of various hyperparameters and configurations, including GPU type, batch size, model size, communication protocol, attention type, and the number of GPUs. We assess these factors on popular cloud services using metrics such as loss per dollar and tokens per second. Our findings aim to support the broader adoption and optimization of LLM training for low-resource AI research institutes.

Analyzing the Computational Bottlenecks in Training Small-scale LLMs

The paper "Computational Bottlenecks of Training Small-scale LLMs" addresses an emerging area of research in the field of artificial intelligence, specifically focusing on the training dynamics and computational challenges of Small-scale LLMs (SLMs) with up to 2 billion parameters. This work examines various hyperparameters and hardware configurations, crucial for optimizing SLM training, particularly in environments constrained by computational resources and budget.

Key Findings

The research identifies several bottlenecks and offers recommendations for effective training:

  1. FlashAttention Significance: FlashAttention (FA) is highlighted as a critical component for cost-efficient training. The data suggests FA is particularly advantageous for smaller models because it alleviates data-movement bottlenecks, which dominate when the per-layer arithmetic workload is modest. FA also allows larger batch sizes without the out-of-memory (OOM) errors often encountered with vanilla attention, a consideration made more pressing by the quadratic cost of attention in sequence length. A minimal usage sketch appears after this list.
  2. Hardware Configuration: The paper compares different GPU types, finding that A100-40GB GPUs are sufficient for smaller models, while A100-80GB GPUs are preferred for larger models and configurations with many GPUs. Notably, more advanced and expensive hardware, such as H100-80GB GPUs, does not necessarily provide a proportional cost benefit for SLMs, underscoring the need for a nuanced approach to hardware selection.
  3. Distributed Training Schemes: Analysis of communication strategies concludes that Distributed Data Parallel (DDP) is most effective for smaller models, while Fully Sharded Data Parallel (FSDP) shows advantages for larger configurations. The benefits of FSDP, particularly the Grad+Optimizer sharding variant, stem largely from reduced communication overhead relative to full sharding, enabling efficient scaling with fewer resources; see the sketch after this list.
  4. Optimization Metrics: The paper proposes "loss per dollar" and "tokens per dollar" as metrics for evaluating training efficiency. These metrics tie model quality and throughput to financial cost, offering a practical framework for research institutions operating under budgetary constraints; a worked example follows this list.
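
To make item 1 concrete, the sketch below routes attention through the FlashAttention kernel via PyTorch's scaled_dot_product_attention. This is a minimal illustration, assuming PyTorch 2.2+ on a CUDA device with bf16 support; the tensor shapes are placeholders and do not reflect the paper's configurations.

```python
# Minimal sketch: dispatching attention to the FlashAttention backend in PyTorch.
# Shapes and dtypes below are illustrative placeholders, not the paper's settings.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

batch, heads, seq_len, head_dim = 8, 16, 2048, 64  # hypothetical SLM-like sizes
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict the SDPA dispatcher to the FlashAttention kernel; PyTorch raises an
# error if that backend cannot handle the given shapes/dtypes.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```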
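
For item 3, the following sketch contrasts DDP with FSDP's Grad+Optimizer sharding variant, using PyTorch's built-in ShardingStrategy.SHARD_GRAD_OP. The stand-in model and launcher assumptions (one process per GPU via torchrun) are placeholders rather than the paper's training setup.

```python
# Minimal sketch: choosing between DDP and FSDP (Grad+Optimizer sharding) in PyTorch.
# The stand-in model and setup are hypothetical; they do not mirror the paper's code.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                      # assumes launch via torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.TransformerEncoderLayer(d_model=2048, nhead=16).cuda()  # stand-in block

use_fsdp = True  # per the findings: DDP for the smallest models, FSDP as models grow
if use_fsdp:
    # SHARD_GRAD_OP shards gradients and optimizer state but keeps full parameters
    # on each rank, gaining memory headroom with less communication than full sharding.
    model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP, device_id=local_rank)
else:
    model = DDP(model, device_ids=[local_rank])
```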
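
For item 4, a small helper shows how a tokens-per-dollar figure can be derived from measured throughput and a GPU rental rate; the rate and throughput below are made-up values for illustration, not numbers from the paper.

```python
# Minimal sketch of a cost-normalized throughput metric; all numbers are illustrative.
def tokens_per_dollar(tokens_per_second: float, gpu_hourly_cost_usd: float, num_gpus: int) -> float:
    """Tokens processed per dollar of GPU rental cost."""
    dollars_per_second = (gpu_hourly_cost_usd * num_gpus) / 3600.0
    return tokens_per_second / dollars_per_second

# Example: 50,000 tokens/s across 8 GPUs rented at $4/hour each (hypothetical values).
print(f"{tokens_per_dollar(50_000, 4.0, 8):,.0f} tokens per dollar")
```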

Implications

This research provides critical insights into the resource-efficient deployment of SLMs, which can cater to environments with limited computational budgets, such as small businesses and academic institutions. The findings advocate for adopting training methodologies that challenge the prevailing focus on extensive hardware resources, shifting the narrative towards efficiency and accessibility.

Future Prospects

The implications of this paper are substantial, suggesting shifts in AI research and deployment strategies. As the landscape evolves, the adoption of SLMs is likely to expand, supported by hardware and software advances that further optimize cost-performance ratios. The exploration of additional metrics related to CPU/GPU utilization and memory bandwidth usage presents opportunities for future research.

Overall, the paper systematically unpacks the complexities of training SLMs, providing a significant contribution to the discourse on cost-efficiency in AI. The emphasis on targeted hardware choices and innovative attention mechanisms anticipates a landscape where smaller, efficient models play a vital role without compromising on performance.

Authors (7)
  1. Saleh Ashkboos (20 papers)
  2. Iman Mirzadeh (11 papers)
  3. Keivan Alizadeh (8 papers)
  4. Mohammad Hossein Sekhavat (4 papers)
  5. Moin Nabi (44 papers)
  6. Mehrdad Farajtabar (56 papers)
  7. Fartash Faghri (32 papers)