Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism (2412.21124v2)

Published 30 Dec 2024 in cs.LG, math.OC, and stat.ML

Abstract: An appropriate choice of batch sizes in large-scale model training is crucial, yet it involves an intrinsic yet inevitable dilemma: large-batch training improves training efficiency in terms of memory utilization, while generalization performance often deteriorates due to small amounts of gradient noise. Despite this dilemma, the common practice of choosing batch sizes in LLM training often prioritizes training efficiency -- employing either constant large sizes with data parallelism or implementing batch size warmup schedules. However, such batch size schedule designs remain heuristic and often fail to adapt to training dynamics, presenting the challenge of designing adaptive batch size schedules. Given the abundance of available datasets and the data-hungry nature of LLMs, data parallelism has become an indispensable distributed training paradigm, enabling the use of larger batch sizes for gradient computation. However, vanilla data parallelism requires replicas of model parameters, gradients, and optimizer states at each worker, which prohibits training larger models with billions of parameters. To optimize memory usage, more advanced parallelism strategies must be employed. In this work, we propose general-purpose and theoretically principled adaptive batch size schedules compatible with data parallelism and model parallelism. We develop a practical implementation with PyTorch Fully Sharded Data Parallel, facilitating the pretraining of LLMs of different sizes. We empirically demonstrate that our proposed approaches outperform constant batch sizes and heuristic batch size warmup schedules in the pretraining of models in the Llama 2 family, with particular focus on smaller models with up to 3 billion parameters. We also establish theoretical convergence guarantees for such adaptive batch size schedules with Adam for general smooth nonconvex objectives.

Summary

  • The paper proposes an adaptive batch size method that dynamically adjusts batches to balance training efficiency and model generalization.
  • It pairs a theoretically principled schedule with a practical PyTorch FSDP implementation to overcome memory constraints and support training models with billions of parameters.
  • Empirical results show superior performance over static batch sizing, especially for language models up to three billion parameters.

An Examination of Adaptive Batch Size Schedules for Distributed Training with LLMs

This paper investigates adaptive batch size schedules designed to improve the training efficiency and memory utilization of distributed training for LLMs. Leveraging data and model parallelism, the authors propose a theoretically principled framework and a practical implementation of adaptive batch sizing built on PyTorch Fully Sharded Data Parallel (FSDP).

Overview of Challenges and Proposed Solutions

In large-scale model training, the trade-off between training efficiency (favored by large-batch methods) and generalization performance (where smaller, noisier batches often prevail) is well documented. Conventional practice prioritizes training efficiency, yet keeping the batch size static throughout training often yields suboptimal outcomes. To address this, the authors advocate adaptive batch size schedules that track changing training dynamics and remain compatible with both data and model parallelism.
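To make the idea concrete, one adaptive criterion from the adaptive sampling literature is the norm test, which grows the batch size once gradient noise dominates the gradient signal. The sketch below is a minimal illustration of such a rule, not the paper's exact criterion; the threshold `theta`, the doubling policy, and the cap are illustrative assumptions.

```python
import torch

def norm_test_batch_size(per_sample_grads: torch.Tensor,
                         current_batch_size: int,
                         theta: float = 0.9,
                         max_batch_size: int = 4096) -> int:
    """Illustrative norm-test rule: increase the batch size when the
    empirical gradient variance is large relative to the squared norm
    of the mean gradient (i.e., noise dominates signal).

    per_sample_grads: tensor of shape (batch, num_params) holding the
    flattened per-sample gradients of the current mini-batch.
    """
    mean_grad = per_sample_grads.mean(dim=0)
    # Unbiased sample variance per coordinate, summed into a scalar noise estimate.
    noise = per_sample_grads.var(dim=0).sum()
    signal = mean_grad.norm() ** 2

    # If noise / (batch * signal) exceeds theta^2, the mini-batch gradient is
    # too noisy; grow the batch (here: double it, capped at max_batch_size).
    if noise / (current_batch_size * signal + 1e-12) > theta ** 2:
        return min(2 * current_batch_size, max_batch_size)
    return current_batch_size
```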

The proposed method fits within the established conventions of PyTorch FSDP, supporting the training of models with billions of parameters. Central to this design is avoiding the constraint of vanilla data parallelism, which requires full replicas of model parameters, gradients, and optimizer states at each worker and thereby imposes memory limits on model size. The authors sidestep this by integrating adaptive sampling methods into existing parallelism frameworks, keeping memory usage efficient without sacrificing training scale.
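As a rough sketch of how an adaptive schedule can coexist with FSDP-style sharding, the snippet below wraps a model in PyTorch's FullyShardedDataParallel and rebuilds the data loader whenever the schedule proposes a new per-worker batch size. The `propose_batch_size` callback, the per-epoch adjustment frequency, and the classification-style loss are placeholder assumptions rather than the authors' implementation; process-group initialization and device placement are omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data import DataLoader, DistributedSampler

def train_with_adaptive_batches(model, dataset, propose_batch_size,
                                epochs=1, initial_batch_size=32):
    # Assumes torch.distributed is already initialized (e.g., via torchrun).
    # FSDP shards parameters, gradients, and optimizer state across ranks
    # instead of replicating them -- the key memory saving over vanilla DDP.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    batch_size = initial_batch_size
    for epoch in range(epochs):
        sampler = DistributedSampler(dataset)
        sampler.set_epoch(epoch)
        loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

        for inputs, targets in loader:
            optimizer.zero_grad(set_to_none=True)
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()

        # Ask the schedule for the next per-worker batch size; rebuilding the
        # loader each epoch is one simple way to apply the change.
        batch_size = propose_batch_size(model, batch_size)
    return model
```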

Empirical Validation and Theoretical Insights

The authors empirically validate their approach by demonstrating superior performance over constant batch sizes and heuristic warmup schedules in pretraining models from the Llama 2 family, with the clearest gains for models of up to three billion parameters. The adaptive schedules improve generalization without the manual tuning demanded by fixed heuristics such as batch size warmup.
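For contrast, the heuristic baseline typically amounts to a fixed ramp that ignores training dynamics. A minimal sketch of such a warmup schedule follows; the start size, target size, and step count are arbitrary illustrations, not the paper's configuration.

```python
def warmup_batch_size(step: int,
                      start: int = 256,
                      target: int = 4096,
                      warmup_steps: int = 10_000) -> int:
    """Linearly ramp the batch size from `start` to `target` over
    `warmup_steps` optimizer steps, then hold it constant. The schedule
    is fixed in advance and does not react to gradient noise."""
    if step >= warmup_steps:
        return target
    frac = step / warmup_steps
    return int(start + frac * (target - start))
```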

Moreover, the researchers establish rigorous theoretical guarantees by presenting a convergence analysis of their adaptive batch size schedules with the Adam optimizer. The analysis covers general smooth nonconvex objectives, closing a gap in the literature and bolstering the theoretical grounding of the proposal.
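Schematically, and in standard notation rather than the paper's exact theorem statement, the analysis concerns mini-batch gradient estimates with adaptively chosen batch sizes fed into Adam for a smooth nonconvex objective F, with a stationarity guarantee of the usual form:

```latex
% Mini-batch gradient estimator with an adaptively chosen batch size b_t
g_t \;=\; \frac{1}{b_t} \sum_{i \in \mathcal{B}_t} \nabla f_i(x_t),
\qquad |\mathcal{B}_t| = b_t,

% fed into an Adam-type update, with a guarantee of the generic nonconvex form
\min_{1 \le t \le T} \; \mathbb{E}\!\left[ \|\nabla F(x_t)\|^2 \right] \;\le\; \epsilon
\quad \text{for sufficiently large } T.
```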

Implications and Prospective Developments

The proposed adaptive batch sizing approach yields important practical implications for AI researchers and practitioners aiming to optimize the resource-intensive pretraining of LLMs. By dynamically adjusting batch sizes, practitioners can better allocate computational resources, reduce training time, and enhance the performance of fully trained models.

Looking ahead, this work points to several promising directions, particularly as transformer models see ever wider deployment in natural language processing. Extending the implementation to more sophisticated parallelism strategies and exploring scaling laws for critical batch sizes are natural avenues for further research. The adaptive methods could also be carried into broader domains, including vision transformers and autoregressive image models, expanding their applicability across diverse AI fields.

In conclusion, this paper provides a robust framework and implementation strategy for adaptive batch size schedules in distributed model training. Through a combination of empirical evidence and theoretical analysis, it contributes to the ongoing discourse on optimizing LLM training, presenting a pathway towards more efficient and scalable AI systems.