- The paper revises the Critical Batch Size theory under WSD LR schedules, introducing a new E(S) formulation that captures dynamic training phases.
- It defines novel metrics $B_{\min}$ and $B_{\mathrm{opt}}$ to determine optimal batch sizes, empirically demonstrating improved data efficiency and convergence.
- Dynamic batch size scheduling is shown to outperform fixed-batch methods, leading to smoother loss convergence and enhanced downstream performance.
Revisiting Batch Size Scheduling for Large-Scale Model Pre-Training under WSD LR Schedules
Introduction
Choosing the batch size is a pivotal decision in optimizing large-scale pre-training for LLMs. Traditionally, the Critical Batch Size theory provided a principled framework for balancing data consumption and optimization steps under cosine learning rate schedules. However, the shift toward Warmup-Stable-Decay (WSD) learning rate schedulers has rendered its foundational assumptions, and the scaling relations that follow from them, inapplicable. The paper "How to Set the Batch Size for Large-Scale Pre-training?" (2601.05034) addresses this disconnect by developing a revised E(S) relationship tailored to WSD-based training regimes, identifying new batch size metrics, and devising a dynamic batch size scheduler that empirically yields improved efficiency and downstream performance.
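For orientation, a minimal sketch of a WSD schedule, assuming illustrative phase fractions and a linear decay shape (the paper's exact configuration may differ):

```python
# Minimal WSD (Warmup-Stable-Decay) learning rate schedule.
# Phase fractions and the linear decay shape are illustrative assumptions.

def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Piecewise LR: linear warmup -> constant plateau -> linear decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:          # warmup: ramp up to the peak LR
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:            # stable: hold the peak LR constant
        return peak_lr
    # decay: ramp down to zero over the final decay_steps
    return peak_lr * (total_steps - step) / max(1, decay_steps)
```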
Breakdown of Critical Batch Size Theory under WSD LR Schedules
The Critical Batch Size framework posits a monotonic trade-off between token consumption E and optimization steps S for achieving a fixed target loss, a relationship well-captured by the E(S) formula
$$\left(\frac{E}{E_{\min}} - 1\right)\left(\frac{S}{S_{\min}} - 1\right) = 1.$$
This theory assumes a cosine or static learning rate schedule; under WSD schedules, however, empirical evidence shows that loss curves for different batch sizes intersect as training progresses. In particular, at lower target losses, larger batch sizes may consume less data than smaller ones, directly contradicting the monotonic ordering predicted by the classic E(S) formula. This regime change is visually evident in the loss curves (Figure 1).
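To make the classic trade-off concrete, a small sketch that solves the relation above for E given S (the $E_{\min}$ and $S_{\min}$ values are placeholders):

```python
# Classic Critical Batch Size trade-off: solve
# (E/E_min - 1)(S/S_min - 1) = 1 for E given S.
# E_min and S_min below are illustrative placeholders.

def classic_E(S: float, E_min: float, S_min: float) -> float:
    """Tokens E needed to reach a target loss in S steps (classic theory)."""
    return E_min * (1.0 + 1.0 / (S / S_min - 1.0))

# The predicted ordering is monotone: more steps always means less data.
E_min, S_min = 1e9, 1e4
for S in (2e4, 5e4, 1e5):
    print(f"S={S:.0e} -> E={classic_E(S, E_min, S_min):.3e}")
```

It is exactly this monotonicity that the intersecting WSD loss curves violate.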
Figure 1: Loss curves for models trained with different batch sizes under WSD schedule, illustrating breakdown of Critical Batch Size theory as the curves invert partial ordering post-intersection.
To explain the observed phenomena, the authors construct a piecewise E(S) relationship reflecting three distinct dynamic phases:
- Inverse Linear Stage: $E$ varies inversely with $S - S_{\min}$.
- Transition Stage: $E$ follows a quadratic function of $S$.
- Linear Stage: $E$ increases linearly with $S$.
These phases collectively capture the trade-off between optimization steps and data usage across batch sizes in WSD-scheduled training. The revised E(S) is fit to data subject to continuity and differentiability constraints at the stage boundaries, and it closely matches empirical measurements across a suite of model sizes and batch size settings (Figure 2).
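A hedged sketch of this construction follows: the functional forms mirror the three named stages, and the value/slope matching enforces the continuity and differentiability constraints, but every coefficient and breakpoint here is a hypothetical placeholder rather than a fitted value.

```python
import numpy as np

def piecewise_E(S, S_min, a, S1, S2, c2):
    """Illustrative three-stage E(S), matched in value and slope at S1, S2."""
    S = np.asarray(S, dtype=float)
    v1 = a / (S1 - S_min)                # stage-1 value at breakpoint S1
    d1 = -a / (S1 - S_min) ** 2          # stage-1 slope at S1
    quad = lambda s: v1 + d1 * (s - S1) + c2 * (s - S1) ** 2
    v2 = quad(S2)                        # stage-2 value at breakpoint S2
    d2 = d1 + 2.0 * c2 * (S2 - S1)       # stage-2 slope at S2
    return np.where(S < S1, a / (S - S_min),    # inverse-linear stage
           np.where(S < S2, quad(S),            # transition (quadratic) stage
                    v2 + d2 * (S - S2)))        # linear stage

# Hypothetical parameters chosen so E first falls, then rises with S:
S = np.linspace(1.1e4, 2e5, 1000)
E = piecewise_E(S, S_min=1e4, a=1e13, S1=2e4, S2=5e4, c2=5.0)
```

In an actual fit, a, S1, S2, and c2 would be estimated from measured (S, E) pairs at a fixed target loss.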
Figure 2: Fitting results of E(S) for InternLM2-1B in the loss interval [2.93, 3.25], validating the accuracy of the new formulation.
Emergent Metrics: $B_{\min}$ and $B_{\mathrm{opt}}$
Analysis of the new E(S) yields two core batch size metrics that supplant the old Critical Batch Size (a numerical reading is sketched after the definitions):
- $B_{\min}$: the minimum batch size required to reach a particular target loss.
- $B_{\mathrm{opt}}$: the batch size that optimizes data efficiency, minimizing token consumption to achieve the target loss.
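One plausible numerical reading of these definitions assumes the identity $E = B \cdot S$, so each point on the E(S) curve implies a batch size $B(S) = E(S)/S$; the stand-in curve below is illustrative, not the paper's fitted one.

```python
import numpy as np

# Stand-in E(S) with an inverse-linear head and a linear tail; in
# practice one would use the fitted three-stage curve instead.
S_min = 1e4
E = lambda S: 5e8 + 1e13 / (S - S_min) + 2e4 * S

S_grid = np.linspace(1.1e4, 5e5, 200_000)
E_grid = E(S_grid)
B_grid = E_grid / S_grid            # implied batch size at each point

B_opt = B_grid[np.argmin(E_grid)]   # batch at the data-efficiency optimum
B_min = B_grid.min()                # smallest batch that still reaches the loss
print(f"B_opt ~ {B_opt:.3e} tokens/step, B_min ~ {B_min:.3e} tokens/step")
```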
Figure 3 shows that both $B_{\min}$ and $B_{\mathrm{opt}}$ increase monotonically as training loss decreases, providing the rationale for progressive batch size expansion during pre-training.
Figure 3: The scaling relationship shows that $B_{\min}$ and $B_{\mathrm{opt}}$ rise with decreasing target loss across model sizes.
Practical Batch Size Scheduling: Dynamic Expansion
Given the non-optimality of fixed batch sizes in the WSD regime, a dynamic batch size scheduling algorithm is derived that progressively expands the batch size, informed by the empirical curves of $B_{\mathrm{opt}}$ relative to cumulative data consumption. This strategy maximizes data efficiency and achieves deeper convergence.
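A minimal sketch of such a schedule, assuming a step-function expansion keyed to cumulative tokens; the thresholds and batch sizes are hypothetical placeholders, not the fitted $B_{\mathrm{opt}}$ curve:

```python
# Progressive batch size expansion keyed to cumulative data consumption.
# Thresholds and batch sizes are hypothetical placeholders.
BATCH_SCHEDULE = [          # (cumulative tokens, global batch in sequences)
    (0e9,    512),
    (100e9, 1024),
    (300e9, 2048),
    (700e9, 4096),
]

def batch_size_for(tokens_consumed: float) -> int:
    """Largest scheduled batch whose token threshold has been passed."""
    batch = BATCH_SCHEDULE[0][1]
    for threshold, b in BATCH_SCHEDULE:
        if tokens_consumed >= threshold:
            batch = b
    return batch
```

In a training loop, tokens_consumed would be updated every step and the data loader reconfigured whenever batch_size_for changes its answer.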
Empirical studies with Qwen3-Dense and Qwen3-MoE architectures demonstrate that dynamic batch size schedules outperform fixed-batch baselines in both training efficiency and downstream task results. In both cases, trained under constant learning rate regimes, the dynamic strategy yields smoother, faster loss convergence and higher MMLU/CMMLU scores (Figures 4-7).
Figure 4: Training loss trajectories under fixed vs. dynamic batch size scheduling for Qwen3 MoE model at constant learning rate.
Figure 5: Downstream benchmark results for Qwen3 MoE, confirming sustained superiority of dynamic scheduling.
Figure 6: Training loss curves for Qwen3 Dense comparing fixed with dynamic batch schedule approaches.
Figure 7: Comparative evaluation for Qwen3 Dense on downstream tasks; dynamic scheduling maintains higher scores.
Ablations and Robustness
Comprehensive ablations confirm the generality and adaptability of the dynamic batch size scheduler across:
- Cosine LR Schedules: Dynamic batch size remains advantageous (Figure 8).
- Learning Rate Scaling: Synchronous LR scaling with batch size offers no additional gains and may dilute gradient noise suppression (Figure 9).
- Sequence Length Scaling vs. Micro-batch Expansion: Scaling the sequence length introduces undesirable distribution shift and adaptation delays, whereas micro-batch expansion does not (Figure 10; see the sketch after this list).
- Weight Decay: The effectiveness of dynamic batch sizing depends critically on sufficient regularization strength (Figure 11).
- Continued Training/Annealing: Dynamic scheme sustains its advantage into the annealing phase with decayed learning rates (Figure 12).
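The micro-batch expansion route referenced above can be realized with gradient accumulation, which grows the global batch while keeping the per-sample sequence length, and hence the data distribution, fixed. A sketch under assumed names (this is not the paper's training code):

```python
# Growing the global batch via gradient accumulation keeps sequence
# length fixed, avoiding the distribution shift seen with length scaling.

def accumulation_steps(global_batch: int, micro_batch: int,
                       data_parallel_size: int) -> int:
    """Gradient-accumulation steps needed to realize a global batch."""
    per_pass = micro_batch * data_parallel_size   # sequences per forward/backward pass
    assert global_batch % per_pass == 0, "global batch must divide evenly"
    return global_batch // per_pass

# Doubling the global batch doubles accumulation, not sequence length:
print(accumulation_steps(2048, micro_batch=4, data_parallel_size=64))  # 8
print(accumulation_steps(4096, micro_batch=4, data_parallel_size=64))  # 16
```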
Figure 8: Dynamic batch size scheduling remains beneficial with cosine learning rate schedules.
Figure 9: No substantive improvement from learning rate scaling alongside batch size expansion.
Figure 10: Sequence length scaling causes performance drops not observed with micro-batch scaling.
Figure 11: Reduced weight decay diminishes advantage of dynamic batch size scheduling.
Figure 12: Dynamic scheduling's advantage persists through the learning rate annealing phase.
Theoretical and Practical Implications
The findings formally invalidate the classic Critical Batch Size theory for current large-scale training practice employing WSD LR schedules. Crucially, the efficiency-optimal batch size is not static but monotonically increases as training progresses, motivating dynamic batch size schedules. This realization has direct practical impact on LLM pre-training: fixed batch size configurations are suboptimal, and dynamic expansion should be implemented to maximize both efficiency and model quality.
Theoretically, the suite of analytical results, namely the tripartite E(S) structure and its associated constraints, provides a basis for further predictive modeling of training dynamics and hyperparameter scaling, advancing principled approaches to large-scale model optimization.
Future Directions
The paper flags notable open challenges, including generalization of E(S) fitting across LR schedules, formal proof of global optimality for dynamic scheduling, and mitigation of distributional shifts associated with sequence length scaling. Addressing these will enable broader adoption and flexibility in batch scheduling paradigms.
Conclusion
This study robustly establishes the breakdown of prior batch size theory under WSD learning rate scheduling and provides a new theoretical and empirical basis for dynamic batch size scheduling in large-scale pre-training. The dynamic strategy yields tangible gains in efficiency and downstream model quality and should be incorporated into next-generation foundation model pipelines. The investigative framework also opens new avenues for hyperparameter scaling law research as model scale and training corpus size continue to grow.