
Optimal batch-size scheduling for minimizing serial runtime without performance loss

Determine the batch-size schedule for large language model pretraining that minimizes serial runtime without sacrificing performance; that is, identify a principled batch-size ramping policy that achieves the fastest possible training in wall-clock time while maintaining model quality comparable to standard training practice.


Background

Recent large-scale LLM training runs commonly employ schedules that gradually increase the batch size over the course of training ("batch ramps"), but these schedules are typically tuned heuristically and lack theoretical grounding. Establishing a principled, potentially optimal batch-size schedule would directly improve training efficiency and wall-clock time, especially given hardware constraints and the scaling laws that govern pretraining.
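
For concreteness, the sketch below shows the kind of heuristic, step-wise batch ramp this refers to. It is a minimal illustration only; the token thresholds and batch sizes are assumed placeholder values, not figures from the paper or any specific training run.

    # Illustrative heuristic batch ramp: step-wise batch-size increases at
    # hand-picked token thresholds. All numbers below are assumed
    # placeholders for illustration.

    RAMP_SCHEDULE = [
        # (tokens_seen_threshold, batch_size_in_sequences)
        (0,              256),
        (50_000_000,     512),
        (200_000_000,   1024),
        (1_000_000_000, 2048),
    ]

    def batch_size_at(tokens_seen: int) -> int:
        """Return the batch size in effect after `tokens_seen` training tokens."""
        current = RAMP_SCHEDULE[0][1]
        for threshold, bsz in RAMP_SCHEDULE:
            if tokens_seen >= threshold:
                current = bsz
            else:
                break
        return current

In practice the thresholds and step sizes in such ramps are chosen by trial and error, which is exactly the lack of principled grounding the question above targets.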

The paper proposes Seesaw, a method motivated by theoretical equivalence results between learning-rate decay and batch-size ramping under certain conditions, and it demonstrates empirical speedups. However, the broader question of which batch-size schedule is optimal, beyond the specific equivalence regimes and practical constraints studied there, is explicitly posed as a central open question.
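
As a rough illustration of the flavor of equivalence involved (this is the classic SGD noise-scale heuristic, not the paper's specific result), the per-step gradient noise is often treated as proportional to learning rate divided by batch size, so decaying the learning rate by a factor k can be traded for increasing the batch size by the same factor. A minimal sketch, with illustrative numbers:

    # Rough sketch of the classic noise-scale heuristic: SGD noise is treated
    # as proportional to lr / batch_size, so a learning-rate decay by factor k
    # can instead be realized as a batch-size increase by factor k. This is an
    # assumed illustration, not the equivalence proved for Seesaw.

    def equivalent_batch_ramp(base_lr: float, base_bsz: int, decay_factors):
        """For each cumulative decay factor k, return the (lr, batch_size) pair
        that keeps lr / batch_size, and hence the heuristic noise scale, fixed."""
        schedule = []
        for k in decay_factors:
            # Option A: decay the learning rate, keep the batch size.
            lr_decay = (base_lr / k, base_bsz)
            # Option B: keep the learning rate, grow the batch size instead.
            batch_ramp = (base_lr, int(base_bsz * k))
            schedule.append((lr_decay, batch_ramp))
        return schedule

    # Cumulative decay factors 1, 2, 4 give the same lr / batch_size ratio
    # whether we halve the learning rate or double the batch size each time.
    for lr_decay, batch_ramp in equivalent_batch_ramp(3e-4, 256, [1, 2, 4]):
        print(lr_decay, batch_ramp)

The batch-ramp option takes fewer serial optimizer steps for the same number of tokens, which is the source of the potential wall-clock savings; the open question is how far and under what schedule this trade can be pushed without hurting final model quality.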

References

However, to the best of our knowledge, the "batch ramp" schedules are not theoretically grounded and instead tuned heuristically. The lack of theoretical justification leaves open whether these heuristics are close to optimal, motivating the central question of our study: what is the optimal batch size schedule for minimizing serial runtime while not sacrificing performance?

Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling (2510.14717 - Meterez et al., 16 Oct 2025) in Section 1 (Introduction)