Optimal batch-size scheduling for minimizing serial runtime without performance loss
Determine the optimal batch-size schedule for large language model pretraining that minimizes serial runtime without sacrificing performance: that is, identify a principled batch-size ramping policy that achieves the fastest possible wall-clock training while matching the model quality of standard training practices.
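To make the notion of a "batch-size ramping policy" concrete, the sketch below shows one common heuristic: a linear ramp that grows the batch size from a small initial value to a large final value over an early fraction of training, then holds it constant. This is a hypothetical illustration under assumed parameters (`b_min`, `b_max`, `ramp_frac` are invented names), not the schedule proposed or analyzed in the paper.

```python
def batch_ramp(step, total_steps, b_min=32, b_max=1024, ramp_frac=0.5):
    """Heuristic linear batch-size ramp (illustrative only).

    Grows the batch size linearly from b_min to b_max over the first
    ramp_frac of training, then holds it at b_max. Larger batches late
    in training increase parallelism per step, reducing serial runtime,
    while small early batches are thought to help optimization quality.
    """
    ramp_steps = int(total_steps * ramp_frac)
    if step >= ramp_steps:
        return b_max
    frac = step / ramp_steps
    return int(b_min + frac * (b_max - b_min))
```

The open question posed above is precisely whether heuristics of this shape (linear, exponential, or step-wise ramps, tuned by trial and error) are near-optimal, and what a theoretically grounded schedule would look like instead.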
References
However, to the best of our knowledge, the "batch ramp" schedules are not theoretically grounded and instead tuned heuristically. The lack of theoretical justification leaves open whether these heuristics are close to optimal, motivating the central question of our study: what is the optimal batch size schedule for minimizing serial runtime while not sacrificing performance?
— Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
(2510.14717 - Meterez et al., 16 Oct 2025) in Section 1 (Introduction)