Influence of learning rate scheduling on the Stage 1 vs. Stage 2 training gap
Ascertain how learning rate scheduling affects the observed narrowing of the bits-per-byte gap between runs that include Stage 1 subword-to-byte distillation and runs that begin directly with Stage 2 end-to-end training, and determine whether these results extrapolate to larger pretraining token budgets.
There are two main takeaways: (i) the 1B model benefits more from Stage 1 training than the 7B model, suggesting that larger models may be more robust to the catastrophic forgetting caused by large gradients at the start of training when starting directly with Stage 2; and (ii) the bits-per-byte gap narrows throughout the training trajectory but remains in favor of adding Stage 1. It is unclear how this behavior is shaped by the learning rate schedule, so we cannot easily extrapolate these results to larger token budgets.
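One reason extrapolation is hard: under a budget-dependent decay schedule (cosine is a common choice), the same training step sits at a different point of the decay depending on the total token budget, so a gap measured mid-run under one schedule need not transfer to a longer run. The following is a minimal illustrative sketch, not the schedule used in these experiments; the peak, floor, and warmup values are placeholders.

```python
import math

def cosine_lr(step, total_steps, peak=3e-4, floor=3e-5, warmup=100):
    """Cosine decay with linear warmup; all hyperparameters are illustrative."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

# At step 5,000 the learning rate differs substantially depending on
# whether the run was scheduled for 10k or 40k total steps, so the two
# runs follow different optimization trajectories from that point on.
for budget in (10_000, 40_000):
    print(budget, round(cosine_lr(5_000, budget), 6))
```

Because the gap between the Stage 1 and Stage 2-only runs is measured along these budget-dependent trajectories, a narrowing trend observed at one budget may partly reflect where each run sits in its decay rather than a property that persists at higher budgets.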