Cause of early generative frontier saturation at GPT-2 small scale
Determine whether the observed saturation of generative frontiers across training, specifically the near-equivalence of frontiers at 50,000 and 1,000,000 pretraining steps for the MDLM, Duo, and CANDI diffusion language models trained on OpenWebText at GPT-2 small scale (approximately 150M parameters), reflects a fundamental limitation of model capacity at this scale or a limitation of generative perplexity as an evaluation metric in capturing fine-grained differences in model capability.
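For concreteness, generative perplexity is typically computed by scoring unconditional samples from the model under evaluation with a fixed pretrained autoregressive evaluator. The sketch below is a minimal illustration of that procedure, not the exact setup used for these models: it assumes the samples have already been decoded to text, uses GPT-2 Large as the evaluator, and relies on the Hugging Face transformers API; the function name and evaluator choice are illustrative assumptions.

```python
# Minimal sketch of a generative-perplexity evaluation.
# Assumptions (not from the source): samples are decoded text strings,
# the evaluator is GPT-2 Large, and transformers/torch are installed.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def generative_perplexity(samples, evaluator_name="gpt2-large"):
    """Perplexity a fixed autoregressive evaluator assigns to generated samples."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(evaluator_name)
    model = AutoModelForCausalLM.from_pretrained(evaluator_name).to(device).eval()

    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in samples:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            n = ids.shape[1] - 1  # number of next-token predictions
            if n <= 0:
                continue  # skip samples too short to score
            # labels=input_ids yields the mean next-token cross-entropy
            out = model(ids, labels=ids)
            total_nll += out.loss.item() * n
            total_tokens += n

    # exp of the token-averaged negative log-likelihood over all samples
    return math.exp(total_nll / total_tokens)
```

Because the score depends only on how likely a fixed evaluator finds the generations, models that produce comparably fluent text can receive near-identical values, which is one way frontiers measured this way could saturate even if underlying capability continues to improve.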
References
We also note a potential limitation: while it is well established that longer training improves downstream-task performance, it is not clear whether the saturation we observe here reflects a fundamental limitation of model capacity at this scale or a limitation of generative perplexity in capturing fine-grained differences in model capability.