Cause of early generative frontier saturation at GPT-2 small scale

Determine whether the observed saturation of generative frontiers across training reflects a fundamental limitation of model capacity at this scale, or a limitation of generative perplexity as an evaluation metric that fails to capture fine-grained differences in model capability. The observation in question is the near-equivalence between frontiers at 50,000 and 1,000,000 pretraining steps for MDLM, Duo, and CANDI diffusion language models trained on OpenWebText at the GPT-2 small scale (approximately 150M parameters).

Background

The paper introduces generative frontier analysis as a principled evaluation method for diffusion LLMs, showing that single-point metrics like generative perplexity and unigram entropy can be misleading. Using frontiers, the authors empirically observe that models pretrained for only 50,000 steps produce frontiers that are surprisingly similar to those of fully trained (1,000,000-step) checkpoints for MDLM, Duo, and CANDI on OpenWebText.
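To make the quantities involved concrete, here is a minimal sketch, under stated assumptions, of how such a frontier could be traced for one checkpoint: samples are assumed to have already been drawn at several sampling-step budgets, generative perplexity is scored under a fixed GPT-2 large evaluator, and unigram entropy is computed over whitespace tokens. The evaluator choice, the tokenization, and the function names are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch (not the paper's code): trace a generative frontier by sweeping
# the sampling budget and scoring each batch of samples with (a) generative
# perplexity under a fixed evaluator LM and (b) unigram entropy.
# `samples_per_budget` maps a number of denoising steps to texts generated by
# the diffusion model under that budget; producing those samples is
# model-specific and omitted here.
import math
from collections import Counter

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


def unigram_entropy(texts: list[str]) -> float:
    """Shannon entropy (nats) of the unigram distribution over all samples.

    Whitespace tokenization is a simplification assumed for this sketch.
    """
    counts = Counter(tok for t in texts for tok in t.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


@torch.no_grad()
def generative_perplexity(texts: list[str], evaluator, tokenizer) -> float:
    """Perplexity of the samples under a fixed evaluator LM (here GPT-2 large)."""
    nll, n_tokens = 0.0, 0
    for t in texts:
        ids = tokenizer(t, return_tensors="pt", truncation=True).input_ids
        out = evaluator(ids, labels=ids)          # mean NLL over shifted tokens
        nll += out.loss.item() * (ids.shape[1] - 1)
        n_tokens += ids.shape[1] - 1
    return math.exp(nll / max(n_tokens, 1))


def generative_frontier(samples_per_budget: dict[int, list[str]]):
    """Return one (unigram entropy, generative perplexity) point per budget."""
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
    evaluator = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()
    return {
        steps: (unigram_entropy(texts),
                generative_perplexity(texts, evaluator, tokenizer))
        for steps, texts in samples_per_budget.items()
    }
```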

This raises an unresolved interpretive question: is the apparent early saturation of frontiers due to intrinsic capacity limits at the GPT-2 small scale, or to generative perplexity being insufficiently sensitive to fine-grained capability differences? The authors flag this ambiguity explicitly.
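One way to make "near-equivalence of frontiers" operational, again as a hedged sketch rather than the paper's procedure, is to interpolate generative perplexity as a function of unigram entropy for each checkpoint and average the gap over the shared entropy range; the `frontier_gap` helper below is hypothetical and takes frontiers in the format produced by the sketch above.

```python
# Hedged sketch (an illustrative check, not the paper's method): quantify how
# close two frontiers are, e.g. the 50k-step and 1M-step checkpoints of the
# same model, by interpolating gen-PPL as a function of unigram entropy and
# averaging the absolute gap over the shared entropy range.
import numpy as np


def frontier_gap(frontier_a: dict[int, tuple[float, float]],
                 frontier_b: dict[int, tuple[float, float]],
                 n_grid: int = 100) -> float:
    """Mean |gen-PPL| gap between two frontiers over their shared entropy range."""
    def as_curve(frontier):
        pts = sorted(frontier.values())           # sort points by entropy
        ent = np.array([p[0] for p in pts])
        ppl = np.array([p[1] for p in pts])
        return ent, ppl

    ent_a, ppl_a = as_curve(frontier_a)
    ent_b, ppl_b = as_curve(frontier_b)
    lo = max(ent_a.min(), ent_b.min())
    hi = min(ent_a.max(), ent_b.max())
    grid = np.linspace(lo, hi, n_grid)
    gap = np.abs(np.interp(grid, ent_a, ppl_a) - np.interp(grid, ent_b, ppl_b))
    return float(gap.mean())

# A small gap between the 50k-step and 1M-step frontiers is what "saturation"
# would look like under this assumed operationalization; by itself it cannot
# distinguish a capacity ceiling from an insensitive metric.
```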

References

We also note a potential limitation: it is a commonly known fact that training for longer improves performance on downstream tasks, and it is not clear whether the saturation we observe here reflects a fundamental limitation of model capacity at this scale, or a limitation of generative perplexity in capturing fine-grained differences in model capability.

Generative Frontiers: Why Evaluation Matters for Diffusion Language Models (2604.02718, Pynadath et al., 3 Apr 2026), Section 5 (Empirical Observations)