Presence of volume–data scaling structure in language modeling

Investigate whether the strong bias toward flat minima and the power-law relationship between the basin volume of minima and dataset size, both observed in image classification, also appear in language modeling, and characterize any similarities or differences across domains.

Background

For image classification, the authors observe a strong bias toward flat minima together with a power-law relationship between the volume of minima and dataset size. They speculate about possible connections to broader theories (e.g., the manifold hypothesis and neural scaling laws).
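Basin volume here refers to the size of the low-loss region surrounding a minimum. As a minimal illustration of how such a quantity might be probed, the sketch below estimates a Monte Carlo volume proxy: the fraction of random perturbations within a fixed radius that keep the loss near its minimum. The names `loss_fn` and `theta_star` are hypothetical stand-ins for a trained model's loss and its minimizer; this is not the paper's estimator.

```python
# Minimal sketch of a Monte Carlo basin-volume proxy (illustrative only).
# `loss_fn` and `theta_star` are hypothetical stand-ins for a trained
# model's loss function and its minimizer; not the paper's method.
import numpy as np

def log_volume_proxy(loss_fn, theta_star, radius, eps, n_samples=1000, seed=0):
    """Log-fraction of a ball of `radius` around `theta_star` in which
    the loss stays within `eps` of its value at the minimum.

    A larger fraction indicates a flatter, higher-volume basin.
    """
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    base = loss_fn(theta_star)
    hits = 0
    for _ in range(n_samples):
        # Uniform point in the ball: uniform direction, radius ~ r * U^(1/d).
        u = rng.normal(size=d)
        u *= radius * rng.random() ** (1.0 / d) / np.linalg.norm(u)
        if loss_fn(theta_star + u) <= base + eps:
            hits += 1
    return np.log(max(hits, 1) / n_samples)

# Toy quadratic losses: lower curvature -> flatter basin -> larger proxy.
if __name__ == "__main__":
    theta = np.zeros(3)
    for scale in (0.1, 1.0, 10.0):
        loss = lambda th, s=scale: 0.5 * s * np.sum(th**2)
        print(scale, log_volume_proxy(loss, theta, radius=1.0, eps=0.1))
```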

They explicitly state that it is unknown whether analogous structure exists in language modeling, motivating cross-domain analysis to understand the generality of volume-based explanations of generalization.
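A cross-domain study would need to test whether a power law of the form V(N) ∝ N^α, with V a basin-volume estimate and N the dataset size, also holds for language models. Since a power law is linear in log-log space, the exponent can be recovered by least squares on the logs; the sketch below illustrates that fitting step on synthetic data with an assumed exponent (all values are illustrative, not from the paper).

```python
# Minimal sketch: recovering a power-law exponent alpha in V(N) ∝ N^alpha
# by least squares in log-log space. The data is synthetic, generated
# with a known exponent purely to illustrate the fitting step.
import numpy as np

def fit_power_law(dataset_sizes, volumes):
    """Fit log V = alpha * log N + c; return (alpha, c)."""
    log_n = np.log(np.asarray(dataset_sizes, dtype=float))
    log_v = np.log(np.asarray(volumes, dtype=float))
    alpha, c = np.polyfit(log_n, log_v, deg=1)
    return alpha, c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sizes = np.array([1e4, 1e5, 1e6, 1e7])
    true_alpha = -0.5  # assumed exponent for this synthetic example
    vols = sizes**true_alpha * np.exp(rng.normal(0.0, 0.05, size=sizes.shape))
    alpha, _ = fit_power_law(sizes, vols)
    print(f"recovered exponent: {alpha:.3f} (true: {true_alpha})")
```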

References

It is unknown whether similar structure appears in domains such as language modeling.

Fan et al., "Sharp Minima Can Generalize: A Loss Landscape Perspective On Data," arXiv:2511.04808, 6 Nov 2025 (Conclusion).