Extend the linear-theory analysis to full transformer architectures
Develop a theoretical analysis of gradient-descent optimization under cross-entropy loss for full transformer architectures. Such an analysis would generalize the simplified linear feature-extractor model (f_θ(s) = θ^T s with a linear classifier W) used to study the phase evolution of representation geometry.
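To make the simplified setting concrete, the sketch below implements a linear feature extractor composed with a linear classifier, trained by full-batch gradient descent on cross-entropy loss. This is a minimal illustration of the analyzed model class, not the paper's experimental setup: the dimensions, synthetic data, learning rate, and step count are all hypothetical placeholders.

```python
# Minimal sketch (assumed setup, not the paper's configuration) of the
# simplified model: features f_theta(s) = theta^T s, logits = W^T f_theta(s),
# trained by full-batch gradient descent on cross-entropy loss.
import torch

d_in, d_feat, n_classes, n_samples = 64, 32, 10, 1024  # hypothetical sizes

torch.manual_seed(0)
s = torch.randn(n_samples, d_in)               # synthetic inputs s (one per row)
y = torch.randint(0, n_classes, (n_samples,))  # synthetic labels

# Leaf parameters: feature extractor theta and linear classifier W.
theta = (0.02 * torch.randn(d_in, d_feat)).requires_grad_()
W = (0.02 * torch.randn(d_feat, n_classes)).requires_grad_()

loss_fn = torch.nn.CrossEntropyLoss()
lr = 0.5  # hypothetical learning rate

for step in range(201):
    feats = s @ theta        # f_theta(s) = theta^T s, applied row-wise
    logits = feats @ W       # linear classifier on top of the features
    loss = loss_fn(logits, y)
    loss.backward()
    with torch.no_grad():    # plain gradient-descent update
        theta -= lr * theta.grad
        W -= lr * W.grad
        theta.grad.zero_()
        W.grad.zero_()
    if step % 50 == 0:
        print(f"step {step:3d}  cross-entropy {loss.item():.4f}")
```

With inputs stored as rows of s, the expression s @ theta computes θ^T s for each sample, so theta and W here play exactly the two roles named above: feature extractor and classifier.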
Our findings have several limitations: (i) computational constraints limited our analysis to models of up to 12B parameters, though the phases persist across scales from 160M to 12B; (ii) spectral-metric computation requires ∼10K samples and scales quadratically with the hidden dimension (see the sketch after this paragraph); (iii) our theoretical analysis assumes simplified linear feature extractors, leaving the extension to full transformer architectures as future work; and (iv) we focused on English-language LLMs trained with standard objectives, and whether similar phases emerge in multilingual or alternatively trained models remains unexplored.
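As an illustration of the quadratic scaling noted in (ii): spectral metrics are typically computed from the d × d covariance of hidden representations, so memory grows as d² and forming the covariance from n samples costs O(n·d²), with a further O(d³) eigendecomposition. The sketch below uses the effective rank of the covariance spectrum as a representative spectral metric; this particular metric, and all shapes shown, are assumptions for illustration rather than the paper's exact procedure.

```python
# Illustrative sketch of a spectral-metric computation and its cost. The
# covariance is a d x d matrix, hence the quadratic dependence on hidden
# dimension d. "Effective rank" is one common spectral metric; the paper's
# exact metric may differ.
import numpy as np

def effective_rank(hidden_states: np.ndarray) -> float:
    """hidden_states: (n_samples, d) array of representations (~10K samples)."""
    X = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    cov = (X.T @ X) / (X.shape[0] - 1)        # d x d covariance: O(n d^2) to form
    eigvals = np.linalg.eigvalsh(cov)         # O(d^3) symmetric eigendecomposition
    eigvals = np.clip(eigvals, 1e-12, None)   # guard against numerical negatives
    p = eigvals / eigvals.sum()               # normalized spectrum
    return float(np.exp(-(p * np.log(p)).sum()))  # exp of spectral entropy

# Hypothetical shapes: 10K samples, hidden dimension 768.
rng = np.random.default_rng(0)
H = rng.standard_normal((10_000, 768))
print(f"effective rank: {effective_rank(H):.1f}")
```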