The paper explores the use of sharpness measures to predict neural network generalization, with a focus on transformer architectures. Sharpness has long served as a proxy for generalization in simpler architectures such as MLPs and CNNs, where flatter minima tend to correlate with smaller gaps between training and test performance. For transformers, however, existing sharpness measures fail to yield meaningful correlations, a failure the paper attributes to the symmetries inherent in the attention mechanism.
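For context, the classical worst-case sharpness that this line of work builds on (a standard formulation; the paper may use an adaptive variant) measures the largest loss increase within a small Euclidean ball around the trained parameters $\theta$:

$$ S_\rho(\theta) \;=\; \max_{\|\epsilon\| \le \rho} \, L(\theta + \epsilon) \;-\; L(\theta), $$

so that a small value of $S_\rho$ corresponds to a flat minimum.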
Symmetries and Their Impacts
The authors argue that the main obstacle lies in the continuous parameter symmetries of transformers. Because many different parameter settings represent the same function, the loss is invariant under certain transformations of the parameter space, which makes naively computed sharpness values ambiguous. In particular, the attention mechanism exhibits higher-dimensional $\mathrm{GL}(h)$ symmetries: for example, the query and key projections of a head can be jointly transformed by any invertible $h \times h$ matrix without affecting the output function.
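A minimal numerical sketch of this invariance for the pre-softmax scores of a single attention head (this illustration is not from the paper; all names and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h, n = 16, 4, 5          # embedding dim, head dim, sequence length

X = rng.normal(size=(n, d_model))    # token representations
W_Q = rng.normal(size=(d_model, h))  # query projection
W_K = rng.normal(size=(d_model, h))  # key projection

def attention_scores(X, W_Q, W_K):
    # Pre-softmax attention logits of one head: (X W_Q)(X W_K)^T / sqrt(h)
    return (X @ W_Q) @ (X @ W_K).T / np.sqrt(h)

# Any invertible h x h matrix A defines a GL(h) symmetry:
# W_Q -> W_Q A and W_K -> W_K A^{-T} leave the scores unchanged.
A = rng.normal(size=(h, h)) + 3 * np.eye(h)   # well-conditioned, invertible
W_Q_t = W_Q @ A
W_K_t = W_K @ np.linalg.inv(A).T

assert np.allclose(attention_scores(X, W_Q, W_K),
                   attention_scores(X, W_Q_t, W_K_t))
```

The two parameter settings implement the same function yet can have very different local curvature, which is why a sharpness measure defined directly in parameter space can assign them arbitrarily different values.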
The Riemannian Approach
To overcome these challenges, the authors redefine sharpness on a quotient manifold that identifies parameters related by transformer symmetries. Using tools from Riemannian geometry, they introduce a geodesic sharpness measure on this quotient space, in which perturbations follow (approximate) geodesics rather than straight lines in parameter space. This corrected measure incorporates higher-order terms that simpler measures ignore, and approximating the geodesics in the quotient geometry is shown to recover a significant correlation with generalization.
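Schematically (a generic formulation of such a measure, not necessarily the paper's exact definition), geodesic sharpness replaces the Euclidean perturbation ball with a ball of geodesics emanating from $\theta$ on the quotient manifold:

$$ S^{\mathrm{geo}}_\rho(\theta) \;=\; \max_{\gamma(0)=\theta,\ \|\dot{\gamma}(0)\|_g \le \rho} \, L(\gamma(1)) \;-\; L(\theta), $$

where $\gamma$ is a geodesic of the quotient metric $g$ and is approximated in practice by a truncated expansion in the initial velocity $\dot{\gamma}(0)$, which is where the higher-order correction terms enter.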
Empirical Evaluation
The paper pairs its theoretical analysis with empirical evidence. On synthetic diagonal networks and on transformers trained on text and image data, the proposed geodesic sharpness correlates strongly with generalization in settings where traditional measures show weak or no correlation.
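As an illustration of how such correlation studies are typically scored (the paper's exact protocol may differ), one computes a rank correlation between a sharpness measure and the generalization gap across a pool of trained models; the values below are made up for the example:

```python
from scipy.stats import kendalltau

# Hypothetical per-model statistics from a pool of trained models:
# each sharpness value is paired with that model's generalization gap.
sharpness_values = [0.12, 0.45, 0.33, 0.80, 0.27]     # e.g., geodesic sharpness
generalization_gaps = [0.02, 0.09, 0.06, 0.15, 0.05]  # test loss - train loss

# Kendall's tau in [-1, 1]: a strong positive value means sharper models
# generalize worse, i.e., the measure is predictive of the gap.
tau, p_value = kendalltau(sharpness_values, generalization_gaps)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```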
Implications
The implications of this research are twofold. Practically, it provides a tool for predicting generalization more reliably in sophisticated architectures such as transformers. Theoretically, it opens new avenues for studying the interplay of symmetry and geometry in deep learning models. The authors suggest that these insights could inform symmetry-aware regularization during training and lead to more accurate predictors of generalization.
Future Prospects
Looking ahead, the authors speculate that further study of symmetry-induced curvature in parameter space could yield additional geometric insights that carry over to a broader range of neural architectures. Integrating this geometric understanding into optimization strategies may improve both performance and our understanding of model behavior under varying conditions.
This paper makes the case for taking Riemannian geometry and symmetry seriously in modern neural network architectures, offering a path toward more faithful generalization metrics and potentially inspiring new methodology in machine learning research. Its application of these principles to transformers shows how central such geometric considerations have become.