Flatness After All? (2506.17809v1)

Published 21 Jun 2025 in cs.LG and stat.ML

Abstract: Recent literature has examined the relationship between the curvature of the loss function at minima and generalization, mainly in the context of overparameterized networks. A key observation is that "flat" minima tend to generalize better than "sharp" minima. While this idea is supported by empirical evidence, it has also been shown that deep networks can generalize even with arbitrary sharpness, as measured by either the trace or the spectral norm of the Hessian. In this paper, we argue that generalization could be assessed by measuring flatness using a soft rank measure of the Hessian. We show that when the common neural network model (neural network with exponential family negative log likelihood loss) is calibrated, and its prediction error and its confidence in the prediction are not correlated with the first and the second derivatives of the network's output, our measure accurately captures the asymptotic expected generalization gap. For non-calibrated models, we connect our flatness measure to the well-known Takeuchi Information Criterion and show that it still provides reliable estimates of generalization gaps for models that are not overly confident. Experimental results indicate that our approach offers a robust estimate of the generalization gap compared to baselines.

Summary

  • The paper introduces the soft rank of the Hessian as a novel, theoretically grounded measure of flatness that robustly predicts generalization in overparameterized neural networks, outperforming traditional metrics.
  • It provides theoretical justification, proving that for calibrated models, the soft rank exactly characterizes the asymptotic expected generalization gap under Tikhonov regularization, connecting it to information criteria.
  • Extensive experiments demonstrate the soft rank's strong correlation (Kendall τ ≈ 0.84) with the generalization gap, enabling its use for model selection, hyperparameter tuning, and early stopping with efficient approximations.

Flatness After All? — A Critical Analysis

The paper "Flatness After All?" (2506.17809) revisits the longstanding debate on the relationship between the curvature of the loss landscape at minima—commonly referred to as "flatness" or "sharpness"—and the generalization properties of overparameterized neural networks. The authors challenge prevailing metrics and propose a theoretically grounded alternative, the soft rank of the Hessian, as a robust measure of flatness that more accurately predicts generalization gaps under realistic training conditions.

Context and Motivation

Empirical and theoretical studies have often linked flat minima to improved generalization in deep learning. Traditional measures, such as the spectral norm or trace of the Hessian, have been widely used to quantify flatness. However, recent work has demonstrated that these metrics can be misleading: deep networks can generalize well even at sharp minima, and the correlation between Hessian-based sharpness and generalization gap is inconsistent across architectures and datasets. Moreover, the invariance of these measures under reparametrization is questionable, and their connection to established statistical criteria (e.g., AIC, TIC) is not straightforward.

Main Contributions

The authors make several key contributions:

  1. Soft Rank as a Flatness Measure: They introduce the soft rank of the Hessian, defined as

rank_λ(H) = Tr(H (H + λI)^(-1)),

where λ is the weight decay coefficient. This measure interpolates between the rank of the Hessian (as λ → 0) and trace-based flatness measures (for large λ), providing a continuous, monotonic, and concave functional on the positive semidefinite cone. A minimal computational sketch follows this list.

  2. Theoretical Justification: For neural networks trained with exponential family negative log-likelihood loss, the authors prove that if the model is calibrated and the prediction error/confidence is uncorrelated with the first and second derivatives of the output, the soft rank exactly characterizes the asymptotic expected generalization gap. This result holds under Tikhonov (ridge) regularization and is robust to the choice of λ.
  3. Extension to Non-Calibrated Models: For non-calibrated models, the generalization gap is shown to be governed by both the soft rank and a ratio involving the trace of the gradient covariance and the Fisher information. The authors connect this to the Takeuchi Information Criterion (TIC), clarifying when and why previous trace-based heuristics fail.
  4. Empirical Validation: Extensive experiments on MNIST, CIFAR-10, and SVHN with various architectures and hyperparameters demonstrate that the soft rank, even when estimated from training data, correlates strongly with the generalization gap (Kendall τ ≈ 0.84). In contrast, the trace and spectral norm of the Hessian are much less predictive.
  5. Efficient Estimation: The paper addresses the computational cost of estimating the soft rank for large models by leveraging Kronecker-Factored Approximate Curvature (KFAC) and diagonal approximations. Theoretical results guarantee that these approximations provide upper bounds on the true soft rank, and empirical results confirm their practical utility.
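
To make these quantities concrete, here is a minimal NumPy sketch (not the authors' code): it evaluates the soft rank of contribution 1 from a Hessian eigendecomposition, the diagonal surrogate mentioned in contribution 5, and, for comparison, the classical TIC penalty Tr(J H^(-1)) that contribution 3 refers to. Function names, the toy eigenvalues, and the jitter term are illustrative assumptions, not the paper's estimators.

```python
import numpy as np

def soft_rank(eigvals, lam):
    """Soft rank rank_lambda(H) = Tr(H (H + lam*I)^(-1)), from the eigenvalues of H."""
    s = np.clip(np.asarray(eigvals, dtype=float), 0.0, None)  # H assumed PSD; clip numerical noise
    return float(np.sum(s / (s + lam)))

def soft_rank_diag(hess_diag, lam):
    """Cheap surrogate using only the diagonal of H (the paper argues such
    structured approximations upper-bound the exact soft rank)."""
    h = np.clip(np.asarray(hess_diag, dtype=float), 0.0, None)
    return float(np.sum(h / (h + lam)))

def tic_penalty(per_sample_grads, hessian, jitter=1e-8):
    """Classical Takeuchi penalty Tr(J H^(-1)), with J the empirical gradient
    (score) covariance; shown only to illustrate the quantity the paper relates
    the soft rank to, not the paper's exact non-calibrated correction."""
    g = np.asarray(per_sample_grads, dtype=float)            # shape (n, d)
    J = g.T @ g / g.shape[0]                                 # empirical score covariance
    H = np.asarray(hessian, dtype=float) + jitter * np.eye(g.shape[1])
    return float(np.trace(np.linalg.solve(H, J)))            # Tr(H^{-1} J) = Tr(J H^{-1})

# Toy illustration: the soft rank moves from rank(H) (small lambda) toward ~Tr(H)/lambda (large lambda).
eigs = np.array([5.0, 1.0, 0.1, 0.0])
for lam in (1e-6, 1e-1, 1e2):
    print(f"lambda={lam:g}  soft rank={soft_rank(eigs, lam):.4f}")
```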

Numerical Results and Claims

  • Strong Correlation: The soft rank of the Fisher Information Matrix (FIM) computed on training data achieves a Kendall τ of 0.84 with the generalization gap, outperforming the log-determinant (τ = 0.82), trace (τ = 0.28), and spectral norm (τ = 0.08).
  • Robustness to Overfitting: The soft rank remains a reliable predictor of the generalization gap except in extreme overfitting regimes, where the error-uncertainty ratio collapses.
  • Approximation Quality: KFAC-based soft rank estimators maintain high correlation with the true gap while being computationally tractable for modern architectures.

Theoretical and Practical Implications

Theoretical

  • Reconciliation of Contradictory Results: The analysis clarifies why sharp minima can generalize in the absence of regularization (as in Dinh et al., 2017), but not when weight decay is present. The soft rank is invariant to layerwise rescaling that leaves the loss unchanged but alters the Hessian's spectrum.
  • Geometry and Invariance: The paper emphasizes the importance of the underlying metric (inner product) in defining curvature. The soft rank, as a monotonic concave trace function, is more robust to reparametrization and aligns with information-geometric perspectives.
  • Connection to Bayesian Complexity: The soft rank is shown to be closely related to Bayesian marginal likelihood complexity terms (e.g., log-determinant of the regularized Hessian), both being concave trace functions.
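
To make this connection explicit (a standard spectral identity, not a result quoted from the paper's proofs): if σ₁, …, σ_d are the eigenvalues of H, then

rank_λ(H) = Σᵢ σᵢ / (σᵢ + λ)  and  log det(H + λI) = Σᵢ log(σᵢ + λ),

so both quantities are sums of concave, monotone functions of the spectrum, which is the precise sense in which the soft rank and the Bayesian log-determinant complexity term belong to the same family of curvature measures.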

Practical

  • Model Selection and Early Stopping: The soft rank can be used as a complexity regularizer or as a criterion for early stopping (a hypothetical sketch follows this list), especially in regimes where cross-validation is expensive or infeasible.
  • Hyperparameter Tuning: The measure provides a theoretically justified alternative to cross-validation for selecting weight decay and other regularization parameters.
  • Efficient Implementation: The KFAC and diagonal approximations enable scalable estimation of the soft rank in large-scale deep learning, making the approach practical for real-world applications.
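
As a rough illustration of the early-stopping use case (a hypothetical sketch, not the paper's protocol): combine the training loss with a soft-rank penalty and keep the checkpoint that minimizes the penalized score. The 1/n scaling, the λ value, and the toy `training_trace` are assumptions for illustration; in practice the eigenvalues would come from a KFAC or diagonal curvature estimate.

```python
import numpy as np

def soft_rank(eigvals, lam):
    """Soft rank Tr(H (H + lam*I)^(-1)) from Hessian eigenvalues (as in the earlier sketch)."""
    s = np.clip(np.asarray(eigvals, dtype=float), 0.0, None)
    return float(np.sum(s / (s + lam)))

def penalized_score(train_loss, hessian_eigvals, lam, n_train):
    """Hypothetical criterion: training loss plus a soft-rank complexity term
    scaled by 1/n_train, in the spirit of information criteria; the paper's
    exact scaling is not reproduced here."""
    return train_loss + soft_rank(hessian_eigvals, lam) / n_train

# Early-stopping sketch over a toy (loss, Hessian-eigenvalue) trace.
training_trace = [
    (0.90, np.array([1.0, 0.5, 0.1])),   # epoch 0: high loss, low curvature
    (0.40, np.array([4.0, 2.0, 0.8])),   # epoch 1
    (0.15, np.array([9.0, 6.0, 3.0])),   # epoch 2: low loss, higher soft rank
]
best_score, best_epoch = np.inf, None
for epoch, (loss, eigs) in enumerate(training_trace):
    score = penalized_score(loss, eigs, lam=1e-1, n_train=1_000)
    if score < best_score:
        best_score, best_epoch = score, epoch
print(f"best epoch by penalized score: {best_epoch}")
```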

Limitations and Open Questions

  • Calibration and Overfitting: The predictive power of the soft rank diminishes in highly overfitted or poorly calibrated models, as the error-uncertainty ratio becomes unreliable. The paper provides diagnostic tools to detect such regimes.
  • Extension Beyond Exponential Family: The theoretical results are derived for exponential family losses; generalization to arbitrary loss functions remains an open direction.
  • Interaction with Other Regularizers: The analysis focuses on Tikhonov regularization; the behavior under other forms of regularization (e.g., dropout, batch norm) warrants further study.

Future Directions

  • Integration with Training Algorithms: Incorporating soft rank-based regularization or monitoring into optimizers could yield improved generalization and robustness.
  • Broader Applicability: Extending the framework to unsupervised, self-supervised, and reinforcement learning settings, where calibration and uncertainty estimation are more challenging, is a natural next step.
  • Information-Geometric Optimization: The connection to information geometry suggests potential for new optimization algorithms that explicitly exploit the soft rank or related metrics.

Conclusion

This work provides a rigorous and practically relevant framework for understanding and quantifying flatness in deep learning. By grounding the notion of flatness in the soft rank of the Hessian and connecting it to both classical information criteria and Bayesian complexity, the paper advances both the theory and practice of generalization in overparameterized models. The results have immediate implications for model selection, regularization, and the design of scalable training procedures, and open new avenues for research at the intersection of geometry, statistics, and deep learning.