Sharpness Dimension (SD)
- SD is a metric that quantifies the effective fractal dimension of random attractors in stochastic optimization, capturing the interplay of expansion and contraction via the full Hessian spectrum.
- It uses one-step analogues of Lyapunov exponents to balance expansion and contraction along ordered singular-value directions, offering a spectrum-sensitive alternative to traditional sharpness or norm-based measures.
- Empirical studies show that SD robustly correlates with generalization gaps across various neural network architectures and can guide optimization strategies like SAM.
The Sharpness Dimension (SD) is a metric that quantifies the effective fractal dimension of the discrete-time random attractor encountered in stochastic optimization of modern neural networks, particularly in regimes referred to as the "edge of stability" (EoS). SD provides a principled, spectrum-sensitive alternative to classical pointwise sharpness or norm-based measures, capturing the complexity introduced by the full Hessian spectrum and the associated dynamics of stochastic optimizers. Its development is motivated by observed phenomena where large learning rates induce chaotic yet generalization-promoting optimization trajectories, with iterates converging not to a single point but to a measure-supported random set of lower intrinsic dimension (Tuci et al., 21 Apr 2026, Luo et al., 22 Sep 2025).
1. Formal Definition and Mathematical Structure
Let $\Lambda_1 \ge \Lambda_2 \ge \dots \ge \Lambda_d$ denote the ordered RDS (Random Dynamical Systems) sharpness exponents, the exponent of order $k$ being defined by
$$\Lambda_k = \mathbb{E}\big[\log \sigma_k\big(D\Phi(\theta, \omega)\big)\big],$$
where $\sigma_1 \ge \dots \ge \sigma_d$ are the singular values of the Jacobian $D\Phi$ of the one-step update map at $\theta$ over the random attractor $\mathcal{A}(\omega)$. The critical index is
$$k^* = \max\Big\{\, k : \sum_{i=1}^{k} \Lambda_i \ge 0 \,\Big\}$$
(with $k^* = 0$ if $\Lambda_1 < 0$). The Sharpness Dimension is then
$$\mathrm{SD} = k^* + \frac{1}{|\Lambda_{k^*+1}|}\sum_{i=1}^{k^*} \Lambda_i.$$
This definition interpolates between integer and fractional values depending on the balance of expansion and contraction rates along ordered singular value directions (Tuci et al., 21 Apr 2026).
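To make the formula concrete, the following minimal sketch computes SD from a vector of one-step exponents $\Lambda_1 \ge \dots \ge \Lambda_d$; the function name and the toy exponent values are illustrative assumptions, not code from the cited work.

```python
import numpy as np

def sharpness_dimension(exponents) -> float:
    """Kaplan-Yorke-style dimension from one-step RDS exponents (illustrative sketch)."""
    lam = np.sort(np.asarray(exponents, dtype=float))[::-1]  # Lambda_1 >= ... >= Lambda_d
    if lam[0] < 0:                        # no expanding direction: k* = 0, SD = 0
        return 0.0
    cums = np.cumsum(lam)
    if cums[-1] >= 0:                     # every partial sum is non-negative: SD = d
        return float(len(lam))
    k_star = int(np.max(np.nonzero(cums >= 0)[0])) + 1   # largest k with partial sum >= 0
    return k_star + cums[k_star - 1] / abs(lam[k_star])  # fractional correction

# Toy example: two (net) expanding and two contracting directions.
print(sharpness_dimension([0.30, 0.10, -0.50, -0.60]))   # 2 + 0.40/0.50 = 2.8
```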
2. Theoretical Motivation and Connection to Dynamical Systems
SGD and similar stochastic optimizers are modeled as discrete-time random dynamical systems (RDS), where the stochasticity from data mini-batching or label noise introduces randomness through the noise history $\omega$. The parameter trajectory does not converge to a fixed point, but instead accumulates on a compact random set $\mathcal{A}(\omega)$, the so-called pullback attractor.
The structure of $\mathcal{A}(\omega)$ is fractal-like under global dissipativity and local EoS instability. Motivated by the Kaplan–Yorke (Lyapunov) dimension in deterministic systems, SD employs one-step analogues of Lyapunov exponents (expected logarithms of singular values) to balance partial expansion and contraction across dimensions. The largest $k$ such that $\sum_{i=1}^{k} \Lambda_i \ge 0$ determines the integer part of SD, and the fractional correction measures residual partial expansion in the next direction (Tuci et al., 21 Apr 2026).
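As a sketch of how the one-step exponents might be estimated, the snippet below averages log singular values of the Jacobian of a noisy gradient-descent update, $J = I - \eta H$, over sampled noise realizations; the toy Hessian, the noise model, and all parameter values are assumptions for illustration, not the estimator used in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, n_draws = 10, 0.4, 500

H_mean = np.diag(np.linspace(0.1, 5.5, d))   # largest eigenvalue 5.5 exceeds 2/eta = 5 (EoS-like)

def sampled_hessian():
    """Mean Hessian plus symmetric noise, mimicking mini-batch fluctuations."""
    noise = 0.05 * rng.standard_normal((d, d))
    return H_mean + (noise + noise.T) / 2

# One-step map Phi(theta) = theta - eta * grad L(theta); near a critical point its
# Jacobian is J = I - eta * H, so the exponents are averaged log singular values of J.
log_sigma_sum = np.zeros(d)
for _ in range(n_draws):
    J = np.eye(d) - eta * sampled_hessian()
    log_sigma_sum += np.log(np.linalg.svd(J, compute_uv=False))
exponents = log_sigma_sum / n_draws          # ordered Lambda_1 >= ... >= Lambda_d

print(exponents)   # leading exponent is positive, signalling local expansion; these
                   # values can be passed to a sharpness_dimension(...) routine as above.
```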
In the context of sharpness-aware minimization (SAM) and related algorithms, m-sharpness, arising from the stochastic gradient noise covariance, can be interpreted as an instance of SD in the SDE limit, where sharpness regularization per noise "dimension" is governed by $1/m$ scaling for micro-batch size $m$ (Luo et al., 22 Sep 2025).
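The $1/m$ scaling itself is the familiar variance reduction from averaging $m$ per-example gradients; a minimal simulation with synthetic per-example gradients (all names and values below are illustrative assumptions) shows the trace of the micro-batch gradient-noise covariance shrinking roughly in proportion to $1/m$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_examples = 50, 4096
per_example_grads = rng.standard_normal((n_examples, d))   # synthetic per-example gradients

def noise_cov_trace(m: int, n_batches: int = 2000) -> float:
    """Trace of the covariance of size-m micro-batch averaged gradients."""
    idx = rng.integers(0, n_examples, size=(n_batches, m))
    batch_grads = per_example_grads[idx].mean(axis=1)       # shape (n_batches, d)
    return float(np.trace(np.cov(batch_grads, rowvar=False)))

for m in (1, 4, 16, 64):
    print(m, round(noise_cov_trace(m), 3))   # roughly d/m = 50/m here
```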
3. Generalization Bounds and Statistical Guarantees
Under bounded, Lipschitz-continuous losses and regularity assumptions on the attractor and noise, a generalization bound holds with high probability (over both dataset and noise draws): the worst-case generalization gap over the attractor $\mathcal{A}(\omega)$ is controlled by the Sharpness Dimension together with a uniform loss bound $B$, the Lipschitz constant $L$, the total mutual information $I(\mathcal{A}(\omega); S)$ between the attractor and the dataset $S$, and the sample size $n$. The effective complexity of the attractor, as captured by SD, directly controls the generalization error, in contrast to classical sharpness metrics that disregard attractor geometry (Tuci et al., 21 Apr 2026).
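For orientation only, the classical mutual-information generalization bound that results of this type build on (the standard bound for $\sigma$-sub-Gaussian losses, not the SD-specific statement of the cited paper) reads
$$\Big|\,\mathbb{E}\big[R(W) - \hat{R}_S(W)\big]\,\Big| \;\le\; \sqrt{\frac{2\sigma^2\, I(W; S)}{n}},$$
where $W$ denotes the learned parameters. The SD-based result replaces the single parameter vector with the random attractor $\mathcal{A}(\omega)$ and incorporates SD as the attractor's effective complexity; its exact constants and form are given in the cited work.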
4. Role of the Full Hessian Spectrum
SD requires sensitivity to the complete Hessian spectrum at the attractor. Specifically, the definition is based on partial-volume growth rates, given by products of leading singular values $\prod_{i=1}^{k} \sigma_i$ at each step. This quantifies the maximal expansion of $k$-dimensional volumes and incurs a fractional correction from contraction in the critical direction.
In contrast,
- The trace of the Hessian aggregates eigenvalues indiscriminately, failing to distinguish whether sharpness is due to a few large or many moderate eigenvalues.
- The spectral norm (largest eigenvalue) detects only the most unstable direction, without measuring expansion/contraction in higher-dimensional volumes.
- SD, by incorporating all principal minors (partial determinants), identifies the precise transition between expansion and contraction, defining the effective (fractal) dimension of the attractor (Tuci et al., 21 Apr 2026).
The table below summarizes metric distinctions:
| Metric | Spectrum Sensitivity | Captures Attractor Geometry |
|---|---|---|
| Trace of Hessian | No | No |
| Spectral Norm | No | No |
| SD (Sharpness Dimension) | Yes (full spectrum) | Yes |
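As a concrete illustration of the distinctions in the table, consider full-batch gradient descent near a critical point, where the one-step Jacobian is $I - \eta H$; the two spectra and the step size below are invented for illustration.

```python
import numpy as np

eta = 0.4                                     # step size; the EoS threshold is 2/eta = 5
spec_a = np.array([5.5, 0.5, 0.5, 0.5])       # one eigenvalue beyond 2/eta
spec_b = np.array([1.75, 1.75, 1.75, 1.75])   # same trace (7.0), all directions stable

for name, spec in (("A", spec_a), ("B", spec_b)):
    # For a symmetric Hessian, the singular values of I - eta*H are |1 - eta*lambda_i|.
    exponents = np.sort(np.log(np.abs(1.0 - eta * spec)))[::-1]
    print(name, "trace:", spec.sum(), "top eig:", spec.max(),
          "leading exponent:", round(float(exponents[0]), 3))

# Both spectra share the same Hessian trace, but only A has a positive leading
# exponent (expansion along one direction), hence only A yields a nontrivial SD.
```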
5. Empirical Findings Across Architectures and Regimes
Extensive empirical investigation supports the utility of SD:
- On small multilayer perceptrons (MLPs) trained on MNIST, SD and its variants (SD-SLQ) exhibited the highest correlation (Kendall $\tau$) with generalization and loss gaps when varying learning rate and batch size. Neither the Hessian trace nor the top eigenvalue tracked generalization at EoS (Tuci et al., 21 Apr 2026).
- On larger MLPs (e.g., 278,800 parameters), SD-SLQ outperformed all other indicators on hyperparameter grids; geometry-based measures were competitive only on single-hyperparameter sweeps.
- In modular arithmetic "grokking" experiments, SD exhibited sharp phase transitions precisely at the delayed test accuracy onset, where neither Hessian trace nor top sharpness captured the effect.
- On GPT-2 (124M parameters, WikiText-2), SD-based indicators correlated consistently with loss gaps across SGD, SGD+momentum, and AdamW. Plain sharpness or trace could even correlate negatively under AdamW.
Moreover, in the context of SAM and its m-sharpness generalization, the variance-based sharpness regularization term scales as $1/m$, and decreasing the micro-batch size $m$ strengthens effective flatness, improving generalization. Reweighted SAM provides a practical approximation using importance-weighted sampling (Luo et al., 22 Sep 2025).
6. Methodological and Practical Implications
SD serves as a robust measure for tracking generalization complexity in regimes characterized by nontrivial attractor geometry. Its reliance on the full Hessian spectrum enables sensitivity to optimization-induced fractality not captured by pointwise or norm-based metrics. For large-scale models, efficient approximations such as stochastic Lanczos quadrature (SLQ) or kernel density estimation of the spectral density allow practical evaluation of SD (Tuci et al., 21 Apr 2026).
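A minimal sketch of the stochastic Lanczos quadrature idea referenced above, estimating a spectral density from matrix-vector (e.g., Hessian-vector) products with random probes; the matrix-free interface, parameter choices, and the toy operator are assumptions for illustration, not the implementation used in the cited work.

```python
import numpy as np

def lanczos(matvec, dim, steps, rng):
    """Plain Lanczos (no reorthogonalization): returns tridiagonal coefficients."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(dim)
    alphas, betas, beta = [], [], 0.0
    for _ in range(steps):
        w = matvec(v) - beta * v_prev
        alpha = float(v @ w)
        w = w - alpha * v
        beta = float(np.linalg.norm(w))
        alphas.append(alpha)
        if beta < 1e-10:
            break
        betas.append(beta)
        v_prev, v = v, w / beta
    return np.array(alphas), np.array(betas[: len(alphas) - 1])

def slq_spectral_density(matvec, dim, steps=30, probes=10, seed=0):
    """Ritz values (nodes) and weights approximating the operator's spectral density."""
    rng = np.random.default_rng(seed)
    nodes, weights = [], []
    for _ in range(probes):
        a, b = lanczos(matvec, dim, steps, rng)
        T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
        evals, evecs = np.linalg.eigh(T)
        nodes.append(evals)
        weights.append(evecs[0] ** 2 / probes)   # first-component quadrature weights
    return np.concatenate(nodes), np.concatenate(weights)

# Toy usage: an explicit PSD matrix stands in for a Hessian accessed via HVPs.
rng = np.random.default_rng(3)
A = rng.standard_normal((200, 200))
H = A @ A.T / 200
nodes, weights = slq_spectral_density(lambda v: H @ v, dim=200)
print("SLQ trace estimate:", round(200 * float(nodes @ weights), 1),
      " exact trace:", round(float(np.trace(H)), 1))
```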
Guidelines for practice include:
- For SAM-like algorithms, select the smallest micro-batch size $m$ compatible with hardware constraints to amplify variance-based regularization, effectively increasing flatness in the sense of SD (Luo et al., 22 Sep 2025).
- Estimate per-sample gradient norms using stochastic finite differences for scalable importance sampling in Reweighted SAM.
- Empirically monitor the trace of gradient noise covariance as a proxy for sharpness regularization strength.
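A minimal sketch of the last guideline, using a running estimator of $\operatorname{tr}\big(\mathrm{Cov}(g)\big) = \mathbb{E}\|g\|^2 - \|\mathbb{E} g\|^2$ over micro-batch gradients observed during training; the class name and the synthetic gradients are assumptions for illustration.

```python
import numpy as np

class NoiseCovTraceMonitor:
    """Running estimate of trace(Cov(g)) from a stream of micro-batch gradients."""
    def __init__(self, dim: int):
        self.n = 0
        self.sum_g = np.zeros(dim)
        self.sum_sq_norm = 0.0

    def update(self, grad: np.ndarray) -> None:
        self.n += 1
        self.sum_g += grad
        self.sum_sq_norm += float(grad @ grad)

    def trace(self) -> float:
        mean_g = self.sum_g / self.n
        return self.sum_sq_norm / self.n - float(mean_g @ mean_g)

# Toy usage with synthetic micro-batch gradients (stand-ins for real training gradients).
rng = np.random.default_rng(2)
monitor = NoiseCovTraceMonitor(dim=100)
for _ in range(500):
    monitor.update(rng.normal(loc=0.1, scale=0.05, size=100))
print(round(monitor.trace(), 4))   # roughly 100 * 0.05**2 = 0.25
```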
7. Broader Significance and Future Directions
The conceptualization of SD anchors generalization analysis in high-dimensional, noisy, and chaotic regimes beyond classical point-based minima. By connecting stochastic optimization in deep learning to fractal random attractors and Lyapunov dimension theory, SD provides a unifying language to describe the interplay between optimizer hyperparameters, attractor geometry, and generalization outcomes. This framework motivates further investigation into RDS-based complexity measures and their interaction with explicit regularization, optimization scale, and neural architecture (Tuci et al., 21 Apr 2026, Luo et al., 22 Sep 2025).