Critical Sharpness in Deep Learning & Percolation

Updated 27 January 2026
  • Critical sharpness is the threshold at which deep learning models shift from stable optimization to oscillatory behavior, and at which percolation systems undergo a rapid phase transition from non-percolation to percolation.
  • It is quantified in deep learning by the maximum Hessian eigenvalue and its scalable surrogate λc, and in percolation theory by sharp threshold inequalities that indicate abrupt connectivity changes.
  • Understanding critical sharpness enables practitioners to adjust learning rates, control stability in neural networks, and interpret phase transitions in statistical physics.

Critical sharpness refers to distinct threshold phenomena in both statistical physics (percolation theory) and deep learning theory, where a system transitions sharply from one regime to another: in percolation, from non-percolating to percolating phases; in neural networks, from stable optimization to edge-of-stability dynamics. Although originating from different mathematical and empirical traditions, both share geometric and statistical underpinnings rooted in the abrupt change of connectivity or training dynamics at critical parameter values.

1. Critical Sharpness in Deep Learning: Definitions and Quantitative Frameworks

The central object in the deep learning context is the sharpness of a minimum $\theta$ of the loss landscape, classically quantified by the maximum eigenvalue of the Hessian, $\lambda_{\max}(\nabla^2 L(\theta))$, which measures local curvature. Critical sharpness demarcates the onset of instability in iterative optimization, particularly for first-order methods such as (stochastic) gradient descent (SGD).

For full-batch gradient descent with fixed step size $\eta$, the classical quadratic stability analysis yields convergence of all modes if and only if $\eta\,\lambda_j < 2$ for every Hessian eigenvalue $\lambda_j$, i.e., $\lambda_{\max} < 2/\eta$. Empirical studies confirm that, during training, "progressive sharpening" occurs until this threshold is reached, at which point the Hessian spectrum plateaus and the optimization transitions to a non-monotonic, oscillatory regime, termed the "edge of stability" (EOS). Thus, the critical sharpness is precisely $S_c = 2/\eta$ (Roulet et al., 2023, Yoo et al., 7 Jun 2025).
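The $2/\eta$ threshold can be checked numerically on a one-dimensional quadratic, where each gradient step multiplies the iterate by $(1 - \eta\lambda)$, so the loss contracts exactly when $\eta\lambda < 2$. A minimal illustrative sketch (function and parameter names are hypothetical):

```python
def gd_on_quadratic(lam, eta, steps=100, x0=1.0):
    """Run gradient descent on L(x) = 0.5 * lam * x**2 (gradient lam * x).

    Each step multiplies the iterate by (1 - eta * lam), so the loss
    shrinks iff |1 - eta * lam| < 1, i.e. iff eta * lam < 2.
    """
    x = x0
    for _ in range(steps):
        x -= eta * lam * x
    return 0.5 * lam * x ** 2

eta = 0.1                                       # critical sharpness S_c = 2 / eta = 20
converged = gd_on_quadratic(lam=19.0, eta=eta)  # lam < S_c: loss shrinks to ~0
diverged = gd_on_quadratic(lam=21.0, eta=eta)   # lam > S_c: loss oscillates and blows up
```

Just below the threshold the per-step factor is $-0.9$ (damped oscillation); just above it is $-1.1$ (growing oscillation), matching the EOS picture of oscillatory instability rather than monotone divergence.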

For large-scale deep networks, direct estimation of $\lambda_{\max}$ is impractical. The "critical sharpness" measure $\lambda_c$ is introduced as an efficient surrogate, formally defined as:

$$\eta_c = \min \{ \eta > 0 : L(\theta - \eta\,\Delta\theta) > L(\theta) \}, \qquad \lambda_c = \frac{2}{\eta_c}$$

Here, $\Delta\theta$ is the update direction, e.g., the gradient or the AdamW-preconditioned direction. $\lambda_c$ approximates the threshold beyond which the loss increases along $\Delta\theta$ and is computable via a handful of forward loss evaluations, making it scalable to large models (Kalra et al., 23 Jan 2026).
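One way such a surrogate could be estimated is by bisecting over step sizes using only forward loss evaluations; the following is a sketch under that assumption (the paper's exact search procedure may differ, and all names here are hypothetical):

```python
import numpy as np

def critical_sharpness(loss_fn, theta, delta, eta_max=1e3, iters=40):
    """Bisect for the smallest step eta along direction `delta` at which
    the loss exceeds its current value, and return lambda_c = 2 / eta_c.

    Uses only forward loss evaluations, no Hessian-vector products.
    """
    base = loss_fn(theta)
    if loss_fn(theta - eta_max * delta) <= base:
        return 0.0  # no instability detected up to eta_max
    lo, hi = 0.0, eta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if loss_fn(theta - mid * delta) > base:
            hi = mid   # loss already rises at mid: eta_c <= mid
        else:
            lo = mid   # loss still decreases at mid: eta_c > mid
    return 2.0 / hi

# Sanity check on a quadratic with lambda_max = 10: along the gradient
# direction from theta = (1, 1), the loss rises for eta > 202/1001,
# so lambda_c = 2002/202, close to lambda_max.
H = np.diag([10.0, 1.0])
loss = lambda x: 0.5 * x @ H @ x
theta = np.array([1.0, 1.0])
lam_c = critical_sharpness(loss, theta, H @ theta)
```

On an exact quadratic, $\lambda_c$ slightly underestimates $\lambda_{\max}$ because the search direction mixes eigenmodes; the bisection cost is a fixed, small number of forward passes regardless of model size.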

2. Dynamics and Empirical Manifestations of Critical Sharpness

Deep networks trained with non-adaptive (fixed) step sizes universally exhibit progressive sharpening: the leading Hessian eigenvalue increases with training until plateauing at the critical value $2/\eta$, marking the EOS regime where optimization oscillates around this sharpness. Stochastic optimization and the internal structure of the data or network affect the specific trajectory but not the presence of EOS.

Minibatch size, learning rate, data difficulty, and network depth all modulate critical sharpness:

  • Learning Rate $\eta$: $S_c = 2/\eta$, so a higher $\eta$ yields a lower critical sharpness threshold.
  • Depth $D$: Deeper networks have increased limiting sharpness, amplifying the progressive sharpening effect.
  • Batch Size $B$ / Stochasticity: Larger batch sizes or reduced gradient noise produce higher critical sharpness at convergence; SGD noise lowers the observed value, as explained by modifications to the layer imbalance dynamics (Yoo et al., 7 Jun 2025).
  • Dataset Difficulty $Q$: A larger $Q$ (sum of squared label-to-feature ratios) correlates with higher ultimate sharpness at convergence.

These effects are robustly reproduced in both full networks and minimalist linear models, with closed-form expressions for predicted sharpness matching empirical observations (Yoo et al., 7 Jun 2025).

3. Methodologies for Measuring Critical Sharpness and Scale-Invariance

Several methodologies exist for quantifying and interpreting sharpness and its critical regime:

  • Largest Hessian Eigenvalue: The direct but computationally expensive approach to measuring sharpness, revealing EOS dynamics.
  • Critical Sharpness $\lambda_c$: Scalable binary search–based estimate involving loss evaluations along update directions, capturing all known sharpness dynamics including EOS and progressive sharpening (Kalra et al., 23 Jan 2026).
  • Directional/Layer-wise Sharpness: Measurement of sharpness along selected directions or per-layer, critical for identifying bottleneck ("straggler") layers that control generalization and stability (Abdollahpoorrostam, 2024).
  • Minimum Sharpness: For ReLU and other scale-invariant networks, a closed-form, layer-aggregated sharpness measure,

$$\text{MS}_\theta = D \left(\prod_{d=1}^{D} \mathrm{Tr}[H_{\theta,d}]\right)^{1/D}$$

ensures invariance under positive homogeneous transformations and correlates with generalization (Ibayashi et al., 2021).
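Assuming per-layer Hessian traces are available, the depth-scaled geometric-mean aggregation can be sketched as follows (hypothetical helper; the second call illustrates the rescaling invariance, since rebalancing two layers leaves the product of traces unchanged):

```python
import math

def minimum_sharpness(layer_traces):
    """Depth-scaled geometric mean of per-layer Hessian traces:
    MS = D * (prod_d Tr[H_d]) ** (1/D), computed in log space for stability."""
    D = len(layer_traces)
    log_mean = sum(math.log(t) for t in layer_traces) / D
    return D * math.exp(log_mean)

ms = minimum_sharpness([4.0, 1.0])           # 2 * sqrt(4 * 1) = 4.0
ms_rescaled = minimum_sharpness([8.0, 0.5])  # layers rebalanced by c, 1/c: still 4.0
```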

When tracking critical sharpness at training scale, it is recommended to estimate $\lambda_c$ at regular intervals, compare it to $2/\eta$, and adapt learning rates or data-mixing protocols accordingly (Kalra et al., 23 Jan 2026).

4. Algorithmic and Optimization Implications of Critical Sharpness

Critical sharpness constrains the maximal step size for stable descent. Learning-rate schedulers and linesearch techniques interact nontrivially with the emergent curvature:

  • Armijo Linesearch: Tends to pick sub-EOS step sizes, leading to excessive sharpness growth and suboptimal loss decrease (Roulet et al., 2023).
  • Polyak Step-size: Operates by design at, or just above, the EOS threshold, yielding faster convergence and tighter control of critical sharpness.
  • Adaptive Schemes: Hybrid and curvature-adaptive tuners that explicitly monitor or control the condition $\eta_t\,\lambda_{\max}(t) \approx 2$ may yield improved optimization trajectories, as stepwise stabilization at the EOS mitigates the risk of explosive sharpness growth and non-convergence.
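The Polyak step-size rule can be written in a few lines; on a single quadratic mode it self-tunes the step below the instability point (the EOS behavior described above emerges only in full nonlinear training). A minimal sketch assuming the optimal loss value $L^\ast = 0$ is known, with hypothetical names:

```python
def polyak_step(loss, grad_sq_norm, loss_star=0.0):
    """Polyak step size: eta_t = (L(theta_t) - L*) / ||grad L(theta_t)||^2."""
    return (loss - loss_star) / grad_sq_norm if grad_sq_norm > 0 else 0.0

# On L(x) = 0.5 * lam * x**2 the rule gives eta = 1 / (2 * lam): the
# product eta * lam is pinned at 0.5 regardless of lam or x, so the step
# automatically shrinks as curvature grows.
lam, x = 10.0, 3.0
eta = polyak_step(loss=0.5 * lam * x ** 2, grad_sq_norm=(lam * x) ** 2)
```

This curvature-tracking property, rather than a fixed step, is what keeps $\eta_t\,\lambda_{\max}$ bounded as sharpness evolves during training.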

Practically, monitoring critical sharpness during training permits safe, aggressive learning-rate scheduling, early-warning for loss landscape instability, and principled tuning of optimizer and data-mix parameters (Kalra et al., 23 Jan 2026).

5. Critical Sharpness and Generalization: Empirical Observations and Layer-Wise Phenomena

Classic theoretical intuition associates flat minima (low sharpness) with improved generalization. In simple architectures and small-scale networks, empirical evidence supports a strong negative correlation between sharpness and generalization gap. However, in modern architectures such as transformer-based foundation models (e.g., CLIP), this association breaks down: global sharpness does not reliably predict generalization on out-of-distribution (OOD) data. Instead:

  • Straggler Layers: Interpolated models may contain specific "straggler" layers whose layer-wise sharpness dips near zero at the interpolation optimum. These layers, not global sharpness, tightly govern generalization behavior and OOD accuracy for robust fine-tuning protocols (Abdollahpoorrostam, 2024).
  • Layer-wise Interventions: By inducing sparsity in straggler layers identified by critical sharpness analysis, generalization failures ("failure modes") during robust fine-tuning are systematically eliminated, restoring OOD gains and supporting a layer-localized version of the "flat minima" generalization principle.

Hence, sharpness-generalization relationships must be interpreted in a layer-wise and regime-specific manner, particularly in overparameterized models.

6. Critical Sharpness in Percolation Theory: Phase Transitions and Sharp Thresholds

In statistical physics, "critical sharpness" denotes the abruptness of the phase transition in percolation, either on discrete graphs (Bernoulli percolation) or in continuous media (Poisson Boolean percolation). For Bernoulli bond percolation on infinite vertex-transitive graphs, the critical threshold $p_c$ marks the probability $p$ at which an infinite open cluster emerges. The transition is said to be "sharp" if subcritical connectivity probabilities decay exponentially and the infinite cluster appears promptly above $p_c$ (Vanneuville, 2022).

Key results:

  • Sharpness Inequalities: Theorems establish that, for $p < p_c$, the probability $\theta_n(p)$ of a connection from the origin to distance $n$ decays as $C \cdot 2^{-n/m}$, while for $p > p_c$ the supercritical connection probability is lower bounded linearly.
  • Extension to Poisson Boolean Percolation: In continuum models with power-law distributed radii, subcritical "almost sharpness" (exponential decay up to a power-law correction) holds for all but a countable set of parameters, under minimal moment assumptions (Dembin et al., 2022).

The sharpness of the phase transition is thus a universal feature characterized by rapid qualitative change in connectivity or component size as parameters cross critical values.
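The sharp-threshold phenomenon is easy to observe in simulation. The following is a minimal Monte Carlo sketch for site percolation on a finite square grid (a closely related model with $p_c \approx 0.5927$, rather than the Bernoulli bond percolation discussed above): the top-to-bottom crossing probability jumps from near 0 to near 1 as $p$ crosses the threshold.

```python
import numpy as np
from collections import deque

def crosses(n, p, rng):
    """Open each site of an n x n grid with probability p, then BFS for a
    top-to-bottom path through open, 4-connected sites."""
    open_ = rng.random((n, n)) < p
    seen = np.zeros((n, n), dtype=bool)
    queue = deque()
    for j in range(n):
        if open_[0, j]:
            seen[0, j] = True
            queue.append((0, j))
    while queue:
        i, j = queue.popleft()
        if i == n - 1:
            return True  # reached the bottom row
        for a, b in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= a < n and 0 <= b < n and open_[a, b] and not seen[a, b]:
                seen[a, b] = True
                queue.append((a, b))
    return False

rng = np.random.default_rng(0)
n, trials = 40, 200
below = sum(crosses(n, 0.45, rng) for _ in range(trials)) / trials  # p well below p_c: ~0
above = sum(crosses(n, 0.75, rng) for _ in range(trials)) / trials  # p well above p_c: ~1
```

Sweeping $p$ through the critical window shows the crossing probability steepening toward a step function as $n$ grows, a finite-size signature of the sharp threshold.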


Summary Table: Critical Sharpness Across Domains

| Domain | Critical Sharpness Manifestation | Quantitative Threshold |
|---|---|---|
| Deep Learning (GD/SGD) | Onset of optimization instability | $\lambda_{\max} \simeq S_c = 2/\eta$ |
| LLMs (scalable estimates) | $\lambda_c$ rises to EOS, tracks $\lambda_{\max}$ | $\lambda_c \simeq 2/\eta$, measured efficiently |
| CLIP / Foundation Models | Straggler layer's sharpness controls OOD | Layer-wise sharpness near zero signals failure |
| Bernoulli Percolation | Exponential subcritical decay, mean-field onset | $p_c$; $\theta_n(p)$ decays sharply for $p < p_c$ |
| Poisson Boolean Percolation | Exponential decay up to power-law correction, fast slab percolation | $\lambda_c(\mu)$, with sharp threshold for radii exponents |

Critical sharpness thus provides a unifying framework for threshold phenomena wherein sharp transitions in system behavior are governed by local curvature, spectral, or connectivity properties. In both high-dimensional optimization and statistical mechanics, identifying and controlling critical sharpness is fundamental for predicting stability, generalization, and phase characteristics (Ibayashi et al., 2021, Kalra et al., 23 Jan 2026, Abdollahpoorrostam, 2024, Roulet et al., 2023, Yoo et al., 7 Jun 2025, Vanneuville, 2022, Dembin et al., 2022).
