Critical Sharpness in Deep Learning & Percolation
- Critical sharpness is the phenomenon where deep learning models shift from stable optimization to oscillatory behavior and percolation systems undergo a rapid phase transition from non-percolation to percolation.
- It is quantified in deep learning by the maximum Hessian eigenvalue and its scalable surrogate $\lambda_c$, and in percolation theory by sharp threshold inequalities that indicate abrupt connectivity changes.
- Understanding critical sharpness enables practitioners to adjust learning rates, control stability in neural networks, and interpret phase transitions in statistical physics.
Critical sharpness refers to distinct threshold phenomena in both statistical physics (percolation theory) and deep learning theory, where a system transitions sharply from one regime to another: in percolation, from non-percolating to percolating phases; in neural networks, from stable optimization to edge-of-stability dynamics. Although originating from different mathematical and empirical traditions, both share geometric and statistical underpinnings rooted in the abrupt change of connectivity or training dynamics at critical parameter values.
1. Critical Sharpness in Deep Learning: Definitions and Quantitative Frameworks
The central object in the deep learning context is the sharpness of a minimum $\theta$ of the loss landscape, classically quantified by the maximum eigenvalue of the Hessian, $\lambda_{\max}(\nabla^2 L(\theta))$, which measures local curvature. Critical sharpness demarcates the onset of instability in iterative optimization, particularly for first-order methods such as (stochastic) gradient descent (SGD).
For full-batch gradient descent with fixed step size $\eta$, the classical quadratic stability analysis yields convergence of all modes if and only if $\eta \lambda_i \le 2$ for all Hessian eigenvalues $\lambda_i$, i.e., $\lambda_{\max} \le 2/\eta$. Empirical studies confirm that, during training, "progressive sharpening" occurs until this threshold is reached, at which point the Hessian spectrum plateaus and the optimization transitions to a non-monotonic, oscillatory regime, termed the "edge-of-stability" (EOS). Thus, the critical sharpness is precisely $2/\eta$ (Roulet et al., 2023, Yoo et al., 7 Jun 2025).
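The $2/\eta$ threshold can be checked directly on a quadratic: a gradient-descent mode with curvature $\lambda$ contracts by the factor $|1 - \eta\lambda|$ per step, so it converges exactly when $\lambda < 2/\eta$. A minimal sketch on a toy 1-D quadratic (illustrative, not from the cited papers):

```python
# Stability of gradient descent on L(x) = (lam/2) * x^2.
# Each step multiplies x by (1 - eta*lam), so |1 - eta*lam| < 1
# holds exactly when lam < 2/eta.

def gd_final_abs(lam, eta, x0=1.0, steps=100):
    x = x0
    for _ in range(steps):
        x -= eta * lam * x  # gradient step: L'(x) = lam * x
    return abs(x)

eta = 0.1                        # stability threshold: 2/eta = 20
print(gd_final_abs(19.0, eta))   # lam < 20: contracts toward 0
print(gd_final_abs(21.0, eta))   # lam > 20: oscillates and diverges
```

The same calculation, applied per Hessian eigenmode, is what makes $\lambda_{\max} = 2/\eta$ the critical sharpness for the full network.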
For large-scale deep networks, direct estimation of $\lambda_{\max}$ is impractical. The critical sharpness measure $\lambda_c$ is introduced as an efficient surrogate, formally defined as

$$\lambda_c(\theta) = \frac{2}{\eta^*}, \qquad \eta^* = \sup\{\eta > 0 : L(\theta - \eta u) \le L(\theta)\}.$$

Here, $u$ is the update direction, e.g., the gradient or the AdamW-preconditioned direction. $\lambda_c$ approximates the threshold beyond which the loss increases along $u$ and is computable via a handful of forward loss evaluations, making it scalable to large models (Kalra et al., 23 Jan 2026).
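A sketch of such a surrogate, under the definition above: bisect for the largest step $\eta^*$ along the update direction $u$ that does not increase the loss, then report $2/\eta^*$. Function and parameter names here are illustrative, not taken from the cited paper:

```python
import numpy as np

def critical_sharpness(loss, theta, u, eta_max=10.0, iters=30):
    """Estimate lambda_c = 2 / eta*, where eta* is the largest step
    along direction u that does not increase the loss.
    Uses only forward loss evaluations (no Hessian)."""
    base = loss(theta)
    lo, hi = 0.0, eta_max
    for _ in range(iters):            # bisection on the step size
        mid = 0.5 * (lo + hi)
        if loss(theta - mid * u) <= base:
            lo = mid                  # still non-increasing: step further
        else:
            hi = mid                  # loss went up: shrink the step
    return 2.0 / lo if lo > 0 else float("inf")

# Sanity check on a quadratic with curvature lam: stepping along the
# gradient increases the loss exactly when eta > 2/lam, so the
# estimate should recover lam itself.
lam = 4.0
loss = lambda th: 0.5 * lam * float(th @ th)
theta = np.array([1.0, -2.0])
grad = lam * theta                               # exact gradient
print(critical_sharpness(loss, theta, grad))     # ≈ 4.0
```

For a quadratic the estimate coincides with the curvature along $u$; on a real network it tracks the local curvature of the loss along the actual update direction.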
2. Dynamics and Empirical Manifestations of Critical Sharpness
Deep networks trained with non-adaptive (fixed) step sizes universally exhibit progressive sharpening: the leading Hessian eigenvalue increases with training until plateauing at the critical value $2/\eta$, marking the EOS regime where optimization oscillates around this sharpness. Stochastic optimization and the internal structure of the data or network affect the specific trajectory but not the presence of EOS.
Minibatch size, learning rate, data difficulty, and network depth all modulate critical sharpness:
- Learning Rate $\eta$: the critical sharpness threshold equals $2/\eta$, so a higher $\eta$ yields a lower threshold.
- Depth: Deeper networks have increased limiting sharpness, amplifying the progressive sharpening effect.
- Batch Size / Stochasticity: Larger batch sizes or reduced gradient noise produce higher critical sharpness at convergence; SGD noise lowers the observed value, as explained by modifications to the layer imbalance dynamics (Yoo et al., 7 Jun 2025).
- Dataset Difficulty: A larger difficulty measure (the sum of squared label-to-feature ratios) correlates with higher ultimate sharpness at convergence.
These effects are robustly reproduced in both full networks and minimalist linear models, with closed-form expressions for predicted sharpness matching empirical observations (Yoo et al., 7 Jun 2025).
3. Methodologies for Measuring Critical Sharpness and Scale-Invariance
Several methodologies exist for quantifying and interpreting sharpness and its critical regime:
- Largest Hessian Eigenvalue: The direct but computationally expensive approach to measuring sharpness, revealing EOS dynamics.
- Critical Sharpness $\lambda_c$: Scalable binary-search-based estimate involving loss evaluations along update directions, capturing all known sharpness dynamics including EOS and progressive sharpening (Kalra et al., 23 Jan 2026).
- Directional/Layer-wise Sharpness: Measurement of sharpness along selected directions or per-layer, critical for identifying bottleneck ("straggler") layers that control generalization and stability (Abdollahpoorrostam, 2024).
- Minimum Sharpness: For ReLU and other scale-invariant networks, a closed-form, layer-aggregation sharpness measure, obtained by minimizing sharpness over the layer-wise rescalings that leave the network function unchanged, ensures invariance under positive homogeneous transformations and correlates with generalization (Ibayashi et al., 2021).
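The direct approach in the first bullet can be sketched with finite-difference Hessian-vector products plus power iteration; this is generic numerical linear algebra, not an implementation from any of the cited works:

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Hessian-vector product via central finite differences of the gradient."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

def lambda_max(grad_fn, theta, iters=100, seed=0):
    """Leading Hessian eigenvalue by power iteration on Hessian-vector products."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(grad_fn, theta, v))   # Rayleigh quotient

# Toy loss with Hessian diag(3, 1): the leading eigenvalue is 3.
H = np.diag([3.0, 1.0])
grad_fn = lambda th: H @ th
print(lambda_max(grad_fn, np.array([0.5, 0.5])))  # ≈ 3.0
```

Each power-iteration step costs two gradient evaluations, which is why this direct route becomes expensive at scale and motivates the forward-evaluation surrogate $\lambda_c$.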
When tracking critical sharpness at training scale, it is recommended to estimate $\lambda_c$ at regular intervals, compare it to $2/\eta$, and adapt learning rates or data-mixing protocols accordingly (Kalra et al., 23 Jan 2026).
4. Algorithmic and Optimization Implications of Critical Sharpness
Critical sharpness constrains the maximal step size for stable descent. Learning-rate schedulers and linesearch techniques interact nontrivially with the emergent curvature:
- Armijo Linesearch: Tends to pick sub-EOS step sizes, leading to excessive sharpness growth and suboptimal loss decrease (Roulet et al., 2023).
- Polyak Step-size: Operates by design at, or just above, the EOS threshold, yielding faster convergence and tighter control of critical sharpness.
- Adaptive Schemes: Hybrid and curvature-adaptive tuners that explicitly monitor or control the condition $\lambda_{\max} \le 2/\eta$ may yield improved optimization trajectories, as stepwise stabilization at the EOS mitigates the risk of explosive sharpness and non-convergence.
Practically, monitoring critical sharpness during training permits safe, aggressive learning-rate scheduling, early-warning for loss landscape instability, and principled tuning of optimizer and data-mix parameters (Kalra et al., 23 Jan 2026).
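A hedged sketch of such monitoring on a toy quadratic: whenever the measured sharpness exceeds the stability threshold $2/\eta$, halve $\eta$ before stepping. The backoff factor and the stand-in sharpness estimate are illustrative assumptions, not a prescribed scheduler from the cited work:

```python
import numpy as np

H = np.diag([50.0, 1.0])                # true sharpness lambda_max = 50
loss = lambda th: 0.5 * float(th @ H @ th)
grad = lambda th: H @ th

theta = np.array([1.0, 1.0])
eta = 0.1                               # initially unstable: 2/eta = 20 < 50
for step in range(200):
    sharpness = 50.0                    # stand-in for a lambda_c estimate
    while sharpness > 2.0 / eta:        # enforce lambda_max <= 2/eta
        eta *= 0.5                      # back off until stable
    theta = theta - eta * grad(theta)

print(eta)          # backed off to 0.025, where 2/eta = 80 >= 50
print(loss(theta))  # descent is now stable and the loss is near 0
```

In a real training loop the constant `50.0` would be replaced by a periodic $\lambda_c$ estimate from forward loss evaluations, as described above.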
5. Critical Sharpness and Generalization: Empirical Observations and Layer-Wise Phenomena
Classic theoretical intuition associates flat minima (low sharpness) with improved generalization. In simple architectures and small-scale networks, empirical evidence supports a strong negative correlation between sharpness and generalization gap. However, in modern architectures such as transformer-based foundation models (e.g., CLIP), this association breaks down: global sharpness does not reliably predict generalization on out-of-distribution (OOD) data. Instead:
- Straggler Layers: Interpolated models may contain specific "straggler" layers whose layer-wise sharpness dips near zero at the interpolation optimum. These layers, not global sharpness, tightly govern generalization behavior and OOD accuracy for robust fine-tuning protocols (Abdollahpoorrostam, 2024).
- Layer-wise Interventions: By inducing sparsity in straggler layers identified by critical sharpness analysis, generalization failures ("failure modes") during robust fine-tuning are systematically eliminated, restoring OOD gains and supporting a layer-localized version of the "flat minima" generalization principle.
Hence, sharpness-generalization relationships must be interpreted in a layer-wise and regime-specific manner, particularly in overparameterized models.
6. Critical Sharpness in Percolation Theory: Phase Transitions and Sharp Thresholds
In statistical physics, "critical sharpness" denotes the abruptness of the phase transition in percolation, either on discrete graphs (Bernoulli percolation) or continuous media (Poisson Boolean percolation). For Bernoulli bond percolation on infinite vertex-transitive graphs, the critical threshold $p_c$ marks the probability at which an infinite open cluster emerges. The transition is said to be "sharp" if subcritical connectivity probabilities decay exponentially and the infinite cluster appears promptly above $p_c$ (Vanneuville, 2022).
Key results:
- Sharpness Inequalities: Theorems establish that, for $p < p_c$, the probability of a connection from the origin to distance $n$ decays as $e^{-c(p)\,n}$, and for $p > p_c$, the supercritical connection probability is lower bounded linearly, $\theta(p) \ge c\,(p - p_c)$.
- Extension to Poisson Boolean Percolation: In continuum models with power-law distributed radii, subcritical "almost sharpness"—exponential decay up to a power-law correction—holds for all but a countable set of parameters, under minimal moment assumptions (Dembin et al., 2022).
The sharpness of the phase transition is thus a universal feature characterized by rapid qualitative change in connectivity or component size as parameters cross critical values.
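The sharpness of the transition is easy to see numerically. A minimal Monte Carlo sketch using site percolation on a square grid with union-find (the 2-D site threshold is near $p \approx 0.593$, so crossing probabilities jump from near 0 to near 1 across it):

```python
import numpy as np

def crosses(open_sites):
    """Top-to-bottom crossing test via union-find over open sites."""
    n, m = open_sites.shape
    parent = list(range(n * m + 2))     # +2 virtual top/bottom nodes
    TOP, BOT = n * m, n * m + 1
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for i in range(n):
        for j in range(m):
            if not open_sites[i, j]:
                continue
            idx = i * m + j
            if i == 0: union(idx, TOP)
            if i == n - 1: union(idx, BOT)
            if i > 0 and open_sites[i - 1, j]: union(idx, (i - 1) * m + j)
            if j > 0 and open_sites[i, j - 1]: union(idx, i * m + j - 1)
    return find(TOP) == find(BOT)

def crossing_prob(p, n=40, trials=50, seed=0):
    """Monte Carlo estimate of the top-to-bottom crossing probability."""
    rng = np.random.default_rng(seed)
    return sum(crosses(rng.random((n, n)) < p) for _ in range(trials)) / trials

print(crossing_prob(0.45))  # well below threshold: near 0
print(crossing_prob(0.75))  # well above threshold: near 1
```

On finite grids the jump is smoothed over a window that shrinks as the grid grows, which is the finite-size signature of the sharp threshold.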
Summary Table: Critical Sharpness Across Domains
| Domain | Critical Sharpness Manifestation | Quantitative Threshold |
|---|---|---|
| Deep Learning (GD/SGD) | Onset of optimization instability | $\lambda_{\max} = 2/\eta$ |
| LLMs (scalable estimates) | $\lambda_c$ rises to EOS, tracks $\lambda_{\max}$ | $\lambda_c \approx 2/\eta$, measured efficiently |
| CLIP / Foundation Models | Straggler layer’s sharpness controls OOD | Layer-wise sharpness near zero signals failure |
| Bernoulli Percolation | Exponential subcritical decay, mean-field onset | $p_c$; connectivity decays sharply for $p < p_c$ |
| Boolean Poisson Percolation | Exponential decay up to power-law correction, fast slab percolation | Critical intensity, with sharp threshold for all but countably many radii exponents |
Critical sharpness thus provides a unifying framework for threshold phenomena wherein sharp transitions in system behavior are governed by local curvature, spectral, or connectivity properties. In both high-dimensional optimization and statistical mechanics, identifying and controlling critical sharpness is fundamental for predicting stability, generalization, and phase characteristics (Ibayashi et al., 2021, Kalra et al., 23 Jan 2026, Abdollahpoorrostam, 2024, Roulet et al., 2023, Yoo et al., 7 Jun 2025, Vanneuville, 2022, Dembin et al., 2022).