Noise-Adaptive Layerwise Learning Rates

Updated 20 October 2025
  • The paper introduces a noise-adaptive layerwise learning rate scheme that scales updates based on per-layer gradient noise using dual norm estimates.
  • It leverages local curvature and stochastic variability to adjust step sizes dynamically, enhancing convergence in architectures with heterogeneous layer sharpness.
  • Empirical evaluations on transformer models demonstrate up to 1.5× acceleration in training efficiency compared to geometry-aware baselines.

A noise-adaptive layerwise learning rate scheme is an optimization strategy for deep neural networks (DNNs) that dynamically adjusts learning rates at the granularity of individual layers in response to stochastic gradient noise and local geometric properties. Rather than using a global or group-wise constant, these methods estimate layer-level noise—often in the dual norm corresponding to the layer’s geometry—and scale updates such that step sizes reflect both curvature and instantaneous stochastic variability. This enhances convergence and robustness, particularly in architectures exhibiting heterogeneous layerwise sharpness or noise profiles, such as transformer models. The approach can be implemented on top of geometry-aware optimization methods, substantially accelerating training compared to conventional schemes with fixed rates across layers (Hao et al., 15 Oct 2025).

1. Rationale and Motivation

Deep networks comprise diverse layers with distinct feature types, nonlinearities, and norm structures; this manifests as pronounced heterogeneity in curvature (sharpness), gradient statistics, and stochastic noise magnitude across layers. Geometry-aware optimizers (e.g., Muon) apply norm-constrained updates tailored to parameter groups, yet within a group the stochastic noise and curvature, measured in the appropriate induced dual norms, still vary dynamically across layers and over training.

Recent empirical investigations have identified substantial discrepancies in gradient noise among individual transformer layers (e.g., MLP or attention submodules) and have linked fixed intra-group learning rates to inefficiencies: overly conservative steps for "quieter" layers and unstable or ineffective updates under high noise. This motivates a finer-grained, noise-adaptive layerwise approach: allocating per-layer learning rates that scale inversely with instantaneous noise levels, allowing for more aggressive updates where the signal is reliable and safer steps where the stochasticity is dominant (Hao et al., 15 Oct 2025).

2. Formal Layerwise Noise-Adaptive Scheme

The core methodology maintains, for each layer $\ell$ at iteration $t$, a running estimate $H_t^\ell$ of the gradient noise magnitude, calculated in the dual norm associated with the layer's parameter geometry (induced by the Linear Minimization Oracle, LMO). $H_t^\ell$ is updated via an exponential moving average of squared differences between successive gradients (or, optionally, independent gradient estimates from the same batch), yielding

$$H_t^\ell = \beta_2\, H_{t-1}^\ell + (1-\beta_2)\,\bigl\| G_t^\ell - \tilde{G}_t^\ell \bigr\|_*^2$$

where $G_t^\ell$ is the current gradient, $\tilde{G}_t^\ell$ a reference gradient (for efficiency, often the previous gradient), and $\|\cdot\|_*$ the relevant dual norm.
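
As a minimal illustrative sketch (not the paper's reference implementation), the per-layer estimate can be maintained with a single exponential moving average. Here a Frobenius norm stands in for the layer's dual norm, which in general depends on the chosen LMO geometry, and all names are illustrative.

```python
import torch

def update_noise_estimate(H_prev, grad, grad_ref, beta2=0.99):
    """EMA of the squared norm of the difference between the current and a
    reference gradient for one layer:
    H_t = beta2 * H_{t-1} + (1 - beta2) * ||G_t - G~_t||^2.

    The Frobenius norm below is a stand-in; the appropriate dual norm depends
    on the layer's LMO geometry (e.g., nuclear norm for spectral-norm-
    constrained weight matrices)."""
    diff_sq = torch.linalg.norm(grad - grad_ref).pow(2)
    return beta2 * H_prev + (1.0 - beta2) * diff_sq
```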

The effective noise-adaptive scaling factor is

$$\alpha_t^\ell = \frac{\alpha}{\sqrt{\alpha^2 + H_t^\ell}},$$

resulting in a reduced step size when $H_t^\ell$ is large (high noise) and larger steps in low-noise regimes.

To harmonize updates across layers, the learning rate for layer $\ell$ is normalized within its parameter group:

$$\eta_t^\ell = \eta_t\,\frac{\alpha_t^\ell}{\alpha_t^m},$$

where $\alpha_t^m$ is the maximum scaling factor among all layers in the group.
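
A companion sketch of the scaling and normalization step, under the same caveats: `alpha` (the base scale) and the dictionary keys are illustrative, and `H` holds the per-layer estimates produced by the EMA above.

```python
import math

def layerwise_learning_rates(eta_t, H, alpha=1.0):
    """Per-layer learning rates for one parameter group.

    H maps layer name -> current noise estimate H_t^l (scalar).
    Returns eta_t^l = eta_t * alpha_t^l / max_m alpha_t^m."""
    # alpha_t^l = alpha / sqrt(alpha^2 + H_t^l): smaller steps under higher noise.
    scale = {name: alpha / math.sqrt(alpha ** 2 + float(h)) for name, h in H.items()}
    alpha_max = max(scale.values())  # alpha_t^m in the text
    return {name: eta_t * s / alpha_max for name, s in scale.items()}
```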

The update direction is determined by the norm-constrained LMO and scaled accordingly, preserving geometry-aware properties while integrating per-layer stochastic noise adaptation.
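
Putting the pieces together, the following hedged sketch applies one update to a group of 2D weight matrices using the helpers above, with an SVD-based orthogonalization as a simplified stand-in for the norm-constrained LMO direction (Muon-style optimizers typically use a Newton-Schulz iteration instead); it is not the paper's implementation.

```python
import torch

@torch.no_grad()
def layerwise_step(params, grads, prev_grads, H, eta_t, alpha=1.0, beta2=0.99):
    """One noise-adaptive layerwise update over a group of 2D weights.

    params/grads/prev_grads: dicts name -> tensor; H: dict of noise estimates
    (updated in place). The SVD orthogonalization below is only a stand-in
    for the geometry-aware LMO direction."""
    for name, g in grads.items():
        H[name] = update_noise_estimate(H.get(name, torch.zeros(())),
                                        g, prev_grads[name], beta2)
    rates = layerwise_learning_rates(eta_t, H, alpha)
    for name, p in params.items():
        U, _, Vh = torch.linalg.svd(grads[name], full_matrices=False)
        direction = U @ Vh                     # unit-spectral-norm direction (stand-in LMO output)
        p.add_(direction, alpha=-rates[name])  # geometry-aware step, scaled per layer
```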

3. Theoretical Guarantees

The convergence analysis assumes the objective is layerwise smooth (each layer has a Lipschitz gradient constant $L_\ell$) and that the stochastic gradient estimator is unbiased with noise bounded both above and below in the dual norm. By partitioning the noise bound $\sigma_\ell$ per layer instead of globally, the scheme improves the convergence rate to

$$\widetilde{O}\!\left(\frac{1}{\sqrt{T}} + \sqrt{\frac{\sum_\ell \bar{\sigma}_\ell}{T^{1/4}}}\right)$$

where $\bar{\sigma}_\ell$ denotes the maximal noise magnitude in layer $\ell$, and $T$ is the number of optimization steps (Hao et al., 15 Oct 2025). This layerwise refinement yields sharper theoretical bounds compared to analyses assuming uniform global noise, particularly when noise is concentrated in a subset of layers.

The analysis leverages two independent gradient samples for each layer in the computation of $H_t^\ell$ to ensure unbiased estimation of the variance; in practice, autocorrelation effects may be mitigated by using a temporal difference or independent batches for the per-layer estimates.
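
Where two independent gradient estimates are preferred over the temporal difference, one illustrative option (names such as `loss_fn` are assumptions, not the paper's API) is to split a step's data into two micro-batches and difference their per-layer gradients:

```python
import torch

def noise_samples_from_microbatches(model, loss_fn, batch_a, batch_b):
    """Two independent per-layer gradient estimates from disjoint micro-batches;
    their squared norm difference can feed the EMA H_t^l in place of the
    successive-gradient difference. Assumes loss_fn(model, batch) returns a scalar."""
    grads = []
    for batch in (batch_a, batch_b):
        model.zero_grad(set_to_none=True)
        loss_fn(model, batch).backward()
        grads.append({n: p.grad.detach().clone()
                      for n, p in model.named_parameters() if p.grad is not None})
    return {n: torch.linalg.norm(grads[0][n] - grads[1][n]).pow(2)
            for n in grads[0] if n in grads[1]}
```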

4. Empirical Performance

Experiments on transformer architectures (GPT2-small/medium, LLaMA-0.5B/1.1B) trained on large datasets (OpenWebText, C4, MiniPile) demonstrate substantial acceleration in convergence. LANTON, the scheme introduced in (Hao et al., 15 Oct 2025), reaches target training and validation losses with up to 1.5× fewer training tokens than state-of-the-art geometry-aware baselines (AdamW, Muon, SCION, D-Muon). Training/validation loss curves show a faster decline and improved sample efficiency, attributed to more precise step-size scaling in response to per-layer noise.

Notably, transformer layers exhibiting high gradient noise are adaptively assigned lower step sizes, ameliorating stability concerns and precluding inefficient updates. In conventional optimizers lacking this adaptation, fixed rates can lead to slow convergence or oscillatory behavior in high-noise layers while underutilizing quieter layers.

5. Relation to Prior Adaptive Methods

Classic adaptive learning rate schemes (e.g., Adam, SDProp (Ida et al., 2016), AdaSecant (Gulcehre et al., 2017)) assign per-parameter step sizes based on local gradient statistics, often using historical gradient variance (uncentered or centered) as a proxy for noisiness. However, these methods typically operate in the coordinate norm rather than an operator norm and do not account for the geometric group structures present in modern deep models.

Layer-specific adaptive rates based on curvature estimates (e.g., Singh et al., 2015) can partially address vanishing gradients and saddle-point escape but do not actively track stochastic noise magnitude in the group geometry's dual norm. RL-based approaches (Xu et al., 2019) learn global schedules but lack per-layer noise sensitivity. Energy-reliability adaptations (Henwood et al., 2019) optimize parameter storage reliability, offering conceptual parallels; both lines of work recognize inter-layer heterogeneity and seek layerwise optima.

The presented scheme uniquely integrates geometry-awareness (via LMOs and induced dual norms) with dynamic per-layer noise estimation, resulting in updates that respect both structure and stochasticity, particularly crucial in large-scale models where layerwise sharpness evolves throughout training.

6. Practical Implications and Further Directions

Noise-adaptive layerwise learning rate schemes can be incorporated into any framework supporting geometry-aware LMOs; computation per layer is efficient, involving norm evaluations of gradients already computed for parameter updates. Hyperparameter sensitivity is mitigated, as rates adapt automatically with respect to observed noise, though careful tuning of base learning rates and momentum parameters may still be required.
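​
As an illustration of the plumbing only (not a geometry-aware optimizer itself), per-layer rates can be exposed through one parameter group per layer and overwritten every step; `torch.optim.SGD` here is merely a placeholder base optimizer, and the helper names are assumptions.

```python
import torch

def build_layerwise_optimizer(model, base_lr=3e-4):
    """One parameter group per named parameter so each layer's lr can be
    set individually; SGD is a placeholder for a geometry-aware base."""
    groups = [{"params": [p], "name": n, "lr": base_lr}
              for n, p in model.named_parameters()]
    return torch.optim.SGD(groups, lr=base_lr)

def apply_noise_adaptive_rates(optimizer, rates):
    """Overwrite each group's lr with its noise-adaptive per-layer rate."""
    for group in optimizer.param_groups:
        group["lr"] = rates.get(group["name"], group["lr"])
```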

Broader application domains include language modeling, where transformer architectures exhibit pronounced layer noise heterogeneity, and other deep models with complex parameter grouping. Extensions may address scalable adaptation in even larger foundation models, integration with dynamic second-order curvature approximations, and further reduction of manual tuning.

Research into the interplay between gradient noise, learning rate schedules, and the stability of training in dynamically evolving networks remains ongoing. Layerwise adaptation enables finer control, mitigates the inefficiencies of uniform step sizing, and supports robust, accelerated optimization in both theory and practice.

7. Summary Table: Key Aspects of Noise-Adaptive Layerwise Schemes

| Aspect | Classical Geometry-Aware Optimizers | Noise-Adaptive Layerwise Scheme |
|---|---|---|
| Learning rate granularity | Group-wise (fixed within group) | Layerwise (dynamic per layer) |
| Noise estimation | Often absent or global | Per-layer, estimated in dual norm |
| Norm structure | Group geometry (e.g., RMS, nuclear) | Same, but variance uses dual norm |
| Adaptivity target | Curvature or group structure | Curvature and stochastic noise |
| Empirical convergence | Efficient for uniform groups | Accelerated with heterogeneous layers |

The introduction of noise-adaptive layerwise learning rates represents a significant step forward in geometry-aware deep network optimization, resolving inefficiencies of fixed intra-group rates and accommodating the heterogeneous noise environment of large-scale models (Hao et al., 15 Oct 2025).
