
Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent (2503.22478v1)

Published 28 Mar 2025 in cs.LG, cs.AI, and math.OC

Abstract: We show that the behavior of stochastic gradient descent is related to Bayesian statistics by showing that SGD is effectively diffusion on a fractal landscape, where the fractal dimension can be accounted for in a purely Bayesian way. By doing this we show that SGD can be regarded as a modified Bayesian sampler which accounts for accessibility constraints induced by the fractal structure of the loss landscape. We verify our results experimentally by examining the diffusion of weights during training. These results offer insight into the factors which determine the learning process, and seemingly answer the question of how SGD and purely Bayesian sampling are related.

Summary

  • The paper proposes modeling Stochastic Gradient Descent (SGD) dynamics as fractional diffusion on fractal neural network loss landscapes.
  • The paper links loss landscape geometry, via the local learning coefficient $\lambda(w)$, and dynamic accessibility, via the spectral dimension $d_s$, providing numerical evidence that $d_s \le \lambda(w)$ holds along SGD paths.
  • This theory suggests SGD is "Almost Bayesian," sampling based on loss and dynamic accessibility determined by the fractal landscape geometry.

The paper "Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent" (2503.22478) investigates the relationship between the dynamics of Stochastic Gradient Descent (SGD) and Bayesian inference within the framework of Singular Learning Theory (SLT). It proposes that SGD behaves as a diffusion process on a fractal loss landscape, governed by a Fractional Fokker-Planck Equation (FFPE), and that this perspective reconciles SGD dynamics with Bayesian principles by incorporating accessibility constraints imposed by the landscape's geometry.

Introduction and Singular Learning Theory Context

Understanding the optimization dynamics and generalization properties of deep neural networks remains a central challenge. Classical statistical methods often falter due to the singular nature of deep learning loss landscapes, characterized by degenerate minima and non-identifiable parameters. Singular Learning Theory (SLT) provides tools to analyze such landscapes, notably through the local learning coefficient (LLC, $\lambda(w)$). The LLC quantifies the geometric complexity around a parameter $w$, describing how the volume of parameter space with loss below a given threshold scales. Specifically, for small $\epsilon > 0$, the volume $V(B(w^*, \epsilon))$ of parameters within a ball $B(w^*)$ around a minimum $w^*$ such that $L(w) \le L(w^*) + \epsilon$ scales as $V(B(w^*, \epsilon)) \propto \epsilon^{\lambda(w^*)}$. While $\lambda(w)$ offers geometric insight from a Bayesian viewpoint, its direct connection to the dynamics induced by SGD has been less clear. This work aims to establish that connection by modeling SGD as a diffusion process influenced by the fractal geometry captured by $\lambda(w)$.

SGD Dynamics as Fractional Diffusion on a Fractal Landscape

The core argument posits that the standard Fokker-Planck Equation (FPE), which describes normal Brownian motion, is insufficient to model SGD dynamics in neural networks. Empirical evidence suggests that the mean squared displacement (MSD) of weights during SGD training exhibits anomalous subdiffusion, scaling as $\langle ||w(t) - w(0)||^2 \rangle \propto t^\alpha$ with $\alpha < 1$, rather than the linear scaling ($\alpha = 1$) characteristic of normal diffusion.
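As a concrete illustration, this exponent can be estimated from weight snapshots saved during training. The sketch below (a minimal single-trajectory estimate; `snapshots`, the fixed recording interval, and all names are hypothetical, not the paper's code) fits $\alpha$ as the slope of log-MSD against log-time.

```python
import numpy as np

def msd_exponent(snapshots, dt=1.0):
    """Estimate the anomalous diffusion exponent alpha from weight snapshots.

    snapshots: array of shape (T, n_params) holding flattened weights w(t)
    recorded at equal step intervals. Returns alpha such that
    <||w(t) - w(0)||^2> ~ t^alpha; alpha < 1 indicates subdiffusion.
    """
    w = np.asarray(snapshots)
    # Squared displacement from the initial point at each lag t
    # (summing over coordinates averages over weight dimensions).
    msd = np.sum((w[1:] - w[0]) ** 2, axis=1)
    t = dt * np.arange(1, len(w))
    # alpha is the slope of log MSD versus log t.
    alpha, _ = np.polyfit(np.log(t), np.log(msd), deg=1)
    return alpha
```

On a fractal landscape the fit should yield $\alpha < 1$; a value near 1 would instead indicate ordinary Brownian diffusion. A single trajectory gives a noisy estimate, so averaging over several runs is advisable.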

To capture this subdiffusive behavior, the paper proposes modeling SGD dynamics using a Fractional Fokker-Planck Equation (FFPE):

$$\frac{\partial p(w, t)}{\partial t} = \mathcal{D}_t^{1-\alpha} \left[ \nabla \cdot (\nabla \mathcal{L}_m[w]\, p(w, t)) + \frac{D_0}{\gamma} \Delta p(w, t) \right]$$

Here, $p(w,t)$ is the probability density of the weights $w$ at time $t$, $\mathcal{L}_m[w]$ is the empirical loss, $D_0$ is a bare diffusion constant, $\gamma$ is related to friction, and $\mathcal{D}_t^{1-\alpha}$ is the Caputo fractional derivative of order $1-\alpha$ ($0 < \alpha < 1$). The fractional derivative introduces memory effects, characteristic of processes in complex or disordered media, such as diffusion on fractals. The potential driving the diffusion is the empirical loss $\mathcal{L}_m[w]$.
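For reference (a standard definition, not specific to this paper), the Caputo fractional derivative of order $1-\alpha$ acting on a function $f$ can be written as the memory integral

$$\mathcal{D}_t^{1-\alpha} f(t) = \frac{1}{\Gamma(\alpha)} \int_0^t \frac{f'(\tau)}{(t-\tau)^{1-\alpha}}\, d\tau,$$

so the evolution of $p(w,t)$ at time $t$ is weighted by its entire history rather than its current state alone, which is the formal source of the memory effects mentioned above.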

The justification for using the FFPE stems from interpreting the loss landscape explored by SGD as having a fractal structure. The paper establishes a link between the LLC from SLT and the mass fractal dimension ($d_f$). The definition of $\lambda(w^*)$ via volume scaling $V \propto \epsilon^{\lambda(w^*)}$ is mathematically analogous to the definition of the mass dimension $d_f$, where the "mass" (or accessible volume) within a region of scale $\epsilon$ scales as $M(\epsilon) \propto \epsilon^{d_f}$. The paper therefore identifies the LLC as the local mass fractal dimension of the loss landscape, particularly near metastable states:

$$d_f(w) \approx \lambda(w)$$

This identification is supported by the "Near Stability Hypothesis," which suggests that SGD spends significant time exploring regions near such stable points where the SLT characterization is applicable.
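To make the volume-scaling definition concrete, here is a minimal Monte Carlo sketch of how the exponent could be probed numerically (this is not the paper's estimator; `loss_fn`, `w_star`, and the sampling radius are hypothetical): sample perturbations around a minimum, measure the fraction of samples with loss within $\epsilon$ of $L(w^*)$ for several $\epsilon$, and fit the log-log slope.

```python
import numpy as np

def volume_scaling_exponent(loss_fn, w_star, radius=0.1,
                            eps_grid=(1e-3, 3e-3, 1e-2, 3e-2),
                            n_samples=10_000, seed=0):
    """Crude Monte Carlo probe of V(eps) ~ eps^lambda near a minimum w_star."""
    rng = np.random.default_rng(seed)
    d = w_star.size
    base = loss_fn(w_star)
    # Uniform samples in a ball: random directions scaled by U^(1/d).
    dirs = rng.standard_normal((n_samples, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = radius * rng.random(n_samples) ** (1.0 / d)
    points = w_star + radii[:, None] * dirs
    losses = np.array([loss_fn(p) for p in points])
    # Fraction of the ball with loss below base + eps, for each eps.
    fractions = np.array([np.mean(losses <= base + eps) for eps in eps_grid])
    fractions = np.clip(fractions, 1e-12, None)  # guard against log(0)
    lam, _ = np.polyfit(np.log(eps_grid), np.log(fractions), deg=1)
    return lam
```

In realistic network dimensions this direct estimator is statistically hopeless (the fractions underflow), which is why SLT work relies on specialized LLC estimators; the sketch only illustrates the scaling definition itself.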

Connecting Fractal Geometry to SGD Dynamics

The dynamics of diffusion on a fractal are governed not only by its mass dimension ($d_f$) but also by its spectral dimension ($d_s$). The spectral dimension characterizes the connectivity and recurrence properties of a random walk on the fractal, measuring how the number of distinct sites visited scales with time $t$, typically as $t^{d_s/2}$. A lower $d_s$ implies slower exploration and an increased likelihood of revisiting sites, indicative of trapping or tortuous paths.

The interplay between $d_f \approx \lambda(w)$ and $d_s$ determines the anomalous diffusion exponent observed in the MSD of weights. The displacement $R(t) = \sqrt{\langle ||w(t) - w(0)||^2 \rangle}$ scales with the walk dimension ($d_{walk}$) as $R(t) \sim t^{1/d_{walk}}$. For diffusion on fractals, it is known that $d_{walk} = 2 d_f / d_s$. Substituting $d_f \approx \lambda(w)$ gives:

$$R(t) \sim t^{\frac{d_s}{2\lambda(w)}}$$

This directly links the observed subdiffusive MSD exponent $\alpha = d_s / \lambda(w)$ to the geometric property $\lambda(w)$ and the dynamic property $d_s$ of the fractal landscape.
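As a worked example of these relations (with illustrative numbers, not values reported in the paper), taking $\lambda(w) = 2$ and $d_s = 1$ gives

$$d_{walk} = \frac{2\lambda(w)}{d_s} = 4, \qquad R(t) \sim t^{1/4}, \qquad \langle ||w(t) - w(0)||^2 \rangle \sim t^{1/2},$$

i.e. a strongly subdiffusive walk with $\alpha = d_s/\lambda(w) = 0.5$.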

Furthermore, the paper derives a theoretical constraint relating these dimensions in the long-time limit (Lemma 3.2):

$$d_s \leq \lambda(w(t))$$

and correspondingly for the time-averaged LLC, $\bar{\lambda}(w(t))$ (Corollary 3.2). This inequality implies that the rate of exploration, captured by $d_s$, is fundamentally limited by the local volume or density of states, captured by $\lambda(w)$. Intuitively, the diffusion process cannot explore the space faster than the underlying fractal structure permits.

To facilitate analysis at a macroscopic scale, an effective diffusion coefficient ($D_\xi$) is introduced via homogenization. This coefficient depends on the local diffusion exponent $\nu(w) = 1/\alpha = \lambda(w)/d_s$ (note: the paper uses $\nu \ge 2$ for the displacement scaling $R(t) \sim t^{1/\nu}$, so $\nu = 2\lambda(w)/d_s$ here) and a characteristic length scale $\xi$:

$$D_\xi(w) = \xi^{2 - 1/\nu(w)} = \xi^{2 - d_s / (2\lambda(w))}$$

This $D_\xi(w)$ captures the effective mobility of the SGD process, averaged over the complexities of the fractal structure at scale $\xi$.
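A direct numerical reading of the formula, under the $\nu = 2\lambda(w)/d_s$ convention above (the values are illustrative, not from the paper):

```python
def effective_diffusion(xi, lam, d_s):
    """Effective diffusion coefficient D_xi = xi^(2 - d_s / (2 * lam))."""
    return xi ** (2.0 - d_s / (2.0 * lam))

# For xi < 1, a smaller ratio d_s / (2 * lam) yields a smaller D_xi:
# geometrically complex regions (large lam relative to d_s) are harder
# to traverse than well-connected ones.
print(effective_diffusion(0.1, lam=2.0, d_s=1.0))  # ~0.018
print(effective_diffusion(0.1, lam=2.0, d_s=2.0))  # ~0.032
```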

The "Almost Bayesian" Nature of SGD

The connection to Bayesian inference arises from examining the steady-state solution of the FFPE. Assuming the effective diffusion coefficient $D_\xi(w)$ is approximately constant within a local region $\mathcal{W}$, the FFPE admits a steady-state distribution ($\partial p / \partial t = 0$) given by (Lemma 3.1):

$$p_{s}(w) \propto \exp\left( -\frac{\gamma \mathcal{L}_m[w]}{D_\xi(w)} \right)$$

This distribution resembles a Boltzmann distribution with an effective temperature proportional to $D_\xi(w)/\gamma$.

Comparing this to the Bayesian posterior $p(w|X_m) \propto \rho(w)\, p(X_m|w)$, where $\rho(w)$ is the prior and $p(X_m|w) \propto e^{-m\mathcal{L}_m[w]}$ (assuming $\mathcal{L}_m$ is the normalized negative log-likelihood), a resemblance emerges (Corollary 3.1). The standard Bayesian posterior corresponds to a Gibbs distribution $e^{-m\beta \mathcal{L}_m[w]}$ (ignoring the prior), where $\beta$ is an inverse temperature. If we relate $\beta$ to a diffusion constant $D$ via the Einstein relation ($D \propto 1/\beta$), the standard Bayesian approach implicitly assumes a constant effective temperature or diffusion coefficient across the parameter space.

In contrast, the SGD steady-state distribution involves $D_\xi(w)$, which depends explicitly on the local fractal geometry ($\lambda(w)$) and connectivity ($d_s$). SGD is therefore "Almost Bayesian" in the sense that its stationary distribution is analogous to a Bayesian posterior, but modified by a state-dependent effective temperature/diffusion coefficient $D_\xi(w)$. This modification incorporates dynamical accessibility constraints: regions of the loss landscape that are energetically favorable (low $\mathcal{L}_m[w]$) but difficult to reach due to complex fractal structure (low $d_s$ relative to $\lambda(w)$, leading to small $D_\xi(w)$) receive less probability mass under the SGD steady state than under the pure Bayesian posterior. SGD's exploration is thus biased towards regions that are not only low-loss but also dynamically accessible within the fractal terrain.
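The accessibility bias can be made concrete on a toy one-dimensional landscape (everything here is illustrative, not from the paper): two minima of equal depth, one placed in a region with a hypothetically smaller $D_\xi$, receive different mass under $p_s$ even though a constant-temperature Gibbs distribution weights them equally.

```python
import numpy as np

# Toy 1D loss with two equally deep minima at w = -1 and w = +1.
w = np.linspace(-3.0, 3.0, 2001)
dw = w[1] - w[0]
loss = (w ** 2 - 1.0) ** 2

# Hypothetical state-dependent effective diffusion: the right-hand
# basin is assumed to sit in a less accessible (smaller D_xi) region.
D_xi = np.where(w > 0, 0.05, 0.10)
gamma = 1.0

# SGD steady state p_s(w) ~ exp(-gamma * L(w) / D_xi(w)), normalized.
p_sgd = np.exp(-gamma * loss / D_xi)
p_sgd /= p_sgd.sum() * dw

# Constant-temperature Gibbs/Bayesian reference at the left-basin
# temperature, which by symmetry puts mass 0.5 in each basin.
p_gibbs = np.exp(-(gamma / 0.10) * loss)
p_gibbs /= p_gibbs.sum() * dw

right = w > 0
print("right-basin mass, SGD steady state:", p_sgd[right].sum() * dw)
print("right-basin mass, Gibbs reference :", p_gibbs[right].sum() * dw)
```

The Gibbs reference assigns the right basin exactly half the mass, while the SGD steady state assigns it noticeably less (roughly 0.41 here): the less accessible basin is penalized despite having the same loss.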

Experimental Validation and Numerical Results

The paper provides numerical experiments using fully connected networks on MNIST to support its theoretical claims:

  1. Anomalous Diffusion: Measurements of weight displacement $R(t)$ confirm power-law scaling $R(t) \sim t^{1/(2\nu)}$ with $\nu \ge 1$, consistent with subdiffusion (Fig 1). This supports the use of the FFPE framework.
  2. Dimensional Constraint: Estimates of the spectral dimension $d_s$ (derived from the diffusion exponent) and the LLC $\lambda(w)$ (estimated using methods from SLT) are computed during training. The results consistently show $d_s \leq \lambda(w)$ and $d_s \leq \bar{\lambda}(w)$ (time-averaged LLC), validating the theoretical inequality derived from fractal diffusion principles (Fig 2, Fig 3). This is a strong numerical result supporting the fractal diffusion model.
  3. Diffusion Exponent Behavior: Histograms of the estimated diffusion exponent $\nu(w)$ show that it tends to concentrate towards higher values across different runs (Fig 4). Since higher $\nu$ corresponds to relatively faster diffusion (closer to normal diffusion), this suggests SGD tends to find and explore regions of the landscape that are more dynamically accessible within the fractal structure, aligning with the properties of the derived steady-state distribution $p_s(w)$.
  4. Correlation with Generalization: Both the estimated LLC $\lambda(w)$ and the spectral dimension $d_s$ correlate with the generalization error of the final model (Fig 5). This links the geometric ($\lambda$) and dynamic ($d_s$) fractal properties identified by the theory to model performance, suggesting their relevance for understanding generalization.

Practical Implications and Implementation Considerations

While highly theoretical, this work offers several practical implications:

  1. Interpreting SGD: It provides a physics-based framework for understanding why SGD finds certain types of solutions. The concept of accessibility ($D_\xi$) explains why SGD might prefer wider, more accessible minima over potentially deeper but geometrically complex ones, complementing energy-based arguments.
  2. Hyperparameter Tuning: The learning rate, batch size, and momentum affect the effective diffusion and noise scale in SGD. This theory suggests their impact could be analyzed in terms of how they modify the exploration dynamics ($d_s$) and the interaction with the fractal geometry ($\lambda(w)$) characterized by $D_\xi(w)$.
  3. Algorithm Design: The insights could inspire new optimization algorithms. For instance, methods could adaptively control the effective temperature or diffusion based on estimates of local $\lambda(w)$ and $d_s$ to enhance exploration or convergence in specific landscape regions.
  4. Training Diagnostics: Monitoring estimates of $\lambda(w)$ and $d_s$ during training could serve as diagnostic tools.
    • Estimating $\lambda(w)$ typically requires techniques like analyzing the loss Hessian eigenspectrum near convergence or using tools like WBIC. This remains computationally intensive.
    • Estimating $d_s$ involves tracking the MSD of weights over time windows: $\langle ||w(t+\Delta t) - w(t)||^2 \rangle \propto (\Delta t)^{d_s/\lambda(w)}$. This requires storing weight trajectories and performing statistical analysis, adding overhead; a minimal sketch of such a rolling estimate appears after this list. The accuracy depends on the choice of time window $\Delta t$ and sufficient statistics.
  5. Limitations: The theory relies on several assumptions, such as the applicability of the FFPE, the identification $d_f \approx \lambda(w)$, the near-stability hypothesis, and local constancy of $D_\xi$ for the steady-state analysis. The experimental validation is currently limited to specific architectures and datasets. The computational cost of estimating $\lambda(w)$ and $d_s$ hinders their routine use.
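A sketch of the diagnostic described in item 4 (names and window choices are hypothetical; as a single-trajectory estimate it is noisy and only indicative):

```python
import numpy as np

def rolling_diffusion_exponent(snapshots, window=50, dt=1.0):
    """Rolling estimate of the local MSD exponent from weight snapshots.

    For each window of consecutive snapshots w(t), fits the slope of
    log ||w(t0 + dt') - w(t0)||^2 versus log dt'; under the paper's
    model this slope estimates d_s / lambda(w) in that window.
    """
    w = np.asarray(snapshots)
    exponents = []
    for start in range(len(w) - window):
        block = w[start:start + window]
        # Squared displacement from the window's first snapshot.
        disp = np.sum((block[1:] - block[0]) ** 2, axis=1)
        lags = dt * np.arange(1, window)
        slope, _ = np.polyfit(np.log(lags), np.log(disp), deg=1)
        exponents.append(slope)
    return np.array(exponents)
```

Plotting this series against training time would show whether the run drifts toward more accessible (larger exponent) regions, as the theory's steady-state analysis suggests.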

Conclusion

The paper "Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent" presents a compelling theoretical framework that unifies SGD optimization dynamics, the geometric complexity described by SLT ($\lambda(w)$), and Bayesian inference. By modeling SGD as diffusion on a fractal landscape governed by an FFPE, it explains the observed anomalous diffusion of weights and interprets SGD as a modified Bayesian sampler whose exploration is constrained by the dynamic accessibility ($d_s$, $D_\xi$) of the fractal loss landscape. This perspective offers a deeper understanding of how SGD navigates complex energy surfaces and selects solutions, supported by numerical experiments validating key theoretical predictions such as $d_s \le \lambda(w)$.
