Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent (2503.22478v1)
Abstract: We show that the behavior of stochastic gradient descent is related to Bayesian statistics by showing that SGD is effectively diffusion on a fractal landscape, where the fractal dimension can be accounted for in a purely Bayesian way. By doing this we show that SGD can be regarded as a modified Bayesian sampler which accounts for accessibility constraints induced by the fractal structure of the loss landscape. We verify our results experimentally by examining the diffusion of weights during training. These results offer insight into the factors which determine the learning process, and seemingly answer the question of how SGD and purely Bayesian sampling are related.
Summary
- The paper proposes modeling Stochastic Gradient Descent (SGD) dynamics as fractional diffusion on fractal neural network loss landscapes.
- The paper links loss landscape geometry (the local learning coefficient λ(w)) and dynamic accessibility (the spectral dimension ds), providing numerical evidence that ds ≤ λ(w) holds along SGD paths.
- This theory suggests SGD is "Almost Bayesian," sampling based on loss and dynamic accessibility determined by the fractal landscape geometry.
The paper "Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent" (2503.22478) investigates the relationship between the dynamics of Stochastic Gradient Descent (SGD) and Bayesian inference within the framework of Singular Learning Theory (SLT). It proposes that SGD behaves as a diffusion process on a fractal loss landscape, governed by a Fractional Fokker-Planck Equation (FFPE), and that this perspective reconciles SGD dynamics with Bayesian principles by incorporating accessibility constraints imposed by the landscape's geometry.
Introduction and Singular Learning Theory Context
Understanding the optimization dynamics and generalization properties of deep neural networks remains a central challenge. Classical statistical methods often falter due to the singular nature of deep learning loss landscapes, characterized by degenerate minima and non-identifiable parameters. Singular Learning Theory (SLT) provides tools to analyze such landscapes, notably through the concept of the local learning coefficient (LLC, λ(w)). The LLC quantifies the geometric complexity around a parameter w, describing how the volume of the parameter space with loss below a certain threshold scales. Specifically, for a small ϵ>0, the volume V(B(w∗,ϵ)) of parameters within a ball around a minimum w∗ such that L(w) ≤ L(w∗)+ϵ scales as V(B(w∗,ϵ)) ∝ ϵ^{λ(w∗)}. While λ(w) offers geometric insight from a Bayesian viewpoint, its direct connection to the dynamics induced by SGD has been less clear. This work aims to establish this connection by modeling SGD as a diffusion process influenced by the fractal geometry captured by λ(w).
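The volume-scaling definition of the LLC can be made concrete numerically. The sketch below is an illustration, not the paper's estimator: it Monte Carlo samples a toy two-parameter quadratic loss, for which the sub-level-set volume is known to scale as ϵ^{d/2} = ϵ^1, and recovers λ ≈ 1 from a log-log fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy regular quadratic loss in d=2 parameters; here {L <= eps} is a
    # disk of area pi*eps, so the theoretical exponent is lambda = d/2 = 1.
    return w[:, 0] ** 2 + w[:, 1] ** 2

# Monte Carlo estimate of V(eps) = vol{w in [-1,1]^2 : L(w) <= eps}
w = rng.uniform(-1.0, 1.0, size=(200_000, 2))
L = loss(w)
eps = np.array([0.01, 0.02, 0.04, 0.08])
frac = np.array([(L <= e).mean() for e in eps])

# lambda is the slope of log V(eps) vs log eps
lam, _ = np.polyfit(np.log(eps), np.log(frac), 1)
print(f"estimated lambda ~ {lam:.2f}")  # theory: 1.0
```

For a genuinely singular loss the exponent departs from d/2, which is exactly the deviation the LLC is designed to measure.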
SGD Dynamics as Fractional Diffusion on a Fractal Landscape
The core argument posits that the standard Fokker-Planck Equation (FPE), which describes normal Brownian motion, is insufficient to model SGD dynamics in neural networks. Empirical evidence suggests that the mean squared displacement (MSD) of weights during SGD training exhibits anomalous subdiffusion, scaling as ⟨‖w(t)−w(0)‖²⟩ ∝ t^α with α<1, rather than the linear scaling (α=1) characteristic of normal diffusion.
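A minimal sketch of how such an MSD exponent is fitted in practice, using synthetic trajectories constructed to be subdiffusive with a known α (the settings here are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_true = 0.6          # subdiffusive exponent chosen for the demo
T, n_traj, dim = 200, 2000, 5

# Build trajectories whose ensemble MSD is exactly t^alpha by construction:
# independent Gaussian increments with variance t^alpha - (t-1)^alpha per step.
t = np.arange(1, T + 1)
step_var = np.diff(np.concatenate([[0.0], t.astype(float) ** alpha_true]))
steps = rng.normal(size=(n_traj, T, dim)) * np.sqrt(step_var / dim)[None, :, None]
traj = np.cumsum(steps, axis=1)  # w(t) - w(0)

msd = (traj ** 2).sum(axis=2).mean(axis=0)      # <||w(t) - w(0)||^2>
alpha_fit, _ = np.polyfit(np.log(t), np.log(msd), 1)
print(f"fitted alpha ~ {alpha_fit:.2f}")  # should be near 0.6
```

The same log-log fit applied to recorded weight snapshots is what distinguishes subdiffusion (α < 1) from normal diffusion (α = 1).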
To capture this subdiffusive behavior, the paper proposes modeling SGD dynamics using a Fractional Fokker-Planck Equation (FFPE):
∂p(w,t)/∂t = D_t^{1−α} [ ∇·(∇Lm[w] p(w,t)) + γ D0 Δp(w,t) ]
Here, p(w,t) is the probability density of the weights w at time t, Lm[w] is the empirical loss, D0 is a bare diffusion constant, γ is related to friction, and D_t^{1−α} is the Caputo fractional derivative of order 1−α (0<α<1). The fractional derivative introduces memory effects, characteristic of processes in complex or disordered media, such as diffusion on fractals. The potential driving the diffusion is the empirical loss Lm[w].
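The fractional-derivative operator can be made concrete with a short numerical sketch. The code below (illustrative, using the standard Grünwald–Letnikov discretization, which agrees with the Caputo definition when f(0)=0) approximates an order-μ fractional derivative and checks it against the known closed form D^μ t = t^{1−μ}/Γ(2−μ).

```python
import numpy as np
from math import gamma

def gl_fractional_derivative(f, t, mu, h=1e-3):
    """Grunwald-Letnikov approximation of the order-mu fractional derivative
    of f at time t (equals the Caputo derivative when f(0) = 0)."""
    n = int(t / h)
    # GL weights g_k = (-1)^k * binom(mu, k), built by the standard recursion
    g = np.empty(n + 1)
    g[0] = 1.0
    for k in range(1, n + 1):
        g[k] = g[k - 1] * (k - 1 - mu) / k
    ts = t - h * np.arange(n + 1)
    return (g * f(ts)).sum() / h ** mu

mu = 0.4  # e.g. the operator D_t^{1-alpha} with alpha = 0.6
approx = gl_fractional_derivative(lambda s: s, 1.0, mu)
exact = 1.0 / gamma(2 - mu)   # known result: D^mu t = t^(1-mu)/Gamma(2-mu) at t=1
print(approx, exact)
```

The memory effect is visible in the structure of the sum: the derivative at time t weighs the entire history f(t−kh), not just the current value.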
The justification for using the FFPE stems from interpreting the loss landscape explored by SGD as having a fractal structure. The paper establishes a link between the LLC from SLT and the mass fractal dimension (df). The definition of λ(w∗) via volume scaling V ∝ ϵ^{λ(w∗)} is mathematically analogous to the definition of the mass dimension df, where the "mass" (or accessible volume) within a region of scale ϵ scales as M(ϵ) ∝ ϵ^{df}. Therefore, the paper identifies the LLC as the local mass fractal dimension of the loss landscape, particularly near metastable states:
df(w)≈λ(w)
This identification is supported by the "Near Stability Hypothesis," which suggests that SGD spends significant time exploring regions near such stable points where the SLT characterization is applicable.
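The mass-dimension concept behind df can be illustrated on a standard fractal. The sketch below is an analogy rather than the paper's method: it estimates the dimension of the Sierpinski triangle by box counting (which agrees with the mass dimension for self-similar sets like this one); the exact value is log 3 / log 2 ≈ 1.585.

```python
import numpy as np

rng = np.random.default_rng(2)

# Chaos game for the Sierpinski triangle (mass dimension log3/log2 ~ 1.585)
verts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
p = np.array([0.1, 0.1])
pts = np.empty((60_000, 2))
for i in range(len(pts)):
    p = (p + verts[rng.integers(3)]) / 2
    pts[i] = p
pts = pts[1000:]  # discard the transient before the attractor is reached

# Box counting: N(eps) ~ eps^{-d_f}, so d_f = -slope of log N vs log eps
eps_list, counts = [], []
for k in range(2, 8):
    eps = 2.0 ** -k
    boxes = set(map(tuple, np.floor(pts / eps).astype(int)))
    eps_list.append(eps)
    counts.append(len(boxes))

slope, _ = np.polyfit(np.log(eps_list), np.log(counts), 1)
print(f"estimated d_f ~ {-slope:.2f}")  # theory: log(3)/log(2) ~ 1.585
```

The paper's claim is that the sub-level sets of the loss near metastable points scale in this same way, with λ(w) playing the role of df.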
Connecting Fractal Geometry to SGD Dynamics
The dynamics of diffusion on a fractal are governed not only by its mass dimension (df) but also by its spectral dimension (ds). The spectral dimension characterizes the connectivity and recurrence properties of a random walk on the fractal, measuring how the number of distinct sites visited scales with time t, typically as t^{ds/2}. A lower ds implies slower exploration and increased likelihood of revisiting sites, indicative of trapping or tortuous paths.
The interplay between df ≈ λ(w) and ds determines the anomalous diffusion exponent observed in the MSD of weights. The root-mean-square displacement R(t) = ⟨‖w(t)−w(0)‖²⟩^{1/2} scales with the walk dimension dwalk as R(t) ∼ t^{1/dwalk}, and for diffusion on fractals it is known that dwalk = 2df/ds. Substituting df ≈ λ(w) gives:
R(t) ∼ t^{ds/(2λ(w))}
Equivalently, the MSD scales as t^α with α = ds/λ(w). This directly links the observed subdiffusive exponent to the geometric property λ(w) and the dynamic property ds of the fractal landscape.
Furthermore, the paper derives a theoretical constraint relating these dimensions in the long-time limit (Lemma 3.2):
ds≤λ(w(t))
and correspondingly for the time-averaged LLC, λˉ(w(t)) (Corollary 3.2). This inequality implies that the rate of exploration, captured by ds, is fundamentally limited by the local volume or density of states, captured by λ(w). Intuitively, the diffusion process cannot explore the space faster than the underlying fractal structure permits.
To facilitate analysis at a macroscopic scale, an effective diffusion coefficient (Dξ) is introduced via homogenization. This coefficient depends on the local diffusion exponent ν(w) = 2/α = 2λ(w)/ds (so that the displacement R(t), rather than the MSD, scales as t^{1/ν}) and a characteristic length scale ξ:
Dξ(w) = ξ^{2−1/ν(w)} = ξ^{2−ds/(2λ(w))}
This Dξ(w) captures the effective mobility of the SGD process, averaged over the complexities of the fractal structure at scale ξ.
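A minimal sketch of the scale-dependent mobility this formula encodes (the function and argument names are mine, and the numerical values are illustrative):

```python
def effective_diffusion(xi, lam, ds):
    """Effective diffusion coefficient D_xi = xi^(2 - d_s/(2*lam)),
    following the homogenized form quoted above."""
    if ds > lam:
        raise ValueError("theory requires d_s <= lambda(w)")
    return xi ** (2.0 - ds / (2.0 * lam))

# At a fixed sub-unit scale xi, lower connectivity (smaller d_s) relative to
# lambda gives a larger exponent and hence a *smaller* D_xi: reduced mobility.
xi = 0.1
print(effective_diffusion(xi, lam=4.0, ds=1.0))  # tortuous region: small D_xi
print(effective_diffusion(xi, lam=4.0, ds=4.0))  # well-connected region: larger D_xi
```

This is the quantitative sense in which "hard to reach" regions carry a low effective temperature in the steady-state analysis that follows.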
The "Almost Bayesian" Nature of SGD
The connection to Bayesian inference arises from examining the steady-state solution of the FFPE. Assuming the effective diffusion coefficient Dξ(w) is approximately constant within a local region W, the FFPE admits a steady-state distribution (∂p/∂t=0) given by (Lemma 3.1):
ps(w) ∝ exp(−γ Lm[w] / Dξ(w))
This distribution resembles a Boltzmann distribution with an effective temperature proportional to Dξ(w)/γ.
Comparing this to the Bayesian posterior distribution p(w∣Xm) ∝ ρ(w) p(Xm∣w), where ρ(w) is the prior and p(Xm∣w) ∝ e^{−m Lm[w]} (assuming Lm is the normalized negative log-likelihood), we see a resemblance (Corollary 3.1). The standard Bayesian posterior corresponds to a Gibbs distribution e^{−mβ Lm[w]} (ignoring the prior), where β is an inverse temperature. If we relate β to a diffusion constant D via the Einstein relation (D ∝ 1/β), the standard Bayesian approach implicitly assumes a constant effective temperature or diffusion coefficient across the parameter space.
In contrast, the SGD steady-state distribution involves Dξ(w), which depends explicitly on the local fractal geometry (λ(w)) and connectivity (ds). Therefore, SGD is "Almost Bayesian" in the sense that its stationary distribution is analogous to a Bayesian posterior, but it is modified by a state-dependent effective temperature/diffusion coefficient Dξ(w). This modification incorporates dynamical accessibility constraints: regions of the loss landscape that are energetically favorable (low Lm[w]) but difficult to reach due to complex fractal structure (low ds relative to λ(w), leading to small Dξ(w)) will have lower probability mass under the SGD steady state compared to the pure Bayesian posterior. SGD's exploration is thus biased towards regions that are not only low-loss but also dynamically accessible within the fractal terrain.
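This distinction can be made concrete on a toy landscape. The sketch below is entirely illustrative (the landscape, the Dξ profile, and all constants are assumptions, not from the paper): it compares the constant-temperature Gibbs weight with the state-dependent steady state. The deeper left basin wins under pure Bayes but is suppressed once its low accessibility (small Dξ) is accounted for.

```python
import numpy as np

# Toy 1D landscape: two quadratic basins with equal curvature. The left
# minimum has lower loss, but we assign that region a small effective
# diffusion coefficient (hard to reach through the fractal terrain).
w = np.linspace(-4, 4, 4001)
L = np.minimum((w + 1) ** 2 + 0.05, (w - 1) ** 2 + 0.15)
D = np.where(w < 0, 0.2, 1.0)    # D_xi(w): illustrative step profile
gamma = 1.0

bayes = np.exp(-L)               # constant-temperature Gibbs posterior
sgd = np.exp(-gamma * L / D)     # state-dependent effective temperature
bayes /= bayes.sum()
sgd /= sgd.sum()

left = w < 0
bayes_left = bayes[left].sum()
sgd_left = sgd[left].sum()
print(f"P(left basin) Bayes: {bayes_left:.2f}")  # deeper basin wins (~0.52)
print(f"P(left basin) SGD:   {sgd_left:.2f}")    # suppressed by small D_xi (~0.29)
```

The small Dξ both lowers the weight of the left minimum and narrows its basin, shifting probability mass toward the shallower but more accessible region.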
Experimental Validation and Numerical Results
The paper provides numerical experiments using fully connected networks on MNIST to support its theoretical claims:
- Anomalous Diffusion: Measurements of weight displacement R(t) confirm power-law scaling R(t) ∼ t^{1/(2ν)} with ν ≥ 1, consistent with subdiffusion (Fig 1). This supports the use of the FFPE framework.
- Dimensional Constraint: Estimates of the spectral dimension ds (derived from the diffusion exponent) and the LLC λ(w) (estimated using methods from SLT) are computed during training. The results consistently show ds≤λ(w) and ds≤λˉ(w) (averaged LLC), validating the theoretical inequality derived from fractal diffusion principles (Fig 2, Fig 3). This is a strong numerical result supporting the fractal diffusion model.
- Diffusion Exponent Behavior: Histograms of the estimated diffusion exponent ν(w) show that it tends to concentrate toward the low end of its admissible range across different runs (Fig 4). Since lower ν corresponds to relatively faster diffusion (closer to normal diffusion), this suggests SGD tends to find and explore regions of the landscape that are more dynamically accessible within the fractal structure, aligning with the properties of the derived steady-state distribution ps(w).
- Correlation with Generalization: Both the estimated LLC λ(w) and the spectral dimension ds show correlation with the generalization error of the final model (Fig 5). This links the geometric (λ) and dynamic (ds) fractal properties identified by the theory to model performance, suggesting their relevance in understanding generalization.
Practical Implications and Implementation Considerations
While highly theoretical, this work offers several practical implications:
- Interpreting SGD: It provides a physics-based framework for understanding why SGD finds certain types of solutions. The concept of accessibility (Dξ) explains why SGD might prefer wider, more accessible minima over potentially deeper but geometrically complex ones, complementing energy-based arguments.
- Hyperparameter Tuning: The learning rate, batch size, and momentum affect the effective diffusion and noise scale in SGD. This theory suggests their impact could be analyzed in terms of how they modify the exploration dynamics (ds) and interaction with the fractal geometry (λ(w)) characterized by Dξ(w).
- Algorithm Design: The insights could potentially inspire new optimization algorithms. For instance, methods could be designed to adaptively control the effective temperature or diffusion based on estimates of local λ(w) and ds to enhance exploration or convergence in specific landscape regions.
- Training Diagnostics: Monitoring estimates of λ(w) and ds during training could serve as diagnostic tools.
- Estimating λ(w) typically requires techniques like analyzing the loss Hessian eigenspectrum near convergence or using tools like WBIC. This remains computationally intensive.
- Estimating ds involves tracking the MSD of weights over time windows: ⟨‖w(t+Δt)−w(t)‖²⟩ ∝ (Δt)^{ds/λ(w)}. This requires storing weight trajectories and performing statistical analysis, adding overhead. The accuracy depends on the choice of time window Δt and sufficient statistics.
- Limitations: The theory relies on several assumptions, such as the applicability of the FFPE, the identification df≈λ(w), the near-stability hypothesis, and local constancy of Dξ for the steady-state analysis. The experimental validation is currently limited to specific architectures and datasets. The computational cost of estimating λ(w) and ds hinders their routine use.
Conclusion
The paper "Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent" presents a compelling theoretical framework that unifies SGD optimization dynamics, the geometric complexity described by SLT (λ(w)), and Bayesian inference. By modeling SGD as diffusion on a fractal landscape governed by an FFPE, it explains the observed anomalous diffusion of weights and interprets SGD as a modified Bayesian sampler whose exploration is constrained by the dynamic accessibility (ds, Dξ) of the fractal loss landscape. This perspective offers a deeper understanding of how SGD navigates complex energy surfaces and selects solutions, supported by numerical experiments validating key theoretical predictions like the relationship ds≤λ(w).