
Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent (2503.22478v1)

Published 28 Mar 2025 in cs.LG, cs.AI, and math.OC

Abstract: We show that the behavior of stochastic gradient descent is related to Bayesian statistics by showing that SGD is effectively diffusion on a fractal landscape, where the fractal dimension can be accounted for in a purely Bayesian way. By doing this we show that SGD can be regarded as a modified Bayesian sampler which accounts for accessibility constraints induced by the fractal structure of the loss landscape. We verify our results experimentally by examining the diffusion of weights during training. These results offer insight into the factors which determine the learning process, and seemingly answer the question of how SGD and purely Bayesian sampling are related.

Summary

  • The paper proposes modeling Stochastic Gradient Descent (SGD) dynamics as fractional diffusion on fractal neural network loss landscapes.
  • The paper links loss landscape geometry, via the local learning coefficient $\lambda(w)$, and dynamic accessibility, via the spectral dimension $d_s$, providing numerical evidence that $d_s \le \lambda(w)$ holds along SGD paths.
  • This theory suggests SGD is "Almost Bayesian," sampling based on loss and dynamic accessibility determined by the fractal landscape geometry.

The paper "Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent" (2503.22478) investigates the relationship between the dynamics of Stochastic Gradient Descent (SGD) and Bayesian inference within the framework of Singular Learning Theory (SLT). It proposes that SGD behaves as a diffusion process on a fractal loss landscape, governed by a Fractional Fokker-Planck Equation (FFPE), and that this perspective reconciles SGD dynamics with Bayesian principles by incorporating accessibility constraints imposed by the landscape's geometry.

Introduction and Singular Learning Theory Context

Understanding the optimization dynamics and generalization properties of deep neural networks remains a central challenge. Classical statistical methods often falter due to the singular nature of deep learning loss landscapes, characterized by degenerate minima and non-identifiable parameters. Singular Learning Theory (SLT) provides tools to analyze such landscapes, notably through the local learning coefficient (LLC, $\lambda(w)$). The LLC quantifies the geometric complexity around a parameter $w$, describing how the volume of parameter space with loss below a given threshold scales. Specifically, for small $\epsilon > 0$, the volume $V(B(w^*, \epsilon))$ of parameters within a ball $B(w^*)$ around a minimum $w^*$ such that $L(w) \le L(w^*) + \epsilon$ scales as $V(B(w^*, \epsilon)) \propto \epsilon^{\lambda(w^*)}$. While $\lambda(w)$ offers geometric insight from a Bayesian viewpoint, its direct connection to the dynamics induced by SGD has been less clear. This work aims to establish that connection by modeling SGD as a diffusion process influenced by the fractal geometry captured by $\lambda(w)$.

SGD Dynamics as Fractional Diffusion on a Fractal Landscape

The core argument posits that the standard Fokker-Planck Equation (FPE), which describes normal Brownian motion, is insufficient to model SGD dynamics in neural networks. Empirical evidence suggests that the mean squared displacement (MSD) of weights during SGD training exhibits anomalous subdiffusion, scaling as $\langle ||w(t) - w(0)||^2 \rangle \propto t^\alpha$ with $\alpha < 1$, rather than the linear scaling ($\alpha = 1$) characteristic of normal diffusion.
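As a concrete illustration, this exponent can be estimated from weight snapshots saved during training. The sketch below (a minimal single-trajectory estimate; `snapshots`, the fixed recording interval, and all names are hypothetical, not the paper's code) fits $\alpha$ as the slope of log-MSD against log-time.

```python
import numpy as np

def msd_exponent(snapshots, dt=1.0):
    """Estimate the anomalous diffusion exponent alpha from weight snapshots.

    snapshots: array of shape (T, n_params) holding flattened weights w(t)
    recorded at equal step intervals. Returns alpha such that
    <||w(t) - w(0)||^2> ~ t^alpha; alpha < 1 indicates subdiffusion.
    """
    w = np.asarray(snapshots)
    # Squared displacement from the initial point at each lag t
    # (summing over coordinates averages over weight dimensions).
    msd = np.sum((w[1:] - w[0]) ** 2, axis=1)
    t = dt * np.arange(1, len(w))
    # alpha is the slope of log MSD versus log t.
    alpha, _ = np.polyfit(np.log(t), np.log(msd), deg=1)
    return alpha
```

On a fractal landscape the fit should yield $\alpha < 1$; a value near 1 would instead indicate ordinary Brownian diffusion. A single trajectory gives a noisy estimate, so averaging over several runs is advisable.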

To capture this subdiffusive behavior, the paper proposes modeling SGD dynamics using a Fractional Fokker-Planck Equation (FFPE):

$$\frac{\partial p(w, t)}{\partial t} = \mathcal{D}_t^{1-\alpha} \left[ \nabla \cdot (\nabla \mathcal{L}_m[w]\, p(w, t)) + \frac{D_0}{\gamma} \Delta p(w, t) \right]$$

Here, $p(w,t)$ is the probability density of the weights $w$ at time $t$, $\mathcal{L}_m[w]$ is the empirical loss, $D_0$ is a bare diffusion constant, $\gamma$ is related to friction, and $\mathcal{D}_t^{1-\alpha}$ is the Caputo fractional derivative of order $1-\alpha$ ($0 < \alpha < 1$). The fractional derivative introduces memory effects, characteristic of processes in complex or disordered media, such as diffusion on fractals. The potential driving the diffusion is the empirical loss $\mathcal{L}_m[w]$.
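For reference (a standard definition, not specific to this paper), the Caputo fractional derivative of order $1-\alpha$ acting on a function $f$ can be written as the memory integral

$$\mathcal{D}_t^{1-\alpha} f(t) = \frac{1}{\Gamma(\alpha)} \int_0^t \frac{f'(\tau)}{(t-\tau)^{1-\alpha}}\, d\tau,$$

so the evolution of $p(w,t)$ at time $t$ is weighted by its entire history rather than its current state alone, which is the formal source of the memory effects mentioned above.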

The justification for using the FFPE stems from interpreting the loss landscape explored by SGD as having a fractal structure. The paper establishes a link between the LLC from SLT and the mass fractal dimension ($d_f$). The definition of $\lambda(w^*)$ via volume scaling $V \propto \epsilon^{\lambda(w^*)}$ is mathematically analogous to the definition of the mass dimension $d_f$, where the "mass" (or accessible volume) within a region of scale $\epsilon$ scales as $M(\epsilon) \propto \epsilon^{d_f}$. The paper therefore identifies the LLC as the local mass fractal dimension of the loss landscape, particularly near metastable states:

$$d_f(w) \approx \lambda(w)$$

This identification is supported by the "Near Stability Hypothesis," which suggests that SGD spends significant time exploring regions near such stable points where the SLT characterization is applicable.
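To make the volume-scaling definition concrete, here is a minimal Monte Carlo sketch of how the exponent could be probed numerically (this is not the paper's estimator; `loss_fn`, `w_star`, and the sampling radius are hypothetical): sample perturbations around a minimum, measure the fraction of samples with loss within $\epsilon$ of $L(w^*)$ for several $\epsilon$, and fit the log-log slope.

```python
import numpy as np

def volume_scaling_exponent(loss_fn, w_star, radius=0.1,
                            eps_grid=(1e-3, 3e-3, 1e-2, 3e-2),
                            n_samples=10_000, seed=0):
    """Crude Monte Carlo probe of V(eps) ~ eps^lambda near a minimum w_star."""
    rng = np.random.default_rng(seed)
    d = w_star.size
    base = loss_fn(w_star)
    # Uniform samples in a ball: random directions scaled by U^(1/d).
    dirs = rng.standard_normal((n_samples, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = radius * rng.random(n_samples) ** (1.0 / d)
    points = w_star + radii[:, None] * dirs
    losses = np.array([loss_fn(p) for p in points])
    # Fraction of the ball with loss below base + eps, for each eps.
    fractions = np.array([np.mean(losses <= base + eps) for eps in eps_grid])
    fractions = np.clip(fractions, 1e-12, None)  # guard against log(0)
    lam, _ = np.polyfit(np.log(eps_grid), np.log(fractions), deg=1)
    return lam
```

In realistic network dimensions this direct estimator is statistically hopeless (the fractions underflow), which is why SLT work relies on specialized LLC estimators; the sketch only illustrates the scaling definition itself.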

Connecting Fractal Geometry to SGD Dynamics

The dynamics of diffusion on a fractal are governed not only by its mass dimension ($d_f$) but also by its spectral dimension ($d_s$). The spectral dimension characterizes the connectivity and recurrence properties of a random walk on the fractal, measuring how the number of distinct sites visited scales with time $t$, typically as $t^{d_s/2}$. A lower $d_s$ implies slower exploration and an increased likelihood of revisiting sites, indicative of trapping or tortuous paths.

The interplay between $d_f \approx \lambda(w)$ and $d_s$ determines the anomalous diffusion exponent observed in the MSD of weights. The displacement $R(t) = \sqrt{\langle ||w(t) - w(0)||^2 \rangle}$ scales with the walk dimension ($d_{walk}$) as $R(t) \sim t^{1/d_{walk}}$. For diffusion on fractals, it is known that $d_{walk} = 2 d_f / d_s$. Substituting $d_f \approx \lambda(w)$ gives:

$$R(t) \sim t^{\frac{d_s}{2\lambda(w)}}$$

This directly links the observed subdiffusive MSD exponent $\alpha = d_s / \lambda(w)$ to the geometric property $\lambda(w)$ and the dynamic property $d_s$ of the fractal landscape.
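As a worked example of these relations (with illustrative numbers, not values reported in the paper), taking $\lambda(w) = 2$ and $d_s = 1$ gives

$$d_{walk} = \frac{2\lambda(w)}{d_s} = 4, \qquad R(t) \sim t^{1/4}, \qquad \langle ||w(t) - w(0)||^2 \rangle \sim t^{1/2},$$

i.e. a strongly subdiffusive walk with $\alpha = d_s/\lambda(w) = 0.5$.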

Furthermore, the paper derives a theoretical constraint relating these dimensions in the long-time limit (Lemma 3.2):

$$d_s \leq \lambda(w(t))$$

and correspondingly for the time-averaged LLC, $\bar{\lambda}(w(t))$ (Corollary 3.2). This inequality implies that the rate of exploration, captured by $d_s$, is fundamentally limited by the local volume or density of states, captured by $\lambda(w)$. Intuitively, the diffusion process cannot explore the space faster than the underlying fractal structure permits.

To facilitate analysis at a macroscopic scale, an effective diffusion coefficient ($D_\xi$) is introduced via homogenization. This coefficient depends on the local diffusion exponent $\nu(w) = 1/\alpha = \lambda(w)/d_s$ (note: the paper uses $\nu \ge 2$ for the displacement scaling $R(t) \sim t^{1/\nu}$, so $\nu = 2\lambda(w)/d_s$ here) and a characteristic length scale $\xi$:

$$D_\xi(w) = \xi^{2 - 1/\nu(w)} = \xi^{2 - d_s / (2\lambda(w))}$$

This $D_\xi(w)$ captures the effective mobility of the SGD process, averaged over the complexities of the fractal structure at scale $\xi$.
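A direct numerical reading of the formula, under the $\nu = 2\lambda(w)/d_s$ convention above (the values are illustrative, not from the paper):

```python
def effective_diffusion(xi, lam, d_s):
    """Effective diffusion coefficient D_xi = xi^(2 - d_s / (2 * lam))."""
    return xi ** (2.0 - d_s / (2.0 * lam))

# For xi < 1, a smaller ratio d_s / (2 * lam) yields a smaller D_xi:
# geometrically complex regions (large lam relative to d_s) are harder
# to traverse than well-connected ones.
print(effective_diffusion(0.1, lam=2.0, d_s=1.0))  # ~0.018
print(effective_diffusion(0.1, lam=2.0, d_s=2.0))  # ~0.032
```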

The "Almost Bayesian" Nature of SGD

The connection to Bayesian inference arises from examining the steady-state solution of the FFPE. Assuming the effective diffusion coefficient $D_\xi(w)$ is approximately constant within a local region $\mathcal{W}$, the FFPE admits a steady-state distribution ($\partial p / \partial t = 0$) given by (Lemma 3.1):

$$p_{s}(w) \propto \exp\left( -\frac{\gamma \mathcal{L}_m[w]}{D_\xi(w)} \right)$$

This distribution resembles a Boltzmann distribution with an effective temperature proportional to $D_\xi(w)/\gamma$.

Comparing this to the Bayesian posterior $p(w|X_m) \propto \rho(w)\, p(X_m|w)$, where $\rho(w)$ is the prior and $p(X_m|w) \propto e^{-m\mathcal{L}_m[w]}$ (assuming $\mathcal{L}_m$ is the normalized negative log-likelihood), a resemblance emerges (Corollary 3.1). The standard Bayesian posterior corresponds to a Gibbs distribution $e^{-m\beta \mathcal{L}_m[w]}$ (ignoring the prior), where $\beta$ is an inverse temperature. If we relate $\beta$ to a diffusion constant $D$ via the Einstein relation ($D \propto 1/\beta$), the standard Bayesian approach implicitly assumes a constant effective temperature or diffusion coefficient across the parameter space.

In contrast, the SGD steady-state distribution involves $D_\xi(w)$, which depends explicitly on the local fractal geometry ($\lambda(w)$) and connectivity ($d_s$). SGD is therefore "Almost Bayesian" in the sense that its stationary distribution is analogous to a Bayesian posterior, but modified by a state-dependent effective temperature/diffusion coefficient $D_\xi(w)$. This modification incorporates dynamical accessibility constraints: regions of the loss landscape that are energetically favorable (low $\mathcal{L}_m[w]$) but difficult to reach due to complex fractal structure (low $d_s$ relative to $\lambda(w)$, leading to small $D_\xi(w)$) receive less probability mass under the SGD steady state than under the pure Bayesian posterior. SGD's exploration is thus biased towards regions that are not only low-loss but also dynamically accessible within the fractal terrain.
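The accessibility bias can be made concrete on a toy one-dimensional landscape (everything here is illustrative, not from the paper): two minima of equal depth, one placed in a region with a hypothetically smaller $D_\xi$, receive different mass under $p_s$ even though a constant-temperature Gibbs distribution weights them equally.

```python
import numpy as np

# Toy 1D loss with two equally deep minima at w = -1 and w = +1.
w = np.linspace(-3.0, 3.0, 2001)
dw = w[1] - w[0]
loss = (w ** 2 - 1.0) ** 2

# Hypothetical state-dependent effective diffusion: the right-hand
# basin is assumed to sit in a less accessible (smaller D_xi) region.
D_xi = np.where(w > 0, 0.05, 0.10)
gamma = 1.0

# SGD steady state p_s(w) ~ exp(-gamma * L(w) / D_xi(w)), normalized.
p_sgd = np.exp(-gamma * loss / D_xi)
p_sgd /= p_sgd.sum() * dw

# Constant-temperature Gibbs/Bayesian reference at the left-basin
# temperature, which by symmetry puts mass 0.5 in each basin.
p_gibbs = np.exp(-(gamma / 0.10) * loss)
p_gibbs /= p_gibbs.sum() * dw

right = w > 0
print("right-basin mass, SGD steady state:", p_sgd[right].sum() * dw)
print("right-basin mass, Gibbs reference :", p_gibbs[right].sum() * dw)
```

The Gibbs reference assigns the right basin exactly half the mass, while the SGD steady state assigns it noticeably less (roughly 0.41 here): the less accessible basin is penalized despite having the same loss.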

Experimental Validation and Numerical Results

The paper provides numerical experiments using fully connected networks on MNIST to support its theoretical claims:

  1. Anomalous Diffusion: Measurements of weight displacement $R(t)$ confirm power-law scaling $R(t) \sim t^{1/(2\nu)}$ with $\nu \ge 1$, consistent with subdiffusion (Fig 1). This supports the use of the FFPE framework.
  2. Dimensional Constraint: Estimates of the spectral dimension $d_s$ (derived from the diffusion exponent) and the LLC $\lambda(w)$ (estimated using methods from SLT) are computed during training. The results consistently show $d_s \leq \lambda(w)$ and $d_s \leq \bar{\lambda}(w)$ (time-averaged LLC), validating the theoretical inequality derived from fractal diffusion principles (Fig 2, Fig 3). This is a strong numerical result supporting the fractal diffusion model.
  3. Diffusion Exponent Behavior: Histograms of the estimated diffusion exponent $\nu(w)$ show that it tends to concentrate towards higher values across different runs (Fig 4). Since higher $\nu$ corresponds to relatively faster diffusion (closer to normal diffusion), this suggests SGD tends to find and explore regions of the landscape that are more dynamically accessible within the fractal structure, aligning with the properties of the derived steady-state distribution $p_s(w)$.
  4. Correlation with Generalization: Both the estimated LLC $\lambda(w)$ and the spectral dimension $d_s$ correlate with the generalization error of the final model (Fig 5). This links the geometric ($\lambda$) and dynamic ($d_s$) fractal properties identified by the theory to model performance, suggesting their relevance for understanding generalization.

Practical Implications and Implementation Considerations

While highly theoretical, this work offers several practical implications:

  1. Interpreting SGD: It provides a physics-based framework for understanding why SGD finds certain types of solutions. The concept of accessibility ($D_\xi$) explains why SGD might prefer wider, more accessible minima over potentially deeper but geometrically complex ones, complementing energy-based arguments.
  2. Hyperparameter Tuning: The learning rate, batch size, and momentum affect the effective diffusion and noise scale in SGD. This theory suggests their impact could be analyzed in terms of how they modify the exploration dynamics ($d_s$) and the interaction with the fractal geometry ($\lambda(w)$) characterized by $D_\xi(w)$.
  3. Algorithm Design: The insights could inspire new optimization algorithms. For instance, methods could adaptively control the effective temperature or diffusion based on estimates of local $\lambda(w)$ and $d_s$ to enhance exploration or convergence in specific landscape regions.
  4. Training Diagnostics: Monitoring estimates of $\lambda(w)$ and $d_s$ during training could serve as diagnostic tools.
    • Estimating $\lambda(w)$ typically requires techniques like analyzing the loss Hessian eigenspectrum near convergence or using tools like WBIC. This remains computationally intensive.
    • Estimating $d_s$ involves tracking the MSD of weights over time windows: $\langle ||w(t+\Delta t) - w(t)||^2 \rangle \propto (\Delta t)^{d_s/\lambda(w)}$. This requires storing weight trajectories and performing statistical analysis, adding overhead; a minimal sketch of such a rolling estimate appears after this list. The accuracy depends on the choice of time window $\Delta t$ and sufficient statistics.
  5. Limitations: The theory relies on several assumptions, such as the applicability of the FFPE, the identification $d_f \approx \lambda(w)$, the near-stability hypothesis, and local constancy of $D_\xi$ for the steady-state analysis. The experimental validation is currently limited to specific architectures and datasets. The computational cost of estimating $\lambda(w)$ and $d_s$ hinders their routine use.
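A sketch of the diagnostic described in item 4 (names and window choices are hypothetical; as a single-trajectory estimate it is noisy and only indicative):

```python
import numpy as np

def rolling_diffusion_exponent(snapshots, window=50, dt=1.0):
    """Rolling estimate of the local MSD exponent from weight snapshots.

    For each window of consecutive snapshots w(t), fits the slope of
    log ||w(t0 + dt') - w(t0)||^2 versus log dt'; under the paper's
    model this slope estimates d_s / lambda(w) in that window.
    """
    w = np.asarray(snapshots)
    exponents = []
    for start in range(len(w) - window):
        block = w[start:start + window]
        # Squared displacement from the window's first snapshot.
        disp = np.sum((block[1:] - block[0]) ** 2, axis=1)
        lags = dt * np.arange(1, window)
        slope, _ = np.polyfit(np.log(lags), np.log(disp), deg=1)
        exponents.append(slope)
    return np.array(exponents)
```

Plotting this series against training time would show whether the run drifts toward more accessible (larger exponent) regions, as the theory's steady-state analysis suggests.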

Conclusion

The paper "Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent" presents a compelling theoretical framework that unifies SGD optimization dynamics, the geometric complexity described by SLT ($\lambda(w)$), and Bayesian inference. By modeling SGD as diffusion on a fractal landscape governed by an FFPE, it explains the observed anomalous diffusion of weights and interprets SGD as a modified Bayesian sampler whose exploration is constrained by the dynamic accessibility ($d_s$, $D_\xi$) of the fractal loss landscape. This perspective offers a deeper understanding of how SGD navigates complex energy surfaces and selects solutions, supported by numerical experiments validating key theoretical predictions such as $d_s \le \lambda(w)$.
