Analytical Theory of Power Law Spectral Bias in Diffusion Learning Dynamics
Overview
The paper "An Analytical Theory of Power Law Spectral Bias in the Learning Dynamics of Diffusion Models" by Binxu Wang offers a comprehensive analytical framework to understand the learning dynamics within diffusion models. The study is grounded on the examination of gradient-flow dynamics, particularly in linear denoiser settings, and uncovers how learning unfolds over the spectrum of the data covariance. The findings encompass both theoretical derivations and empirical validations across both Gaussian and natural image datasets.
Main Contributions
The paper's primary contribution is the identification of a pronounced power-law spectral bias in diffusion models. The analysis, based on a simplified linear denoiser setup, shows that eigenmodes of the data covariance with larger variance converge faster, with emergence times scaling as an inverse power law in the mode variance (illustrated in the sketch below).
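To see where the inverse law comes from, consider the one-layer linear case at a single noise level: gradient flow decouples in the eigenbasis of the data covariance, and the weight on mode k relaxes at a rate proportional to lambda_k + sigma^2. The following is a minimal sketch under these simplifying assumptions; the spectrum exponent and all numerical values are illustrative, not taken from the paper:

```python
import numpy as np

# Assumed power-law data spectrum lambda_k = k^{-alpha} (values illustrative).
alpha = 1.5
lams = np.arange(1, 51, dtype=float) ** -alpha
sigma2 = 1e-4  # noise variance at this (small) noise level

# In the covariance eigenbasis the modes decouple: the weight on mode k obeys
#   dw_k/dt = -((lam_k + sigma2) * w_k - lam_k),
# so w_k(t) = w*_k * (1 - exp(-(lam_k + sigma2) * t)),  w*_k = lam_k / (lam_k + sigma2).
rate = lams + sigma2
t90 = -np.log(0.1) / rate  # time for each mode to reach 90% of its optimum

# Fitting log t90 against log lambda recovers the inverse power law (slope ~ -1).
slope, _ = np.polyfit(np.log(lams), np.log(t90), 1)
print(f"fitted exponent: {slope:.3f}")
```

With a spectrum lambda_k proportional to k^(-alpha), the inverse law tau_k ~ 1/lambda_k becomes an emergence schedule tau_k ~ k^alpha: each successively finer mode takes polynomially longer to appear.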
- Analytical Solutions for Gradient Flow: Exploiting a Gaussian equivalence principle, the paper derives exact solutions for the gradient-flow dynamics of one-layer and two-layer linear denoisers. The solutions describe convergence mode by mode across the data covariance, making explicit that higher-variance eigenmodes are learned faster (a numerical check appears in the first sketch after this list).
- Practical Implications: By deriving the generated distribution in closed form and tracking its KL divergence to the target throughout training, the results explain why stopping training too early fails to capture fine detail: the low-variance modes are still far from converged. This helps account for artifacts, such as unnatural fine features, in images generated by undertrained models (a toy calculation appears in the second sketch after this list).
- Empirical Validation: Experiments confirm the predicted spectral bias on both synthetic Gaussian data and real image datasets such as MNIST: convergence time versus mode variance follows a power law even in more complex settings with deeper or convolutional architectures.
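As a first sketch, the decoupled-mode solution above can be checked against plain gradient descent. This is a minimal numerical experiment assuming a single noise level and a diagonal denoiser acting in the covariance eigenbasis; it is not the paper's full multi-noise-level training setup, and all parameter values are illustrative:

```python
import numpy as np

lams = np.array([1.0, 0.1, 0.01])  # assumed covariance eigenvalues
sigma2 = 1e-3                      # noise variance at this level
eta, steps = 0.01, 5000            # step size and number of gradient steps

# Population denoising loss per mode for a diagonal linear denoiser w:
#   L_k(w_k) = (lam_k + sigma2) * w_k**2 - 2 * lam_k * w_k + lam_k,
# so dL/dw_k = 2 * ((lam_k + sigma2) * w_k - lam_k).
w = np.zeros_like(lams)
traj = []
for _ in range(steps):
    w -= eta * 2.0 * ((lams + sigma2) * w - lams)
    traj.append(w.copy())
traj = np.array(traj)

# Closed-form gradient-flow solution, with flow time approximated by eta * step:
w_star = lams / (lams + sigma2)
times = eta * np.arange(1, steps + 1)
pred = w_star * (1.0 - np.exp(-2.0 * (lams + sigma2) * times[:, None]))
print("max deviation from closed form:", float(np.abs(traj - pred).max()))
print("high-variance mode converges first:", traj[200].round(3))
```

At any fixed step count, the high-variance mode has essentially converged while the low-variance mode has barely moved, which is the spectral bias in miniature.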
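As a second sketch, the cost of early stopping can be made concrete for a Gaussian target, where the KL divergence splits across modes. The relaxation model below, in which the generated variance decays from the prior toward the data variance at a rate set by the mode's own variance, is a hypothetical stand-in for the paper's exact solution, chosen only to reproduce the qualitative spectral bias:

```python
import numpy as np

# Toy relaxation model (hypothetical): generated variance of mode k moves from
# the prior variance 1 toward the data variance lam_k at rate lam_k.
lams = np.arange(1, 21, dtype=float) ** -1.5  # assumed power-law data spectrum

def gen_var(t):
    return lams + (1.0 - lams) * np.exp(-lams * t)

def kl_per_mode(t):
    r = gen_var(t) / lams               # variance ratio of zero-mean Gaussians
    return 0.5 * (r - np.log(r) - 1.0)  # KL(N(0, r*lam) || N(0, lam)) per mode

for t in (10.0, 100.0, 1000.0):
    kl = kl_per_mode(t)
    print(f"t={t:7.1f}  total KL={kl.sum():9.3f}  "
          f"share from 10 lowest-variance modes={kl[-10:].sum() / kl.sum():.2f}")
```

Early in training the residual KL is concentrated almost entirely in the slow, low-variance modes, matching the observation that undertrained models miss fine detail.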
Theoretical Implications
Theoretically, the paper extends our understanding of spectral bias into the domain of diffusion models. It aligns with the broader literature on spectral bias in kernel methods and overparameterized neural networks, here adapted to the denoising objective and the stochastic nature of diffusion training dynamics.
Future Developments and Applications in AI
The insights from this paper suggest directions for accelerating convergence in large diffusion models, notably preconditioning techniques that amplify low-variance modes (a whitening sketch follows below). Nonlinear whitening methods and alternative architectural designs could further mitigate the spectral bias and improve the generation of fine detail. More broadly, the work suggests that the spectral characteristics of a dataset can inform model design and training protocols in both research and practical deployment.
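One simple form such preconditioning could take is PCA whitening: rescale the data so every covariance mode has unit variance, train the diffusion model in the whitened space, and map generated samples back through the inverse transform. This is a hedged sketch of the general idea, not a method specified in the paper; the function names and the epsilon regularizer are illustrative:

```python
import numpy as np

def fit_whitener(X, eps=1e-6):
    """PCA whitening: rotate into the covariance eigenbasis and rescale every
    mode to unit variance, so no mode is learned late merely because its
    variance is small. Returns the forward and inverse transforms."""
    mu = X.mean(axis=0)
    lam, U = np.linalg.eigh(np.cov(X - mu, rowvar=False))
    scale = 1.0 / np.sqrt(lam + eps)  # eps guards near-zero eigenvalues

    def whiten(Z):                    # train the diffusion model on whiten(X)
        return (Z - mu) @ U * scale

    def unwhiten(Z):                  # map generated samples back to data space
        return (Z / scale) @ U.T + mu

    return whiten, unwhiten

# Usage sketch on toy data with strongly unequal mode variances:
X = np.random.default_rng(0).normal(size=(1000, 8)) * np.arange(1, 9)
whiten, unwhiten = fit_whitener(X)
print(np.cov(whiten(X), rowvar=False).diagonal().round(2))  # ~ all ones
```

Whitening equalizes the per-mode learning rates implied by the analysis above; the trade-off is that it also amplifies noise in directions where the data carries little signal, which is why the eps floor (and nonlinear variants) matters in practice.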
Conclusion
Ultimately, this paper provides a compelling analytical framework explaining the emergent dynamics of mode convergence in diffusion models. By identifying a power-law spectral bias, it opens new avenues for refining how these models are trained and offers significant theoretical contributions to our understanding of learning dynamics in complex generative models.