Generalization through variance: how noise shapes inductive biases in diffusion models
(2504.12532v1)
Published 16 Apr 2025 in cs.LG, cond-mat.dis-nn, and cs.AI
Abstract: How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train diffusion models is the score function of the training distribution; and the networks usually used to learn the score function are expressive enough to learn this score to high accuracy. We claim that a certain feature of the DSM objective -- the fact that its target is not the training distribution's score, but a noisy quantity only equal to it in expectation -- strongly impacts whether and to what extent diffusion models generalize. In this paper, we develop a mathematical theory that partly explains this 'generalization through variance' phenomenon. Our theoretical analysis exploits a physics-inspired path integral approach to compute the distributions typically learned by a few paradigmatic under- and overparameterized diffusion models. We find that the distributions diffusion models effectively learn to sample from resemble their training distributions, but with 'gaps' filled in, and that this inductive bias is due to the covariance structure of the noisy target used during training. We also characterize how this inductive bias interacts with feature-related inductive biases.
This paper, "Generalization through variance: how noise shapes inductive biases in diffusion models" (Vastola, 16 Apr 2025), investigates the surprising ability of diffusion models to generalize beyond their training data, despite the fact that the standard Denoising Score Matching (DSM) objective's theoretical optimum is the score function of the training distribution itself, and the networks used are highly expressive. The authors propose that a key factor is the inherent variance in the target used by the DSM objective, which they call the "proxy score". This variance, particularly high in "boundary regions" between training examples, acts as a powerful inductive bias, enabling generalization.
The paper identifies six factors that collectively influence how diffusion models generalize:
Noisy objective: The proxy score target in DSM is a noisy estimate of the true score, with variance that is non-uniform across state space and high at low noise levels (a toy numerical illustration of this follows the list).
Forward process: The structure of the forward diffusion process affects the covariance structure of the proxy score.
Nonlinear score-dependence: The reverse process samples from a distribution that depends nonlinearly on the learned score function.
Model capacity: The relationship between model capacity and the number of training samples influences generalization vs. memorization.
Model features: Feature-related inductive biases from the model architecture interact with the variance from the noisy objective.
Training set structure: The spatial arrangement of training examples (e.g., presence of clusters vs. outliers) affects generalization, especially interpolation.
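As a concrete illustration of the first factor, the NumPy sketch below (a hypothetical 1D toy, not taken from the paper) evaluates the proxy score target and its conditional variance under a variance-exploding forward process x_t = x_0 + σ·ε. The variance is largest in the gap between the two training clusters and grows rapidly as the noise level shrinks.

```python
import numpy as np

# Hypothetical toy training set: two 1D clusters separated by a gap.
x_train = np.array([-1.2, -1.0, -0.8, 0.8, 1.0, 1.2])

def proxy_score_stats(x, sigma, data):
    """Posterior mean (= true score of the noised training distribution) and
    conditional variance of the proxy score (x0 - x)/sigma^2 under the VE
    forward process x_t = x0 + sigma * eps, evaluated at x and noise level sigma."""
    logw = -(x - data) ** 2 / (2 * sigma ** 2)   # log posterior weights p(x0 = data_i | x_t = x)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    targets = (data - x) / sigma ** 2            # proxy score for each candidate x0
    mean = np.sum(w * targets)                   # true score of the noised empirical distribution
    var = np.sum(w * targets ** 2) - mean ** 2   # conditional variance of the proxy score
    return mean, var

xs = np.linspace(-2, 2, 9)
for sigma in (1.0, 0.3, 0.1):
    variances = [proxy_score_stats(x, sigma, x_train)[1] for x in xs]
    i = int(np.argmax(variances))
    print(f"sigma={sigma:.1f}: proxy-score variance is largest at x={xs[i]:+.1f}")
```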
To analyze the "typical" learned distribution resulting from training, the authors introduce a theoretical approach based on a physics-inspired Martin-Siggia-Rose path integral formulation of the Probability Flow Ordinary Differential Equation (PF-ODE) used for sampling. By averaging this path integral representation over possible sample realizations used during training, they show that the "typical" learned distribution can be described by an effective Stochastic Differential Equation (SDE). This effective SDE has a drift term related to the average learned score and a noise term whose covariance is captured by what they term the "V-kernel". The V-kernel measures the ensemble variance of the learned score estimator.
The authors compute the V-kernel for several tractable models:
Naive score estimator: Even a hypothetical "naive" estimator that directly uses the noisy proxy score from sampled data points at each step results in a non-trivial V-kernel. This V-kernel is proportional to the proxy score covariance, indicating that generalization through variance can occur even without complex model-related biases, primarily affecting regions where the proxy score variance is high (i.e., boundary regions).
Expressive linear models: For score estimators that are linear in a set of F features, generalization only occurs if the number of features scales with the number of training samples P, so that the ratio κ=F/P remains finite and nonzero in the large-P limit. The V-kernel in this case is an integral over the proxy score covariance, weighted and modulated by a "feature kernel" derived from the model's feature basis; this shows how feature-related inductive biases interact with the noise-induced bias (a toy numerical analogue appears after this list).
Lazy infinite-width neural networks (NTK regime): Networks in the lazy NTK regime also exhibit a V-kernel that depends on the proxy score covariance and is modulated by the network's spectral features (eigenfunctions of the Neural Tangent Kernel). In the limit of infinite training time, this result converges to the naive estimator's V-kernel scaled by κ, linking this theoretically tractable regime back to the fundamental variance phenomenon.
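As a rough numerical analogue of the linear-model analysis (a hypothetical toy: a single noise level, random Fourier features, and a plain ridge fit, none of which are taken from the paper), one can refit a linear-in-features score model many times on freshly sampled proxy-score targets and look at the variance of the fitted score across runs, an empirical counterpart of the ensemble variance that the V-kernel formalizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 1D training points, one noise level, random Fourier features.
x_train = np.array([-1.0, 1.0])
sigma = 0.5                       # the single noise level considered here
n_feat = 50                       # F: number of features (overparameterized vs. samples)
w = rng.normal(scale=3.0, size=n_feat)
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)

def features(x):
    """Random Fourier features phi_f(x) = cos(w_f * x + b_f)."""
    return np.cos(np.outer(x, w) + b)

def fit_score_once(n_draws_per_point=20, reg=1e-3):
    """One 'training run': draw noisy samples and proxy-score targets,
    then ridge-fit a score model that is linear in the features."""
    x0 = np.repeat(x_train, n_draws_per_point)
    eps = rng.normal(size=x0.shape)
    xt = x0 + sigma * eps                          # VE forward process
    y = (x0 - xt) / sigma ** 2                     # proxy score targets
    Phi = features(xt)
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(n_feat), Phi.T @ y)

# Ensemble of independent training runs -> variance of the learned score across runs.
grid = np.linspace(-2.0, 2.0, 81)
scores = np.stack([features(grid) @ fit_score_once() for _ in range(200)])
var = scores.var(axis=0)
print(f"learned-score variance across runs peaks near x = {grid[np.argmax(var)]:+.2f}")
```

Here the fitted function is pinned down where many consistent targets constrain it and varies across runs elsewhere; how that run-to-run variance is distributed over state space depends on the chosen features, which is the interaction the feature kernel describes.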
The paper discusses the consequences of this "generalization through variance":
Benign properties: The inductive bias is often beneficial. Single training points are not generalized (the V-kernel is zero), generalization tends to happen along the directions of the data manifold, the variance is zero far from the training data, and the average sampling path still follows the deterministic PF-ODE, so modes near the training data remain likely.
Gap-filling: The V-kernel's sensitivity to boundaries suggests that the model effectively fills in gaps between training examples. This effect is influenced by factors such as the time cutoff ϵ and the F/P ratio (model capacity relative to data size). Figure 2 illustrates this gap-filling in 1D, showing how ϵ and F/P affect the resulting distribution; a qualitative toy simulation in the same spirit follows this list.
Feature-noise alignment: The model's features determine how the proxy score covariance is integrated across the state space, leading to different generalization patterns. Figure 3 demonstrates how the same 2D dataset can be generalized into different shapes (square vs. cross) depending on the feature set and data orientation.
Connection to memorization: In a small noise limit (semiclassical approximation), the learned distribution is approximately the noise-corrupted data distribution multiplied by a factor related to the curvature of the "classical action". This factor, influenced by the V-kernel, controls the extent of memorization. The theory suggests that outliers or highly duplicated data points are less likely to be generalized because they degrade the boundary structure important for the proxy score covariance.
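The gap-filling effect can be probed with a small simulation. The sketch below is a qualitative toy under several assumptions not taken from the paper: a variance-exploding process, a two-point 1D training set, and one plausible reading of the "naive" estimator, namely drawing a training point from the posterior at each step and using its proxy score as the drift. It compares reverse sampling with the exact empirical score against this noisy-score sampler and reports how much mass ends up between the training points at the cutoff ϵ; per the paper's analysis, the noisy estimator should leave more mass in the gap, with the size of the effect depending on ϵ and the discretization.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.array([-1.0, 1.0])   # two isolated training points with a gap between them

def posterior_weights(x, sigma):
    """Posterior p(x0 = x_train_i | x_t = x) for the VE process x_t = x0 + sigma * eps."""
    logw = -(x - x_train) ** 2 / (2 * sigma ** 2)
    w = np.exp(logw - logw.max())
    return w / w.sum()

def sample(eps_cutoff=0.2, n_steps=200, sigma_max=2.0, naive=True):
    """Euler-integrate dx = -sigma * s(x, sigma) dsigma from sigma_max down to eps_cutoff.
    naive=True: s is the noisy proxy score (x0 - x)/sigma^2 for an x0 freshly drawn from
    the posterior. naive=False: s is the exact score of the noised training distribution."""
    sigmas = np.geomspace(sigma_max, eps_cutoff, n_steps + 1)
    x = rng.normal(scale=sigma_max)               # rough stand-in for the noised data at sigma_max
    for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
        w = posterior_weights(x, s_hi)
        x0 = rng.choice(x_train, p=w) if naive else np.sum(w * x_train)
        score = (x0 - x) / s_hi ** 2              # proxy score (naive) or exact score (posterior mean)
        x = x + s_hi * score * (s_hi - s_lo)      # Euler step of the reverse dynamics
    return x

for naive in (False, True):
    xs = np.array([sample(naive=naive) for _ in range(2000)])
    frac_gap = np.mean(np.abs(xs) < 0.5)
    label = "naive proxy-score" if naive else "exact score"
    print(f"{label:>18s} sampler: fraction of samples in the gap |x| < 0.5 = {frac_gap:.3f}")
```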
The authors connect their theoretical findings to empirical observations in the literature, such as diffusion models producing smooth distributions despite noisy score estimates, the utility of particular choices of the training weight λ_t, and the influence of architecture (such as CNNs) on inductive biases. They highlight that while generalization through variance is powerful and explains phenomena like interpolation and feature blending, it can also be harmful if it blends undesirable modes.
The paper concludes by acknowledging its limitations, including its focus on unconditional, non-latent models, its simplified treatment of training dynamics (full-batch gradient descent), and its reliance on theoretically tractable models (linear, NTK) rather than realistic architectures such as U-Nets. Nonetheless, the authors argue that the theoretical framework and the concept of "generalization through variance" provide a valuable foundation for understanding how diffusion models generalize.