Noise-Conditioned Score Networks Explained

Updated 10 March 2026

The paper introduces a framework for approximating Stein scores via noise-conditioned score matching, enabling efficient sampling using annealed Langevin dynamics.
It leverages conditional instance normalization and multiplicative noise conditioning to stabilize training on low-dimensional data manifolds.
The analysis provides theoretical guarantees on optimization, sample complexity, and convergence by employing neural tangent kernel methods.

Noise-Conditioned Score Networks (NCSNs) are generative models that estimate the gradients of the log-densities of Gaussian-perturbed data distributions at multiple noise levels, enabling sampling by annealed Langevin dynamics. By jointly training a single neural network to approximate the Stein scores of these noised data marginals, NCSNs avoid the ill-conditioning of scores on low-dimensional data manifolds and gradually denoise samples from high-variance noise toward the original data distribution. The framework, introduced by Song & Ermon (2019), has since been studied through analyses of the optimization and generalization properties of noise-conditioned score estimation, and through theoretical work on architectural restrictions such as multiplicative noise conditioning (Kim, 19 Jan 2026; Han et al., 2024; Song et al., 2019).

1. Mathematical Formulation of Noise-Conditioned Score Networks

Let $\mu_{\rm data}$ denote the data distribution on $\mathbb{R}^d$. For any noise scale $\sigma>0$, the Gaussian-perturbed data marginal is defined by

$p_\sigma(x) = \int_{\mathbb{R}^d} \frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left[-\frac{|x-\tilde{x}|^2}{2\sigma^2}\right]\,d\mu_{\rm data}(\tilde{x}).$

The objective is to estimate the (Stein) score function

$s^*(x,\sigma) = \nabla_x \log p_\sigma(x)$

for a sequence of noise levels $\{\sigma_i\}$. This circumvents the ill-posedness of $\nabla \log p_{\rm data}(x)$ for data supported on low-dimensional manifolds, as the added Gaussian noise regularizes the density and makes the score well-defined everywhere (Song et al., 2019).
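When $\mu_{\rm data}$ is the uniform empirical measure on a finite dataset, $p_\sigma$ is a Gaussian mixture and its score is a softmax-weighted average of $(\tilde{x}_j - x)/\sigma^2$. The following is a minimal pure-Python sketch under that assumption; the name `perturbed_score` is illustrative and does not come from the cited papers.

```python
import math

def perturbed_score(x, data, sigma):
    """Exact score of p_sigma when mu_data is the uniform empirical measure on `data`.

    x: list of d floats; data: list of points, each a list of d floats.
    """
    # Log of each unnormalized Gaussian weight exp(-|x - x_j|^2 / (2 sigma^2)).
    logits = [-sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / (2 * sigma ** 2)
              for p in data]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    # Score = sum_j w_j (x_j - x) / sigma^2 with normalized weights w_j.
    return [sum(w * (p[k] - x[k]) for w, p in zip(weights, data)) / (total * sigma ** 2)
            for k in range(len(x))]
```

For a single data point at the origin this reduces to $-x/\sigma^2$, the score of $\mathcal{N}(0,\sigma^2 I)$. For small $\sigma$ the weights concentrate on the data point nearest to $x$.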

The noise-conditioned score network is parameterized as $s_\theta(x, \sigma)$ and trained to minimize the denoising score matching objective

$\ell(\theta;\sigma) = \frac{1}{2} \mathbb{E}_{\mu_{\rm data}(x_0)}\mathbb{E}_{x \sim \mathcal{N}(x_0,\sigma^2I)}\,\| s_\theta(x,\sigma) + \frac{x - x_0}{\sigma^2} \|^2,$

typically averaged across all noise levels with scale weights $\lambda(\sigma)$ (empirically $\lambda(\sigma) = \sigma^2$). The training procedure interleaves samples across a geometric progression of noise levels $\sigma_i$ (Song et al., 2019).
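The weighted objective can be estimated by Monte Carlo. The sketch below assumes a uniform average over the noise levels and $\lambda(\sigma) = \sigma^2$; `geometric_sigmas`, `dsm_loss`, and the `score_fn(x, sigma)` calling convention are hypothetical names chosen for illustration, not an API from the cited work.

```python
import math
import random

def geometric_sigmas(sigma_max, sigma_min, num_levels):
    """Noise levels sigma_1 > ... > sigma_L with a constant ratio between neighbours."""
    if num_levels == 1:
        return [sigma_max]
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio ** i for i in range(num_levels)]

def dsm_loss(score_fn, data, sigmas, rng, samples_per_point=1):
    """Monte Carlo estimate of the sigma^2-weighted denoising score matching loss,
    averaged uniformly over the noise levels."""
    total, count = 0.0, 0
    for sigma in sigmas:
        for x0 in data:
            for _ in range(samples_per_point):
                # Perturb the clean point: x ~ N(x0, sigma^2 I).
                x = [c + sigma * rng.gauss(0.0, 1.0) for c in x0]
                s = score_fn(x, sigma)
                # Squared norm of the residual s_theta(x, sigma) + (x - x0) / sigma^2.
                sq = sum((si + (xi - ci) / sigma ** 2) ** 2
                         for si, xi, ci in zip(s, x, x0))
                total += sigma ** 2 * 0.5 * sq  # lambda(sigma) = sigma^2
                count += 1
    return total / count
```

Since $x - x_0 = \sigma z$ with $z \sim \mathcal{N}(0, I)$, the weight $\sigma^2$ makes the loss of the all-zero score equal to $d/2$ in expectation at every noise level, which keeps the per-level terms on a comparable scale.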

2. Network Architectures and Noise Conditioning

NCSN architectures accept both the input $x$ and the noise scale $\sigma$ and output a vector of the same shape as $x$. In Song & Ermon (2019), score networks use U-Net or RefineNet architectures with residual blocks, dilated convolutions, ELU activations, and skip connections.

Noise conditioning is introduced via "conditional instance normalization," in which the normalization parameters (scale $\gamma$ and shift $\beta$) at each layer are specific to each $\sigma_i$. Optionally, an embedding of $\sigma_i$ can be concatenated to feature maps. This explicit $\sigma$-dependence enables the network to learn score fields for distinct noise levels in a single model (Song et al., 2019).
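The per-level scale and shift can be illustrated with a simplified sketch. It covers only the mechanism described above (instance statistics per channel, then $\gamma$ and $\beta$ looked up by noise-level index) and is not the exact layer of the published architecture. The function name, the table layout `[level][channel]`, and the flat-list feature representation are assumptions made for this example.

```python
import math

def cond_instance_norm(features, gammas, betas, level, eps=1e-5):
    """Instance-normalize each channel, then apply the scale and shift
    belonging to noise level `level`.

    features: list of channels, each a flat list of spatial activations.
    gammas, betas: tables indexed as [level][channel].
    """
    out = []
    for c, channel in enumerate(features):
        # Per-channel statistics over the spatial positions of this one sample.
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        inv_std = 1.0 / math.sqrt(var + eps)
        # Noise-level-specific affine parameters.
        g, b = gammas[level][c], betas[level][c]
        out.append([g * (v - mean) * inv_std + b for v in channel])
    return out
```

Switching `level` changes only the affine parameters, so all convolutional weights are shared across noise levels.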

A restricted form of parameterization known as multiplicative noise conditioning is also common: $s_\theta(x, \sigma) = \sigma^{-\alpha} s_\theta(x)$ for some $\alpha > 0$, so that the sole noise dependence is a scalar multiplicative factor. This reduces model expressivity and prevents the network from learning the true score $\nabla_x \log p_\sigma(x)$, but it significantly simplifies optimization and still leads to high-quality samples (Kim, 19 Jan 2026).
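In code, this restriction amounts to wrapping a noise-independent vector field with a scalar factor. A minimal sketch follows; `multiplicative_score` and `static_fn` are hypothetical names.

```python
def multiplicative_score(static_fn, alpha):
    """Wrap a noise-independent field s_theta(x) as sigma^(-alpha) * s_theta(x)."""
    def score(x, sigma):
        scale = sigma ** (-alpha)
        return [scale * v for v in static_fn(x)]
    return score

# Example: the static field -x with alpha = 2 gives -x / sigma^2.
score = multiplicative_score(lambda x: [-v for v in x], alpha=2.0)
```

The direction of the field is identical at every noise level and only its magnitude changes. With the static field $-x$ and $\alpha = 2$ the wrapper coincides with the exact score of $\mathcal{N}(0, \sigma^2 I)$, a special case in which the restriction loses nothing.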

3. Sampling Procedures: Annealed Langevin Dynamics and Deterministic Flows

Once trained, NCSNs generate novel samples via annealed Langevin dynamics (ALD). The procedure is as follows:

- Initialize $x_0$ from a high-noise distribution, typically uniform noise.
- For each noise level $\sigma_i$ in a decreasing schedule $\sigma_1 > \cdots > \sigma_L$, run $T$ steps of the update

$x_{t+1} = x_t + \frac{\alpha_i}{2}s_\theta(x_t,\sigma_i) + \sqrt{\alpha_i}\,\varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0,I),$

where $\alpha_i = \epsilon \cdot (\sigma_i^2/\sigma_L^2)$ is the step size for noise level $\sigma_i$ (Song et al., 2019).
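The sampling loop can be sketched in a few lines of pure Python. The function name `annealed_langevin`, the `score_fn(x, sigma)` convention, and the toy settings in the usage lines (ten geometric levels from 1.0 to 0.01, `eps=2e-5`, 100 steps per level) are assumptions chosen for illustration. The toy score is the exact score for a single data point at the origin, not a trained network.

```python
import math
import random

def annealed_langevin(score_fn, x_init, sigmas, eps, steps_per_level, rng):
    """Run annealed Langevin dynamics over the decreasing noise levels `sigmas`."""
    x = list(x_init)
    sigma_last = sigmas[-1]
    for sigma in sigmas:
        # Step size alpha_i = eps * sigma_i^2 / sigma_L^2 for this level.
        alpha = eps * sigma ** 2 / sigma_last ** 2
        for _ in range(steps_per_level):
            s = score_fn(x, sigma)
            # Langevin update: half-step along the score plus Gaussian noise.
            x = [xi + 0.5 * alpha * si + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
                 for xi, si in zip(x, s)]
    return x

# Toy check: one data point at the origin, so p_sigma = N(0, sigma^2 I).
rng = random.Random(0)
sigmas = [10 ** (-2 * i / 9) for i in range(10)]  # geometric, 1.0 down to 0.01
x0 = [rng.uniform(0.0, 1.0) for _ in range(2)]
sample = annealed_langevin(lambda x, s: [-v / s ** 2 for v in x],
                           x0, sigmas, eps=2e-5, steps_per_level=100, rng=rng)
```

Because $\alpha_i$ scales with $\sigma_i^2$, the ratio between the drift step and the injected noise stays the same at every level. In this toy case each update contracts $x$ by the fixed factor $1 - \epsilon/(2\sigma_L^2)$.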

This progressive denoising approach enables efficient mixing and refinement of the sample distribution. The sampling process can also be analyzed through deterministic trajectories. Both ALD and the probability-flow ODE can be reduced (by rescaling and removing stochasticity) to an autonomous ODE

$\frac{dx}{dt} = s_*(x) + e(x)$

with $e(x) = s_\theta(x) - s_*(x)$ and $s_*(x)$ the optimal static network derived from the training objective under multiplicative noise conditioning (Kim, 19 Jan 2026).
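A deterministic flow of this form can be followed numerically with a forward-Euler scheme. This is only a sketch: `integrate_flow` is a hypothetical helper, and the linear field in the usage lines is a stand-in chosen so that the limit point is known. It is not the $s_*$ derived by Kim.

```python
def integrate_flow(field, x_init, dt, num_steps):
    """Forward-Euler integration of the autonomous ODE dx/dt = field(x)."""
    x = list(x_init)
    for _ in range(num_steps):
        v = field(x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Stand-in field pulling every coordinate toward 1.0.
end = integrate_flow(lambda x: [1.0 - v for v in x], [0.0, 3.0], dt=0.1, num_steps=200)
```

Replacing the stand-in with a learned static field $s_\theta(x)$ gives trajectories of the perturbed system, where $e(x)$ accounts for the gap to $s_*(x)$.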

4. Theoretical Properties and Optimization Analysis

Training the score network amounts to a regression problem with noisy labels $(x_{0,j})$ corresponding to the original data points and noisy inputs $(x_{t_j})$ generated by the perturbation process. This introduces specific challenges, including vector-valued outputs, additional time (or noise) conditioning, and unbounded input domains (Han et al., 2024).

Analysis within the neural tangent kernel (NTK) regime shows that:

- The evolution of two-layer ReLU networks under gradient descent mirrors kernel regression with an appropriate NTK.
- Early stopping is essential for generalization, preventing overfitting to noisy regression targets, and minimax-optimal rates can be achieved under suitable regularity and sample complexity conditions.
- Error decomposition identifies contributions from tail truncation, function approximation in the RKHS, finite network width, and label mismatch.
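The role of early stopping in kernel regression can be illustrated with gradient descent in function space, where the fitted values on the training inputs follow $f \leftarrow f - (\eta/n) K (f - y)$. The sketch below uses an RBF kernel as a stand-in for the NTK and scalar labels for simplicity. Both are assumptions of this example, and the helper names are hypothetical.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel, used here as a stand-in for the NTK."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def kernel_gd_predictions(xs, ys, lr, num_steps, kernel=rbf_kernel):
    """Gradient descent on squared loss in function space: the fitted values on
    the training inputs follow f <- f - (lr / n) * K (f - y), starting from f = 0."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    f = [0.0] * n
    for _ in range(num_steps):
        resid = [fi - yi for fi, yi in zip(f, ys)]
        f = [fi - lr / n * sum(K[i][j] * resid[j] for j in range(n))
             for i, fi in enumerate(f)]
    return f
```

The number of steps acts as the regularization parameter: stopping early leaves part of the label noise unfitted, while running to convergence interpolates the noisy targets.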

Crucially, these analyses establish the first algorithm-dependent (gradient descent, overparameterized network) generalization and sample complexity bounds for noise-conditioned score matching (Han et al., 2024).

5. Effects of Restricted Network Structures: Multiplicative Noise Conditioning

Imposing the structure $s_\theta(x, \sigma) = \sigma^{-\alpha} s_\theta(x)$ restricts the network's ability to capture arbitrary $\sigma$-dependence, but admits tractable analysis. The optimal $s_*(x)$ can be written in closed form via the kernel function $\Phi$, integrating across all noise levels and involving the data measure $\mu_{\rm data}$:

$s_*(x) = \frac{\int_{\mathbb{R}^d} \Phi_{\sigma_\varepsilon}^{\sigma_T}\left(\frac{d+\alpha}{2}, \frac{|x - \tilde{x}|^2}{2}\right)(\tilde{x} - x)\, d\mu_{\rm data}(\tilde{x})}{\int_{\mathbb{R}^d} \Phi_{\sigma_\varepsilon}^{\sigma_T}\left(\frac{d+2\alpha-2}{2}, \frac{|x - \tilde{x}|^2}{2}\right) d\mu_{\rm data}(\tilde{x})}.$

Despite not recovering the true score, the ODE $\dot x = s_*(x)$ drives the dynamics toward regions of high probability for the smoothed data potential $L(x)$, with theoretical guarantees of global existence, uniqueness, and convergence to high-density attractors under mild assumptions (Kim, 19 Jan 2026).