Generative Modeling by Estimating Gradients of the Data Distribution
(1907.05600v3)
Published 12 Jul 2019 in cs.LG and stat.ML
Abstract: We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.
The paper introduces a novel generative model using annealed Langevin dynamics and score matching to estimate data distribution gradients efficiently.
The approach trains a single noise conditional score network to simultaneously learn scores over multiple noise levels for robust sampling.
Empirical results on CIFAR-10 achieve an inception score of 8.87 and a competitive FID of 25.32, demonstrating the model's practical effectiveness.
The paper introduces a novel generative modeling framework leveraging Langevin dynamics with score matching for estimating gradients of the data distribution. The method addresses challenges related to data residing on low-dimensional manifolds by perturbing the data with Gaussian noise at varying levels and jointly estimating the corresponding scores. The authors propose an annealed Langevin dynamics approach, utilizing gradients corresponding to gradually decreasing noise levels as the sampling process converges to the data manifold.
The key components of the work include:
A new generative model based on Langevin dynamics and score matching.
Score estimation is performed by a neural network trained with score matching to learn the vector field of data distribution gradients.
An annealed version of Langevin dynamics is introduced, starting with scores corresponding to the highest noise level and gradually annealing down the noise level.
Score-based generative modeling involves two main steps: score matching and Langevin dynamics.
Score Matching: The score of a probability density $p(x)$ is defined as $\nabla_x \log p(x)$. Score matching trains a score network $s_\theta(x)$ to estimate $\nabla_x \log p_\text{data}(x)$. The objective function is:
$\qquad \mathbb{E}_{p_\text{data}(x)}\big[\mathrm{tr}(\nabla_x s_\theta(x)) + \frac{1}{2} \|s_\theta(x)\|_2^2 \big],$
where $\nabla_x s_\theta(x)$ denotes the Jacobian of $s_\theta(x)$.
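The trace term requires the Jacobian of the score network, which makes this vanilla objective expensive in high dimensions. As a rough illustration (not the authors' code), here is a minimal PyTorch sketch, assuming a hypothetical `score_net` that maps a batch of flattened inputs of shape `(B, D)` to scores of the same shape:

```python
import torch

def exact_score_matching_loss(score_net, x):
    """Batch estimate of E[tr(grad_x s(x)) + 0.5 * ||s(x)||^2]."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                   # (B, D) estimated scores
    norm_term = 0.5 * (s ** 2).sum(dim=1)              # 0.5 * ||s(x)||_2^2 per sample
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):                        # one backward pass per data dimension
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]                   # accumulate diagonal Jacobian entries
    return (trace + norm_term).mean()
```

The D backward passes needed for the trace are exactly the cost that the denoising and sliced variants below avoid.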
Denoising Score Matching: A variant of score matching that first perturbs a data point $x$ with a pre-specified noise distribution $q_\sigma(\tilde{x} \mid x)$ and then employs score matching to estimate the score of the perturbed data distribution $q_\sigma(\tilde{x})$. The objective is:
$\qquad \frac{1}{2}\mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p_\text{data}(x)}\big[\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\|_2^2\big],$
whose minimizer matches $\nabla_x \log q_\sigma(x)$, which is close to $\nabla_x \log p_\text{data}(x)$ only when the noise is small.
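For the Gaussian choice $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$ (the case used later in the paper), the target score is $-(\tilde{x} - x)/\sigma^2$, so the objective reduces to a simple regression. A hedged sketch under that assumption, with the same hypothetical `score_net` as above:

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma):
    """0.5 * E[ || s(x_tilde) + (x_tilde - x) / sigma^2 ||^2 ] for Gaussian noise."""
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise                        # sample from q_sigma(x_tilde | x)
    target = -noise / sigma ** 2               # grad_{x_tilde} log q_sigma(x_tilde | x)
    s = score_net(x_tilde)
    return 0.5 * ((s - target) ** 2).sum(dim=1).mean()
```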
Sliced Score Matching: Sliced score matching uses random projections to approximate $\mathrm{tr}(\nabla_x s_\theta(x))$ in score matching. The objective is:
$\qquad \mathbb{E}_{p_v} \mathbb{E}_{p_\text{data}}\bigg[v^\intercal \nabla_x s_\theta(x) v + \frac{1}{2} \|s_\theta(x)\|_2^2 \bigg],$
where $p_v$ is a simple distribution of random vectors, e.g., the multivariate standard normal.
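The projection term $v^\intercal \nabla_x s_\theta(x) v$ can be computed with a single extra backward pass per projection vector rather than a full Jacobian. A minimal sketch under the same assumptions as the earlier snippets:

```python
import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    """E_v E_x[ v^T (grad_x s(x)) v + 0.5 * ||s(x)||^2 ] via vector-Jacobian products."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                    # (B, D)
    loss = 0.5 * (s ** 2).sum(dim=1)                    # norm term, independent of v
    for _ in range(n_projections):
        v = torch.randn_like(x)                         # v ~ N(0, I)
        sv = (s * v).sum()                              # sum over the batch of v^T s(x)
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]   # per-sample (grad_x s)^T v
        loss = loss + (grad_sv * v).sum(dim=1) / n_projections       # v^T (grad_x s) v
    return loss.mean()
```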
Langevin Dynamics: Langevin dynamics produces samples from a probability density $p(x)$ using only the score function $\nabla_x \log p(x)$. Given a fixed step size $\epsilon > 0$, the Langevin method recursively computes:
$\qquad \tilde{x}_t = \tilde{x}_{t-1} + \frac{\epsilon}{2} \nabla_x \log p(\tilde{x}_{t-1}) + \sqrt{\epsilon}\, z_t,$
where $z_t \sim \mathcal{N}(0, I)$.
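A minimal sketch of this update rule, assuming a callable `score_fn` that returns (an estimate of) $\nabla_x \log p(x)$ for a batch of inputs:

```python
import torch

def langevin_dynamics(score_fn, x_init, step_size, n_steps):
    """x_t = x_{t-1} + (eps / 2) * score(x_{t-1}) + sqrt(eps) * z_t, with z_t ~ N(0, I)."""
    x = x_init.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * z
    return x
```

In practice the true score is unknown, so the learned $s_\theta(x) \approx \nabla_x \log p_\text{data}(x)$ is plugged in, which is what makes accurate score estimation essential.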
To address the challenges of score-based generative modeling, specifically the manifold hypothesis and the presence of low data density regions, the authors propose Noise Conditional Score Networks (NCSNs).
Noise Conditional Score Networks (NCSN): The data is perturbed with multiple levels of Gaussian noise $\sigma_1 > \sigma_2 > \dots > \sigma_L$, and the scores corresponding to all noise levels are estimated simultaneously by training a single conditional score network $s_\theta(x, \sigma)$. The noise levels are chosen such that $\sigma_1$ is large enough to mitigate the difficulties caused by low data density regions and the manifold hypothesis, while $\sigma_L$ is small enough that the perturbation has a negligible effect on the data.
Learning NCSNs via Score Matching: Both sliced and denoising score matching can train NCSNs. With Gaussian noise, the denoising score matching objective for a given $\sigma$ is:
$\qquad \ell(\theta; \sigma) \triangleq \frac{1}{2} \mathbb{E}_{p_\text{data}(x)} \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x, \sigma^2 I)}\bigg[\bigg\|s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2}\bigg\|_2^2\bigg].$
This is combined over all $\sigma \in \{\sigma_i\}_{i=1}^L$ to get a unified objective:
$\qquad \mathcal{L}(\theta; \{\sigma_i\}_{i=1}^L) \triangleq \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \ell(\theta; \sigma_i).$
The coefficient function is chosen as $\lambda(\sigma) = \sigma^2$, so that each weighted term $\lambda(\sigma_i)\, \ell(\theta; \sigma_i)$ has roughly the same order of magnitude across noise levels.
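A sketch of a batch estimate of this unified objective, specialized to Gaussian noise; the conditioning interface `score_net(x_tilde, i)` standing in for $s_\theta(x, \sigma_i)$ is an assumption for illustration, not the authors' exact API:

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    """Batch estimate of (1/L) * sum_i lambda(sigma_i) * l(theta; sigma_i).

    x: flattened data batch of shape (B, D); sigmas: 1-D tensor of noise levels.
    """
    total = 0.0
    for i, sigma in enumerate(sigmas):
        noise = torch.randn_like(x) * sigma
        x_tilde = x + noise                             # x_tilde ~ N(x, sigma^2 I)
        target = -noise / sigma ** 2                    # grad log N(x_tilde; x, sigma^2 I)
        s = score_net(x_tilde, i)                       # s_theta(x_tilde, sigma_i)
        dsm = 0.5 * ((s - target) ** 2).sum(dim=1).mean()
        # lambda(sigma) = sigma^2 keeps every weighted term on a comparable scale.
        total = total + sigma ** 2 * dsm
    return total / len(sigmas)
```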
NCSN inference via annealed Langevin dynamics: After training the NCSN $s_\theta(x, \sigma)$, annealed Langevin dynamics is used to generate samples. Sampling starts by initializing from a fixed prior distribution and running Langevin dynamics to sample from $q_{\sigma_1}(x)$ with step size $\alpha_1$. Langevin dynamics is then run to sample from $q_{\sigma_2}(x)$ with a reduced step size $\alpha_2$, starting from the final samples of the previous stage. This process continues until $q_{\sigma_L}(x)$ is sampled, which is close to $p_\text{data}(x)$ when $\sigma_L \approx 0$.
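A sketch of the corresponding sampler, following the paper's step-size schedule $\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$; the `score_net(x, i)` conditioning interface is again an assumed stand-in for $s_\theta(x, \sigma_i)$:

```python
import torch

def annealed_langevin_dynamics(score_net, x_init, sigmas, eps=2e-5, n_steps_each=100):
    """Run n_steps_each Langevin steps per noise level, from the largest sigma to the smallest."""
    x = x_init.clone()
    for i, sigma in enumerate(sigmas):                  # sigmas sorted in decreasing order
        alpha = eps * (sigma / sigmas[-1]) ** 2         # alpha_i = eps * sigma_i^2 / sigma_L^2
        for _ in range(n_steps_each):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, i) + (alpha ** 0.5) * z
    return x
```

The final samples correspond to the smallest noise level $\sigma_L$, and hence approximately to $p_\text{data}(x)$.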
The authors present image generation results on MNIST, CelebA, and CIFAR-10 datasets. On CIFAR-10, the model achieves a state-of-the-art inception score of 8.87 for unconditional generative models and a competitive FID score of 25.32. Additionally, they demonstrate the model's ability to learn effective representations via image inpainting experiments.