- The paper introduces UNIT-DDPM, a novel approach that uses dual-domain denoising diffusion for unpaired image translation to ensure stable and realistic outputs.
- It employs conditional reverse Markov chains and cycle-consistency loss to accurately model joint image distributions between two domains.
- Evaluated on tasks like facades and seasonal translations, UNIT-DDPM significantly outperforms GAN-based methods in both FID scores and visual quality.
The paper "UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models" (2104.05358) introduces a novel method for unpaired image-to-image (I2I) translation that leverages Denoising Diffusion Probabilistic Models (DDPMs) instead of the commonly used Generative Adversarial Networks (GANs). The core motivation is to achieve more stable training and generate higher-quality images compared to GAN-based approaches, which often suffer from training instability and potential mode collapse.
The proposed method, UNIT-DDPM, learns the translation between two image domains (X_A and X_B) by training a generative model to approximate the joint distribution of images over both domains. This is framed as a Markov chain process, similar to standard DDPMs, but with the reverse (denoising) process in each domain conditioned on the image from the other domain.
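Both domains share the standard DDPM forward process, which admits the closed form q(x_t | x_0) = N(√ᾱ_t x_0, (1 − ᾱ_t) I) with ᾱ_t = ∏_{s≤t}(1 − β_s). The snippet below is a minimal NumPy illustration of this forward noising; the schedule endpoints (β_1 = 1e-4, β_T = 0.02) are the common DDPM defaults and are an assumption here, not a value confirmed by the paper.

```python
import numpy as np

def make_linear_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule beta_t and the cumulative products
    alpha_bar_t = prod_{s<=t} (1 - beta_s) used for closed-form noising."""
    betas = np.linspace(beta_1, beta_T, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, noise=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).

    The same closed form noises images from either domain, so one draw of
    `noise` provides the (x_t, noise) pair used by the denoising loss.
    """
    if noise is None:
        noise = np.random.default_rng().standard_normal(x0.shape)
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise
```

As t grows, ᾱ_t shrinks toward zero and x_t approaches pure Gaussian noise, which is why sampling can start from N(0, I).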
Core Concepts and Methodology
- Dual-Domain Markov Chain: The method defines reverse Markov chains p_{θ_A} and p_{θ_B} for domains A and B, respectively. Unlike standard DDPMs, where the reverse process p_θ(x_{t−1} | x_t) depends only on the noisy image x_t at time t, UNIT-DDPM makes these transitions conditional on the noisy state of the image in the other domain. Specifically, the transition for domain A is p_{θ_A}(x_{t−1}^A | x_t^A, x̃_t^B) and for domain B is p_{θ_B}(x_{t−1}^B | x_t^B, x̃_t^A). Here, x̃_t^B denotes a noisy version of an image originating from domain B, and x̃_t^A one originating from domain A.
- Training Process:
- The training involves simultaneously optimizing two types of models:
- The parameters θ_A, θ_B of the conditional denoising models ϵ_{θ_A}, ϵ_{θ_B} (which predict the added noise).
- The parameters ϕ_A, ϕ_B of auxiliary domain translation functions g_{ϕ_A}: X_A → X_B and g_{ϕ_B}: X_B → X_A. These translation functions are used only during training, to generate pseudo-translated images for conditioning. For example, when training the domain A denoising model ϵ_{θ_A} on a real image x_0^A, it is conditioned on a noisy version of the pseudo-translated domain B image g_{ϕ_A}(x_0^A).
- The optimization objective is based on the denoising score matching (DSM) principle from DDPMs, minimizing the difference between the predicted noise and the actual noise added during the forward process.
- The loss for θ_A, θ_B (Equation 9) denoises real samples from one domain conditioned on noisy translated samples from the other domain.
- The loss for ϕ_A, ϕ_B (Equation 10) trains the translation functions so that their noisy outputs serve as effective conditioning for the denoising models, paired with either real or translated noisy samples from the other domain.
- Crucially, a cycle-consistency loss (Equation 11), similar to CycleGAN's, is added to the ϕ training objective (Equation 12) to enforce approximate bijectivity between the domains via g_{ϕ_A} and g_{ϕ_B}.
- Algorithm 1 outlines this iterative training process where θ and ϕ are updated based on their respective loss functions using noisy versions of real and translated images.
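The training procedure above can be sketched as a single iteration in NumPy. This is an illustrative sketch, not the paper's implementation: `eps_A`/`eps_B` and `g_A`/`g_B` are hypothetical callables standing in for the U-Net denoisers and ResNet translators, the mapping of each term to Equations 9, 10, and 12 is a plausible reading of the summary, and a real implementation would backpropagate each loss into its own parameter set.

```python
import numpy as np

def unit_ddpm_train_step(x0_A, x0_B, eps_A, eps_B, g_A, g_B,
                         alpha_bars, lambda_cyc=10.0, rng=None):
    """One dual-domain training step (sketch).

    eps_A, eps_B : denoisers eps(x_t, cond_t, t) -> predicted noise.
    g_A : X_A -> X_B and g_B : X_B -> X_A pseudo-translators.
    Returns (loss_theta, loss_phi); gradients are omitted in this sketch.
    """
    rng = rng or np.random.default_rng()
    t = int(rng.integers(len(alpha_bars)))
    ab = alpha_bars[t]
    noise_at = lambda x0, n: np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * n

    # Pseudo-translations supply the cross-domain conditioning pairs.
    xB_tilde = g_A(x0_A)   # pseudo domain-B partner of the real x0_A
    xA_tilde = g_B(x0_B)   # pseudo domain-A partner of the real x0_B

    nA, nB = rng.standard_normal(x0_A.shape), rng.standard_normal(x0_B.shape)
    ntA, ntB = rng.standard_normal(x0_A.shape), rng.standard_normal(x0_B.shape)

    # Eq. (9)-style theta loss: denoise real images conditioned on noisy translations.
    loss_theta = (
        np.mean((eps_A(noise_at(x0_A, nA), noise_at(xB_tilde, ntB), t) - nA) ** 2)
        + np.mean((eps_B(noise_at(x0_B, nB), noise_at(xA_tilde, ntA), t) - nB) ** 2)
    )

    # Eq. (10)-style phi loss: denoise translated images conditioned on noisy real ones.
    loss_phi = (
        np.mean((eps_A(noise_at(xA_tilde, ntA), noise_at(x0_B, nB), t) - ntA) ** 2)
        + np.mean((eps_B(noise_at(xB_tilde, ntB), noise_at(x0_A, nA), t) - ntB) ** 2)
    )

    # Eq. (11) cycle-consistency term, weighted into the phi objective (Eq. 12).
    loss_cyc = (np.mean(np.abs(g_B(g_A(x0_A)) - x0_A))
                + np.mean(np.abs(g_A(g_B(x0_B)) - x0_B)))
    return loss_theta, loss_phi + lambda_cyc * loss_cyc
```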
- Inference (Image Translation):
- The domain translation functions g_{ϕ_A} and g_{ϕ_B} are discarded after training.
- Translation from source domain A (X_A) to target domain B (X_B) is performed with the trained denoising models ϵ_{θ_A} and ϵ_{θ_B} via a conditioned Markov Chain Monte Carlo (MCMC) sampling process based on Langevin dynamics (Algorithm 2).
- Starting from pure noise in domain B (x̂_T^B ∼ N(0, I)), the algorithm iteratively denoises the image using ϵ_{θ_B}.
- The conditioning comes from the input source image x_0^A. For the initial steps (from t = T down to a "release time" t_r), a noisy version of the input x_0^A, denoted x̂_t^A, is generated using the forward diffusion process (Equation 5).
- For steps t > t_r, the domain B denoiser ϵ_{θ_B} uses this forward-diffused x̂_t^A as conditioning to denoise x̂_t^B into x̂_{t−1}^B.
- For steps t ≤ t_r, both the domain A and domain B images are progressively denoised, with the domain B denoiser ϵ_{θ_B} conditioned on x̂_t^A and the domain A denoiser ϵ_{θ_A} conditioned on x̂_t^B. This mutual conditioning guides the denoising process to produce a domain B image that corresponds to the input domain A image.
- The final output x̂_0^B is the result of this progressive denoising and conditioning process.
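A simplified version of this conditioned sampling loop might look like the following NumPy sketch. It substitutes the standard DDPM ancestral update for the paper's exact Langevin scheme, and `eps_A`/`eps_B` are hypothetical trained denoisers, so treat it as an illustration of the release-time logic rather than a faithful Algorithm 2.

```python
import numpy as np

def ddpm_step(x_t, eps_pred, t, betas, alpha_bars, rng):
    """Standard DDPM ancestral update x_t -> x_{t-1} given predicted noise."""
    alpha = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha)
    if t == 0:
        return mean                      # no noise added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def translate_A_to_B(x0_A, eps_A, eps_B, betas, alpha_bars, t_release=1, rng=None):
    """Translate a domain A image into domain B via conditioned sampling."""
    rng = rng or np.random.default_rng(0)
    T = len(betas)
    xB = rng.standard_normal(x0_A.shape)  # start from pure noise in domain B
    xA = x0_A
    for t in range(T - 1, -1, -1):
        if t >= t_release:
            # Before the release time: condition on a forward-diffused
            # version of the source image (Eq. 5 of the paper).
            xA = (np.sqrt(alpha_bars[t]) * x0_A
                  + np.sqrt(1.0 - alpha_bars[t]) * rng.standard_normal(x0_A.shape))
        else:
            # After the release time: denoise the source-domain image too,
            # conditioned on the current domain-B estimate.
            xA = ddpm_step(xA, eps_A(xA, xB, t), t, betas, alpha_bars, rng)
        # Always denoise the domain-B image conditioned on xA.
        xB = ddpm_step(xB, eps_B(xB, xA, t), t, betas, alpha_bars, rng)
    return xB
```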
Implementation Details
- The denoising models (ϵ_{θ_A}, ϵ_{θ_B}) are implemented using a U-Net architecture inspired by PixelCNN and Wide ResNet, incorporating sinusoidal position embeddings to encode the timestep t. Practical implementation choices include using ReLU and Batch Normalization instead of the Swish activations, Group Normalization, and self-attention found in some other DDPM models, likely for computational efficiency.
- The auxiliary domain translation functions (g_{ϕ_A}, g_{ϕ_B}), used only during training, are implemented with a ResNet architecture.
- Training uses a large number of timesteps (T = 1000) with a linearly increasing variance schedule β_t (with α_t = 1 − β_t). The models are optimized with Adam.
- The cycle-consistency loss weight λcyc is a hyperparameter.
- During inference, the "release time" t_r controls when the reverse process for the source-domain image is introduced. The paper defaults to t_r = 1 for evaluations, but notes that its impact can be dataset-dependent.
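For concreteness, the sinusoidal timestep embedding mentioned above can be sketched as follows. This is the Transformer-style construction; the base frequency of 10000 is the usual convention and is an assumption here, not a constant confirmed by the paper.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of timestep t (Transformer-style).

    Returns a vector of length `dim` (assumed even): `dim // 2` sine
    components followed by `dim // 2` cosine components at geometrically
    spaced frequencies, so the network can resolve both coarse and fine t.
    """
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

In practice this vector is projected by small dense layers and added to the U-Net's intermediate feature maps so that a single network can denoise at every noise level t.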
Practical Applications and Evaluation
UNIT-DDPM is applicable to various unpaired I2I translation tasks where preserving content while changing domain-specific style or structure is desired. Examples demonstrated in the paper include:
- Semantic labels to building photos (Facades)
- Maps to satellite photos (Photos--Maps)
- Summer to winter scenes
- RGB visible images to thermal infrared images (RGB--Thermal)
The method was evaluated quantitatively using the Fréchet Inception Distance (FID) and qualitatively through visual inspection. It achieved state-of-the-art FID scores on several benchmark datasets, significantly outperforming contemporary GAN-based and other unpaired I2I methods like CycleGAN, UNIT, MUNIT, and DRIT++. The qualitative results also indicate more realistic and higher-quality generated images. A key advantage highlighted is the stable model training without the complexities often associated with GAN training.
Limitations and Future Work
The paper acknowledges limitations:
- Resolution Scaling: The current architecture struggles with higher-resolution images (e.g., 256×256), potentially failing to capture global image information. Future work could involve architectural changes, such as adding more layers or attention mechanisms, to better handle higher dimensions.
- Inference Speed: Like most DDPMs, the sequential MCMC sampling process is significantly slower than single-pass GAN generators. Potential solutions mentioned include faster sampling techniques such as Denoising Diffusion Implicit Models (DDIMs), or optimizing the reverse-process variance Σ_θ to reduce the number of required timesteps.
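To illustrate the DDIM idea mentioned above (this is generic DDIM, not the paper's own code): a deterministic DDIM update (η = 0) lets the sampler jump between non-adjacent timesteps, cutting the number of network evaluations from T to a few dozen.

```python
import numpy as np

def ddim_step(x_t, eps_pred, t, t_prev, alpha_bars):
    """Deterministic DDIM update (eta = 0) from timestep t to t_prev < t.

    First recovers the model's current estimate of the clean image x0,
    then re-noises it analytically to the (possibly much earlier) level
    t_prev, skipping all intermediate steps.
    """
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred
```

Because each update is deterministic given the noise prediction, a schedule such as t = 1000, 950, 900, … traverses the same trajectory family with roughly 20× fewer denoiser calls.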
Future research directions include improving the model's capability for higher resolutions, accelerating the sampling process, and evaluating the utility of the translated images for downstream vision tasks.