- The paper introduces UNIT-DDPM, a novel approach that uses dual-domain denoising diffusion for unpaired image translation to ensure stable and realistic outputs.
- It employs conditional reverse Markov chains and cycle-consistency loss to accurately model joint image distributions between two domains.
- Evaluated on tasks like facades and seasonal translations, UNIT-DDPM significantly outperforms GAN-based methods in both FID scores and visual quality.
The paper "UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models" (2104.05358) introduces a novel method for unpaired image-to-image (I2I) translation that leverages Denoising Diffusion Probabilistic Models (DDPMs) instead of the commonly used Generative Adversarial Networks (GANs). The core motivation is to achieve more stable training and generate higher-quality images compared to GAN-based approaches, which often suffer from training instability and potential mode collapse.
The proposed method, UNIT-DDPM, learns the translation between two image domains (X_A and X_B) by training a generative model to approximate the joint distribution of images over both domains. This is framed as a Markov chain process, similar to standard DDPMs, but with the reverse (denoising) process in each domain conditioned on the image from the other domain.
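Both domains share the standard DDPM forward process, which admits the closed form q(x_t | x_0) = N(√ᾱ_t x_0, (1 − ᾱ_t) I) with ᾱ_t = ∏_{s≤t}(1 − β_s). The snippet below is a minimal NumPy illustration of this forward noising; the schedule endpoints (β_1 = 1e-4, β_T = 0.02) are the common DDPM defaults and are an assumption here, not a value confirmed by the paper.

```python
import numpy as np

def make_linear_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule beta_t and the cumulative products
    alpha_bar_t = prod_{s<=t} (1 - beta_s) used for closed-form noising."""
    betas = np.linspace(beta_1, beta_T, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, noise=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).

    The same closed form noises images from either domain, so one draw of
    `noise` provides the (x_t, noise) pair used by the denoising loss.
    """
    if noise is None:
        noise = np.random.default_rng().standard_normal(x0.shape)
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise
```

As t grows, ᾱ_t shrinks toward zero and x_t approaches pure Gaussian noise, which is why sampling can start from N(0, I).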
Core Concepts and Methodology
- Dual-Domain Markov Chain: The method defines reverse Markov chains p_{θ_A} and p_{θ_B} for domains A and B, respectively. Unlike standard DDPMs, where the reverse process p_θ(x_{t−1} | x_t) depends only on the noisy image x_t at time t, UNIT-DDPM makes these transitions conditional on the noisy state of the image in the other domain. Specifically, the transition for domain A is p_{θ_A}(x_{t−1}^A | x_t^A, x̃_t^B) and for domain B is p_{θ_B}(x_{t−1}^B | x_t^B, x̃_t^A). Here, x̃_t^B denotes a noisy version of an image originating from domain B, and x̃_t^A one originating from domain A.
- Training Process:
- The training involves simultaneously optimizing two types of models:
- The parameters θ_A, θ_B of the conditional denoising models ϵ_{θ_A}, ϵ_{θ_B} (which predict the added noise).
- The parameters ϕ_A, ϕ_B of auxiliary domain translation functions g_{ϕ_A}: X_A → X_B and g_{ϕ_B}: X_B → X_A. These translation functions are used only during training, to generate pseudo-translated images for conditioning. For example, when training the domain A denoising model ϵ_{θ_A} on a real image x_0^A, it is conditioned on a noisy version of the pseudo-translated domain B image g_{ϕ_A}(x_0^A).
- The optimization objective is based on the denoising score matching (DSM) principle from DDPMs, minimizing the difference between the predicted noise and the actual noise added during the forward process.
- The loss for θ_A, θ_B (Equation 9) denoises real samples from one domain conditioned on noisy translated samples from the other domain.
- The loss for ϕ_A, ϕ_B (Equation 10) trains the translation functions so that their noisy outputs serve as effective conditioning for the denoising models, paired with either real or translated noisy samples from the other domain.
- Crucially, a cycle-consistency loss (Equation 11), similar to CycleGAN's, is added to the ϕ training objective (Equation 12) to enforce approximate bijectivity between the domains via g_{ϕ_A} and g_{ϕ_B}.
- Algorithm 1 outlines this iterative training process where θ and ϕ are updated based on their respective loss functions using noisy versions of real and translated images.
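The training procedure above can be sketched as a single iteration in NumPy. This is an illustrative sketch, not the paper's implementation: `eps_A`/`eps_B` and `g_A`/`g_B` are hypothetical callables standing in for the U-Net denoisers and ResNet translators, the mapping of each term to Equations 9, 10, and 12 is a plausible reading of the summary, and a real implementation would backpropagate each loss into its own parameter set.

```python
import numpy as np

def unit_ddpm_train_step(x0_A, x0_B, eps_A, eps_B, g_A, g_B,
                         alpha_bars, lambda_cyc=10.0, rng=None):
    """One dual-domain training step (sketch).

    eps_A, eps_B : denoisers eps(x_t, cond_t, t) -> predicted noise.
    g_A : X_A -> X_B and g_B : X_B -> X_A pseudo-translators.
    Returns (loss_theta, loss_phi); gradients are omitted in this sketch.
    """
    rng = rng or np.random.default_rng()
    t = int(rng.integers(len(alpha_bars)))
    ab = alpha_bars[t]
    noise_at = lambda x0, n: np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * n

    # Pseudo-translations supply the cross-domain conditioning pairs.
    xB_tilde = g_A(x0_A)   # pseudo domain-B partner of the real x0_A
    xA_tilde = g_B(x0_B)   # pseudo domain-A partner of the real x0_B

    nA, nB = rng.standard_normal(x0_A.shape), rng.standard_normal(x0_B.shape)
    ntA, ntB = rng.standard_normal(x0_A.shape), rng.standard_normal(x0_B.shape)

    # Eq. (9)-style theta loss: denoise real images conditioned on noisy translations.
    loss_theta = (
        np.mean((eps_A(noise_at(x0_A, nA), noise_at(xB_tilde, ntB), t) - nA) ** 2)
        + np.mean((eps_B(noise_at(x0_B, nB), noise_at(xA_tilde, ntA), t) - nB) ** 2)
    )

    # Eq. (10)-style phi loss: denoise translated images conditioned on noisy real ones.
    loss_phi = (
        np.mean((eps_A(noise_at(xA_tilde, ntA), noise_at(x0_B, nB), t) - ntA) ** 2)
        + np.mean((eps_B(noise_at(xB_tilde, ntB), noise_at(x0_A, nA), t) - ntB) ** 2)
    )

    # Eq. (11) cycle-consistency term, weighted into the phi objective (Eq. 12).
    loss_cyc = (np.mean(np.abs(g_B(g_A(x0_A)) - x0_A))
                + np.mean(np.abs(g_A(g_B(x0_B)) - x0_B)))
    return loss_theta, loss_phi + lambda_cyc * loss_cyc
```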
- Inference (Image Translation):
- The domain translation functions g_{ϕ_A} and g_{ϕ_B} are discarded after training.
- Translation from source domain A (X_A) to target domain B (X_B) is performed with the trained denoising models ϵ_{θ_A} and ϵ_{θ_B} via a conditioned Markov Chain Monte Carlo (MCMC) sampling process based on Langevin dynamics (Algorithm 2).
- Starting from pure noise in domain B (x̂_T^B ∼ N(0, I)), the algorithm iteratively denoises the image using ϵ_{θ_B}.
- The conditioning comes from the input source image x_0^A. For the initial steps (from t = T down to a "release time" t_r), a noisy version of the input x_0^A, denoted x̂_t^A, is generated using the forward diffusion process (Equation 5).
- For steps t > t_r, the domain B denoiser ϵ_{θ_B} uses this forward-diffused x̂_t^A as conditioning to denoise x̂_t^B into x̂_{t−1}^B.
- For steps t ≤ t_r, both the domain A and domain B images are progressively denoised, with the domain B denoiser ϵ_{θ_B} conditioned on x̂_t^A and the domain A denoiser ϵ_{θ_A} conditioned on x̂_t^B. This mutual conditioning guides the denoising process to produce a domain B image that corresponds to the input domain A image.
- The final output x̂_0^B is the result of this progressive denoising and conditioning process.
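A simplified version of this conditioned sampling loop might look like the following NumPy sketch. It substitutes the standard DDPM ancestral update for the paper's exact Langevin scheme, and `eps_A`/`eps_B` are hypothetical trained denoisers, so treat it as an illustration of the release-time logic rather than a faithful Algorithm 2.

```python
import numpy as np

def ddpm_step(x_t, eps_pred, t, betas, alpha_bars, rng):
    """Standard DDPM ancestral update x_t -> x_{t-1} given predicted noise."""
    alpha = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha)
    if t == 0:
        return mean                      # no noise added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def translate_A_to_B(x0_A, eps_A, eps_B, betas, alpha_bars, t_release=1, rng=None):
    """Translate a domain A image into domain B via conditioned sampling."""
    rng = rng or np.random.default_rng(0)
    T = len(betas)
    xB = rng.standard_normal(x0_A.shape)  # start from pure noise in domain B
    xA = x0_A
    for t in range(T - 1, -1, -1):
        if t >= t_release:
            # Before the release time: condition on a forward-diffused
            # version of the source image (Eq. 5 of the paper).
            xA = (np.sqrt(alpha_bars[t]) * x0_A
                  + np.sqrt(1.0 - alpha_bars[t]) * rng.standard_normal(x0_A.shape))
        else:
            # After the release time: denoise the source-domain image too,
            # conditioned on the current domain-B estimate.
            xA = ddpm_step(xA, eps_A(xA, xB, t), t, betas, alpha_bars, rng)
        # Always denoise the domain-B image conditioned on xA.
        xB = ddpm_step(xB, eps_B(xB, xA, t), t, betas, alpha_bars, rng)
    return xB
```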
Implementation Details
- The denoising models (ϵ_{θ_A}, ϵ_{θ_B}) are implemented using a U-Net architecture inspired by PixelCNN and Wide ResNet, incorporating sinusoidal position embeddings to encode the timestep t. Practical implementation choices include using ReLU and Batch Normalization instead of the Swish activations, Group Normalization, and self-attention found in some other DDPM models, likely for computational efficiency.
- The auxiliary domain translation functions (g_{ϕ_A}, g_{ϕ_B}), used only during training, are implemented with a ResNet architecture.
- Training uses a large number of timesteps (T = 1000) with a linearly increasing variance schedule β_t (with α_t = 1 − β_t). The models are optimized with Adam.
- The cycle-consistency loss weight λcyc is a hyperparameter.
- During inference, the "release time" t_r controls when the reverse process for the source-domain image is introduced. The paper defaults to t_r = 1 for evaluations, but notes that its impact can be dataset-dependent.
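For concreteness, the sinusoidal timestep embedding mentioned above can be sketched as follows. This is the Transformer-style construction; the base frequency of 10000 is the usual convention and is an assumption here, not a constant confirmed by the paper.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of timestep t (Transformer-style).

    Returns a vector of length `dim` (assumed even): `dim // 2` sine
    components followed by `dim // 2` cosine components at geometrically
    spaced frequencies, so the network can resolve both coarse and fine t.
    """
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

In practice this vector is projected by small dense layers and added to the U-Net's intermediate feature maps so that a single network can denoise at every noise level t.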
Practical Applications and Evaluation
UNIT-DDPM is applicable to various unpaired I2I translation tasks where preserving content while changing domain-specific style or structure is desired. Examples demonstrated in the paper include:
- Semantic labels to building photos (Facades)
- Maps to satellite photos (Photos--Maps)
- Summer to winter scenes
- RGB visible images to thermal infrared images (RGB--Thermal)
The method was evaluated quantitatively using the Fréchet Inception Distance (FID) and qualitatively through visual inspection. It achieved state-of-the-art FID scores on several benchmark datasets, significantly outperforming contemporary GAN-based and other unpaired I2I methods like CycleGAN, UNIT, MUNIT, and DRIT++. The qualitative results also indicate more realistic and higher-quality generated images. A key advantage highlighted is the stable model training without the complexities often associated with GAN training.
Limitations and Future Work
The paper acknowledges limitations:
- Resolution Scaling: The current architecture struggles with higher-resolution images (e.g., 256×256), potentially failing to capture global image information. Future work could involve architectural changes, such as adding more layers or attention mechanisms, to better handle higher dimensions.
- Inference Speed: Like most DDPMs, the sequential MCMC sampling process is significantly slower than single-pass GAN generators. Potential solutions mentioned include faster sampling techniques such as Denoising Diffusion Implicit Models (DDIMs), or optimizing the reverse-process variance Σ_θ to reduce the number of required timesteps.
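To illustrate the DDIM idea mentioned above (this is generic DDIM, not the paper's own code): a deterministic DDIM update (η = 0) lets the sampler jump between non-adjacent timesteps, cutting the number of network evaluations from T to a few dozen.

```python
import numpy as np

def ddim_step(x_t, eps_pred, t, t_prev, alpha_bars):
    """Deterministic DDIM update (eta = 0) from timestep t to t_prev < t.

    First recovers the model's current estimate of the clean image x0,
    then re-noises it analytically to the (possibly much earlier) level
    t_prev, skipping all intermediate steps.
    """
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred
```

Because each update is deterministic given the noise prediction, a schedule such as t = 1000, 950, 900, … traverses the same trajectory family with roughly 20× fewer denoiser calls.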
Future research directions include improving the model's capability for higher resolutions, accelerating the sampling process, and evaluating the utility of the translated images for downstream vision tasks.