
Dual Diffusion Implicit Bridges for Image-to-Image Translation (2203.08382v4)

Published 16 Mar 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Common image-to-image translation methods rely on joint training over data from both source and target domains. The training process requires concurrent access to both datasets, which hinders data separation and privacy protection; and existing models cannot be easily adapted for translation of new domain pairs. We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models, that circumvents training on domain pairs. Image translation with DDIBs relies on two diffusion models trained independently on each domain, and is a two-step process: DDIBs first obtain latent encodings for source images with the source diffusion model, and then decode such encodings using the target model to construct target images. Both steps are defined via ordinary differential equations (ODEs), thus the process is cycle consistent only up to discretization errors of the ODE solvers. Theoretically, we interpret DDIBs as concatenation of source to latent, and latent to target Schrodinger Bridges, a form of entropy-regularized optimal transport, to explain the efficacy of the method. Experimentally, we apply DDIBs on synthetic and high-resolution image datasets, to demonstrate their utility in a wide variety of translation tasks and their inherent optimal transport properties.

Authors (4)
  1. Xuan Su (3 papers)
  2. Jiaming Song (78 papers)
  3. Chenlin Meng (39 papers)
  4. Stefano Ermon (279 papers)
Citations (168)

Summary

  • The paper introduces a decoupled two-step process that leverages independent diffusion models to translate images without requiring paired data.
  • It employs numerical integration of probability flow ODEs to encode images into a Gaussian latent space and decode them with near-perfect cycle consistency.
  • Experimental results across synthetic, color transfer, paired, and class-conditional tasks demonstrate competitive performance and efficient scalability.

This paper presents Dual Diffusion Implicit Bridges (DDIBs), a novel method for image-to-image translation that leverages diffusion models. Unlike traditional approaches such as GANs and normalizing flows, which typically require joint training on paired or unpaired data from both source and target domains, DDIBs decouple the training process. This offers significant advantages in data privacy and in adaptability across multiple domains.

The core idea of DDIBs is built upon denoising diffusion implicit models (DDIMs) (Song et al., 2020). DDIMs introduce a deterministic and reversible mapping between data points (images) and a latent space (typically a standard Gaussian distribution) via a probability flow (PF) ordinary differential equation (ODE). This ODE describes the continuous-time evolution of a data sample under the diffusion process.
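For concreteness, the deterministic DDIM update (the $\eta = 0$ sampler of Song et al., 2020) that realizes this reversible mapping can be written, in standard notation, as

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}} \left( \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \right) + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t),$$

where $\epsilon_\theta$ is the trained noise-prediction network and $\bar\alpha_t$ the cumulative noise schedule. Because each step is deterministic, the same recursion can be run in reverse to encode an image into its latent.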

DDIBs utilize two independently trained diffusion models, one for the source domain ($p_s$) and one for the target domain ($p_t$). Both models learn the underlying data distribution of their respective domains. Image translation from a source image $x^{(s)}$ to a target image $x^{(t)}$ with DDIBs is a two-step process:

  1. Encoding: The source image $x^{(s)}$ is transported from time $t=0$ to $t=1$ in the source domain's latent space. This is achieved by solving the source diffusion model's PF ODE in the forward direction, starting from $x^{(s)}$ at $t=0$. The result at $t=1$ is a latent code $x^{(l)}$. In the paper's ODESolve notation: $x^{(l)} = \mathrm{ODESolve}(x^{(s)}; v^{(s)}_\theta, 0, 1)$, where $v^{(s)}_\theta$ is the velocity field learned by the source diffusion model.
  2. Decoding: The latent code $x^{(l)}$ (treated as the state at time $t=1$) is then transported from $t=1$ back to $t=0$ in the target domain. This is done by solving the target diffusion model's PF ODE in the reverse direction, starting from $x^{(l)}$ at $t=1$. The result at $t=0$ is the translated image $x^{(t)}$: $x^{(t)} = \mathrm{ODESolve}(x^{(l)}; v^{(t)}_\theta, 1, 0)$, where $v^{(t)}_\theta$ is the velocity field learned by the target diffusion model.

The ODESolve function represents numerical integration of the PF ODE $\mathrm{d}x = v_\theta(t, x)\,\mathrm{d}t$, where $v_\theta$ is the learned velocity field derived from the diffusion model's noise-prediction network. The paper implements this using the specific Euler-like discretization proposed in DDIMs (Song et al., 2020).
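As a rough sketch of the two-step procedure (not the authors' code; the paper uses the DDIM discretization, while plain Euler is shown here for clarity), where `velocity` stands for the learned PF-ODE velocity field and all names and signatures are illustrative assumptions:

```python
def ode_solve(x, velocity, t0, t1, num_steps=100):
    """Euler integration of the probability-flow ODE dx = v(t, x) dt.

    `velocity` is a callable (t, x) -> dx/dt derived from a trained
    diffusion model's noise-prediction network; `x` can be any array
    type supporting + and * (e.g., a NumPy array or torch tensor).
    Integrating forward (0 -> 1) encodes an image into a latent;
    integrating backward (1 -> 0) decodes a latent into an image.
    """
    dt = (t1 - t0) / num_steps  # negative when integrating in reverse
    t = t0
    for _ in range(num_steps):
        x = x + velocity(t, x) * dt  # one explicit Euler step
        t = t + dt
    return x

def ddib_translate(x_source, v_source, v_target, num_steps=100):
    """DDIB translation: encode with the source model's PF ODE,
    then decode the latent with the target model's PF ODE."""
    x_latent = ode_solve(x_source, v_source, 0.0, 1.0, num_steps)
    return ode_solve(x_latent, v_target, 1.0, 0.0, num_steps)
```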

A significant practical advantage of DDIBs is the decoupled training. Since the source and target diffusion models are trained independently on their respective datasets, there is no need for paired data or simultaneous access to both datasets during training. This inherently supports data privacy; for example, one party owning the source data can train their model, encode their data, send only the latent codes to the other party owning the target data, who then decodes them. Only latent codes and final translated images are exchanged, not the raw datasets (illustrated in Appendix A).
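A minimal sketch of that exchange, reusing the hypothetical `ode_solve` above (the paper's Appendix A illustrates the workflow but does not prescribe an implementation, so the transport function here is invented for illustration):

```python
# Party A (source data owner): never shares raw images.
latents = [ode_solve(x, v_source, 0.0, 1.0) for x in source_images]

# Only the latent codes cross the trust boundary.
send_to_party_b(latents)  # hypothetical transport function

# Party B (target data owner): decodes the latents with its own model.
translated = [ode_solve(z, v_target, 1.0, 0.0) for z in latents]
```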

Furthermore, this decoupled training structure means that adding a new domain to a set of existing domains only requires training one new diffusion model for that domain, rather than training a new translation model for every pair involving the new domain. This leads to linear scaling ($O(N)$ models for $N$ domains) compared to the quadratic scaling ($O(N^2)$ models) often required by pairwise methods like CycleGAN (Zhu et al., 2017): for 10 domains, pairwise methods need on the order of 45 pair-specific models, while DDIBs need only 10. For domains where conditional diffusion models can be trained on multiple classes simultaneously (e.g., ImageNet (Dhariwal & Nichol, 2021)), the number of models needed can be reduced even further.

Theoretically, the paper interprets DDIBs as a concatenation of two Schrödinger bridge problems (SBPs). Score-based generative models (SGMs) and their PF ODEs are shown to be equivalent to special types of SBPs (with linear or degenerate drift) where one endpoint distribution is Gaussian (Chen et al., 2021). The encoding step (data to Gaussian latent) corresponds to an SBP from the source data distribution to a Gaussian prior, and the decoding step (Gaussian latent to data) corresponds to the reverse of an SBP from a Gaussian prior to the target data distribution. This optimal transport perspective suggests that DDIBs perform translation by traversing paths that minimize a form of entropy-regularized transport cost between the distributions.
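In one standard formulation, the Schrödinger bridge problem seeks the path measure closest in KL divergence to a reference diffusion, subject to fixed endpoint marginals:

$$\min_{Q} \; D_{\mathrm{KL}}(Q \,\|\, W) \quad \text{subject to} \quad Q_0 = p_{\mathrm{data}}, \;\; Q_1 = p_{\mathrm{prior}},$$

where $Q$ ranges over path measures on $[0,1]$ and $W$ is a reference (e.g., Wiener) measure. In this view, DDIBs chain one such bridge from $p_s$ to the Gaussian prior with the reverse of another bridge from the prior to $p_t$.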

A key property guaranteed by using reversible ODEs is exact cycle consistency: translating an image from domain A to B and then back to A theoretically recovers the original image exactly, assuming zero discretization error from the ODE solver. In practice, using numerical solvers introduces minor errors, but the paper demonstrates very low cycle inconsistency errors on 2D synthetic data (Table 1).
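Concretely, exact cycle consistency follows because each ODESolve is invertible: with exact solutions,

$$\mathrm{ODESolve}\big(\mathrm{ODESolve}(z;\, v_\theta, 1, 0);\, v_\theta, 0, 1\big) = z,$$

so in the round trip A→B→A the target-domain decode is undone by the target-domain encode, and the source-domain encode by the source-domain decode, recovering $x^{(s)}$ exactly up to solver error.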

The paper demonstrates DDIBs on various tasks:

  • 2D Synthetic Translation: Visualizations show smooth transformations between different 2D data distributions, and quantitative results show minimal L2 distance after cycle translation.
  • Example-Guided Color Transfer: By training a diffusion model on the colors of a reference image, DDIBs can transfer its color palette to other images. Comparisons (Table 2) show pixel-wise MSE similar to other optimal transport methods for color transfer. The approach requires training a separate diffusion model for each reference image used for color transfer.
  • Paired Domain Translation: Evaluated on datasets like Facades and Maps (which provide ground truth pairs for quantitative assessment), DDIBs achieve competitive pixel-wise MSE values compared to CycleGAN and AlignFlow (Table 3), despite the models being trained independently without access to the paired data. For these tasks, the paper uses a color conversion heuristic on the segmentation maps, motivated by optimal transport, to reduce color differences between domains before training the diffusion models (Appendix D).
  • Class-Conditional ImageNet Translation: Using pre-trained conditional diffusion models from Dhariwal & Nichol (2021), DDIBs translate images between different ImageNet classes. The method maintains content like animal poses while adapting the image to the target class characteristics (Figures 3 & 4). This showcases the multi-domain capability using a single conditional model.

Practical considerations for implementing DDIBs include:

  • Training: Requires training standard diffusion models (specifically DDIM-style models) for each domain. This can be computationally intensive, but uses existing, well-established training procedures.
  • Inference: Translation requires solving two ODEs numerically. The number of ODE-solver steps trades off translation speed against accuracy; faster ODE solvers could improve performance (see the sketch after this list).
  • Model Choice: The performance depends heavily on the quality of the independently trained diffusion models for each domain. Pre-trained state-of-the-art models (like those used for ImageNet) are beneficial.
  • Limitations: The optimal transport nature might lead to translations that are statistically optimal but potentially visually undesirable in extreme cases (Appendix C). Applying the color transfer method to many images requires training many single-image models.
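As flagged in the inference point above, the step count is the main speed/accuracy knob. A hypothetical usage of the earlier `ddib_translate` sketch:

```python
# Fewer solver steps: faster, but larger discretization (and cycle) error.
fast = ddib_translate(x, v_source, v_target, num_steps=20)

# More solver steps: slower, but closer to the exact PF-ODE trajectory.
accurate = ddib_translate(x, v_source, v_target, num_steps=200)
```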

In summary, DDIBs provide a privacy-preserving and scalable approach to image-to-image translation by leveraging the deterministic mapping properties of diffusion model probability flow ODEs. By training diffusion models independently on each domain and performing translation via a two-step ODE solving process (encode to latent, decode to target), DDIBs circumvent the need for joint training and scale linearly with the number of domains. Their theoretical link to concatenated Schrödinger Bridges provides an optimal transport interpretation, while experiments demonstrate strong performance and cycle consistency on diverse image translation tasks.