Unpaired Image-to-Image Translation via Neural Schrödinger Bridge (2305.15086v3)

Published 24 May 2023 in cs.CV, cs.AI, cs.LG, and stat.ML

Abstract: Diffusion models are a powerful class of generative models which simulate stochastic differential equations (SDEs) to generate data from noise. While diffusion models have achieved remarkable progress, they have limitations in unpaired image-to-image (I2I) translation tasks due to the Gaussian prior assumption. The Schrödinger Bridge (SB), which learns an SDE to translate between two arbitrary distributions, has risen as an attractive solution to this problem. Yet, to the best of our knowledge, none of the SB models so far have been successful at unpaired translation between high-resolution images. In this work, we propose the Unpaired Neural Schrödinger Bridge (UNSB), which expresses the SB problem as a sequence of adversarial learning problems. This allows us to incorporate advanced discriminators and regularization to learn an SB between unpaired data. We show that UNSB is scalable and successfully solves various unpaired I2I translation tasks. Code: https://github.com/cyclomon/UNSB

References (55)
  1. Mutual information neural estimation. In ICML, 2018.
  2. One-sided unsupervised domain mapping. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/59b90e1005a220e2ebc542eb9d950b1e-Paper.pdf.
  3. Diffusion Schrödinger bridge with applications to score-based generative modeling. In NeurIPS, 2021.
  4. The Schrödinger bridge between Gaussian measures has a closed form. In AISTATS, 2023.
  5. Reflected Schrödinger bridge: Density control with path constraints. In 2021 American Control Conference (ACC), pp. 1137–1142. IEEE, 2021.
  6. Reusing discriminators for encoding: Towards unsupervised image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8168–8177, 2020.
  7. Likelihood training of Schrödinger bridge using forward-backward SDEs theory. In ICLR, 2022.
  8. Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrödinger bridge. SIAM Review, 63(2):249–313, 2021.
  9. StarGAN v2: Diverse image synthesis for multiple domains. In CVPR, 2020.
  10. Diffusion posterior sampling for general noisy inverse problems. In ICLR, 2023.
  11. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. arXiv preprint arXiv:2303.11435, 2023.
  12. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2015.
  13. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  14. Generative adversarial nets. In NeurIPS, 2014.
  15. Entropic neural optimal transport via diffusion processes. In NeurIPS, 2023a.
  16. Entropic neural optimal transport via diffusion processes. In NeurIPS, 2023b.
  17. Building the bridge of Schrödinger: A continuous entropic optimal transport benchmark. In NeurIPS Track on Datasets and Benchmarks, 2023c.
  18. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  19. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
  20. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  21. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  22. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1125–1134, 2017.
  23. Exploring patch-wise semantic relation for contrastive learning in image-to-image translation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  18260–18269, June 2022.
  24. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
  25. Auto-encoding variational Bayes. In ICLR, 2014.
  26. Neural optimal transport. In ICLR, 2023.
  27. Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. arXiv preprint arXiv:1308.0215, 2013.
  28. AR-DAE: Towards unbiased neural entropy estimation. In ICML, 2020.
  29. Deep generalized Schrödinger bridge. arXiv preprint arXiv:2209.09893, 2022.
  30. I2SB: Image-to-image Schrödinger bridge. arXiv preprint arXiv:2302.05872, 2023.
  31. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  32. Contrastive learning for unpaired image-to-image translation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision – ECCV 2020, pp.  319–345, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58545-7.
  33. Multisample flow matching: Straightening flows with minibatch couplings. In ICML, 2023.
  34. Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23(1):313–329, 1991.
  35. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  36. Unbiased estimation using a class of diffusion processes. Journal of Computational Physics, 472:111643, 2023.
  37. Can push-forward generative models fit multimodal distributions? In NeurIPS, 2022.
  38. Erwin Schrödinger. Sur la théorie relativiste de l’électron et l’interprétation de la mécanique quantique. Annales de l’institut Henri Poincaré, 2(4):269–310, 1932.
  39. Conditional simulation using diffusion Schrödinger bridges. In Uncertainty in Artificial Intelligence, pp. 1792–1802. PMLR, 2022.
  40. Diffusion Schrödinger bridge matching. In NeurIPS, 2023.
  41. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  42. Denoising diffusion implicit models. In ICLR, 2021a.
  43. Score-based generative modeling through stochastic differential equations. In ICLR, 2021b.
  44. Dual diffusion implicit bridges for image-to-image translation. In ICLR, 2023.
  45. Transport with support: Data-conditional diffusion bridges. arXiv preprint arXiv:2301.13636, 2023.
  46. Riemannian diffusion Schrödinger bridge. arXiv preprint arXiv:2207.03024, 2022.
  47. Conditional flow matching: Simulation-free dynamic optimal transport. arXiv preprint arXiv:2302.00482, 2023.
  48. Solving Schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021.
  49. Deep generative learning via Schrödinger bridge. In International Conference on Machine Learning, pp. 10794–10804. PMLR, 2021a.
  50. Instance-wise hard negative example generation for contrastive learning in unpaired image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  14020–14029, October 2021b.
  51. Tackling the generative learning trilemma with denoising diffusion GANs. In ICLR, 2022.
  52. Path integral sampler: a stochastic control approach for sampling. arXiv preprint arXiv:2111.15141, 2021.
  53. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. In NeurIPS, 2022.
  54. The spatially-correlative loss for various image translation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  16407–16417, June 2021.
  55. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.

Summary

  • The paper introduces UNSB, a novel method that reformulates the Schrödinger Bridge problem as a sequence of adversarial learning tasks for unpaired image translation.
  • The paper employs a time-conditional neural network with multi-step refinement to progressively generate high-resolution images while mitigating dimensionality challenges.
  • Experimental results demonstrate that UNSB outperforms GAN and diffusion-based methods in FID and KID scores on benchmarks such as Horse2Zebra and Map2Cityscape.

This paper introduces the Unpaired Neural Schrödinger Bridge (UNSB), a novel approach for unpaired image-to-image translation, particularly effective for high-resolution images. The core challenge addressed is the difficulty of traditional diffusion models in unpaired settings due to their fixed Gaussian prior and the limitations of existing Schrödinger Bridge (SB) methods, which struggle with scalability and the curse of dimensionality in high-dimensional data spaces like images.

UNSB formulates the SB problem, which aims to find the most likely stochastic process bridging two arbitrary distributions, as a sequence of adversarial learning problems. This is inspired by the self-similarity property of SBs, which states that an SB restricted to a sub-interval is also an SB. The authors discretize the time interval $[0,1]$ into steps $t_0, \ldots, t_N$ and learn the transitions $p(\mathbf{x}_{t_{i+1}} \mid \mathbf{x}_{t_i})$ sequentially, forming a Markov chain from the source distribution at $t_0$ to the target distribution at $t_N$.
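
In this notation, the end-to-end translation kernel is obtained by composing the learned transitions via the standard Markov factorization, written out here for concreteness:

$$
p(\mathbf{x}_{t_N} \mid \mathbf{x}_{t_0}) = \int \prod_{i=0}^{N-1} p(\mathbf{x}_{t_{i+1}} \mid \mathbf{x}_{t_i}) \, d\mathbf{x}_{t_1} \cdots d\mathbf{x}_{t_{N-1}}
$$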

The key idea is to learn a time-conditional neural network $q_\phi(\mathbf{x}_1 \mid \mathbf{x}_{t_i})$ for each step, which predicts the target image $\mathbf{x}_1$ given an intermediate image $\mathbf{x}_{t_i}$. Learning this mapping is posed as a constrained optimization problem: minimizing an entropy-regularized transport cost between the intermediate distribution and the target distribution, subject to the constraint that the marginal distribution of the predicted target images matches the true target distribution $\pi_1$. This constrained problem is translated into a Lagrangian formulation combining the following terms (a schematic combination is sketched in code after the list):

  1. An Adversarial Loss ($L_{Adv}$): Estimated via adversarial learning, this loss ensures the distribution of predicted target images $q_\phi(\mathbf{x}_1)$ matches the true target distribution $p(\mathbf{x}_1)$. This allows the use of advanced discriminators (like patch-wise discriminators) to better capture the target distribution's characteristics, mitigating the curse of dimensionality compared to methods relying solely on empirical data matching.
  2. An SB Loss ($L_{SB}$): Related to the entropy-regularized transport cost, approximated using mutual information estimation.
  3. A Regularization Loss ($L_{Reg}$): An application-specific loss that enforces consistency or structural similarity between the initial source image $\mathbf{x}_0$ and the predicted target image $\mathbf{x}_1(\mathbf{x}_{t_i})$. This term incorporates inductive biases relevant to the translation task (e.g., using a patch-wise contrastive loss inspired by CUT (2007.08971)).
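
The following PyTorch-style sketch shows how such a Lagrangian combination could be assembled. It is a minimal illustration, not the authors' implementation: the weights `lam_sb` and `lam_reg`, the discriminator interface `D(x, t)`, and the helpers `sb_loss_fn` / `reg_loss_fn` are all assumed placeholders.

```python
def generator_loss(D, x0, x_ti, t_i, x1_pred,
                   sb_loss_fn, reg_loss_fn,
                   lam_sb=1.0, lam_reg=1.0):
    """Schematic UNSB generator objective (illustrative, not official code).

    D           -- time-conditional discriminator scoring target-domain realism
    x0          -- source image batch
    x_ti        -- intermediate samples at time t_i
    x1_pred     -- generator prediction of the target image x_1
    sb_loss_fn  -- stand-in for the entropy-regularized transport term L_SB
    reg_loss_fn -- stand-in for the task-specific consistency term L_Reg
    """
    # L_Adv: non-saturating adversarial term pushing q_phi(x_1) toward p(x_1)
    adv = -D(x1_pred, t_i).mean()
    # L_SB: entropy-regularized transport cost between x_{t_i} and x_1
    sb = sb_loss_fn(x_ti, x1_pred, t_i)
    # L_Reg: structural consistency between the source x_0 and the prediction
    reg = reg_loss_fn(x0, x1_pred)
    # Lagrangian combination of the three terms
    return adv + lam_sb * sb + lam_reg * reg
```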

By combining these components, UNSB aims to learn a mapping that generalizes well beyond the limited samples available in high dimensions. The multi-step generation process, obtained by iteratively sampling from the learned transition kernels $q_\phi(\mathbf{x}_{t_{i+1}} \mid \mathbf{x}_{t_i})$, allows for gradual refinement of the output image, which is beneficial for complex translations.
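
As a concrete illustration of this multi-step inference, a sampling loop might look like the sketch below; `model` (predicting $\mathbf{x}_1$ from $(\mathbf{x}_{t_i}, t_i)$) and the transition sampler `q_sample` are hypothetical stand-ins for the learned kernel.

```python
import torch

@torch.no_grad()
def translate(model, q_sample, x0, nfe=5):
    """Simulate the learned Markov chain from a source image x0 (sketch).

    model    -- time-conditional generator predicting x_1 from (x_t, t)
    q_sample -- draws x_{t_{i+1}} given (x_{t_i}, predicted x_1, t_i);
                a stand-in for the learned stochastic transition kernel
    nfe      -- number of function evaluations, i.e. time steps
    """
    ts = torch.linspace(0.0, 1.0, nfe + 1)   # discretized time grid t_0..t_N
    x = x0
    for i in range(nfe):
        x1_pred = model(x, ts[i])            # predict the target image x_1
        if i < nfe - 1:
            x = q_sample(x, x1_pred, ts[i])  # stochastic intermediate sample
        else:
            x = x1_pred                      # final step returns the output
    return x
```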

For practical implementation, the authors use a single time-conditional DNN for $q_\phi$, taking $(\mathbf{x}_{t_i}, t_i)$ as input, and train it by sampling random time steps. Intermediate samples $\mathbf{x}_{t_i}$ are generated by simulating the learned Markov chain from the source distribution $\pi_0$. The adversarial learning involves training a discriminator alongside the generator.
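
A minimal training step consistent with this description might look as follows. The helpers `sample_t` and `simulate_chain` and the hinge discriminator loss are assumptions for illustration; `gen_loss_fn` could be, e.g., `functools.partial(generator_loss, sb_loss_fn=..., reg_loss_fn=...)` from the earlier sketch.

```python
import torch

def training_step(G, D, opt_G, opt_D, x0, x1_real,
                  sample_t, simulate_chain, gen_loss_fn):
    """One schematic UNSB update on unpaired batches x0 / x1_real (sketch)."""
    t_i = sample_t()                       # random discretized time step
    with torch.no_grad():
        x_ti = simulate_chain(G, x0, t_i)  # intermediate sample x_{t_i}

    # Discriminator update: hinge loss on real targets vs. predictions
    x1_fake = G(x_ti, t_i).detach()
    d_loss = (torch.relu(1.0 - D(x1_real, t_i)).mean()
              + torch.relu(1.0 + D(x1_fake, t_i)).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator update with the combined L_Adv + L_SB + L_Reg objective
    x1_pred = G(x_ti, t_i)
    g_loss = gen_loss_fn(D, x0, x_ti, t_i, x1_pred)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```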

Experimental results demonstrate UNSB's effectiveness. On toy examples like translating between two concentric spheres, UNSB shows robustness to the curse of dimensionality where other SB/OT methods fail. For high-resolution (256×256) unpaired image-to-image translation tasks (Horse2Zebra, Map2Cityscape, Summer2Winter, Map2Satellite), UNSB achieves superior FID and KID scores compared to various GAN-based methods (CycleGAN (1703.10593), MUNIT (1804.04732), CUT (2007.08971)) and diffusion/SB baselines (NOT (2303.10116), SDEdit (2108.01073), P2P (2208.01626)). Qualitative results show UNSB generates more realistic target-domain images while preserving source structure.

The number of function evaluations (NFE), which corresponds to the number of time steps used in the generation process, impacts quality. While NFE=1 (analogous to a single-step GAN) yields reasonable results, increasing NFE (typically 3-5 steps) consistently improves quality, reflecting the benefits of the multi-step refinement. Ablation studies confirm that the advanced discriminator, regularization, and multi-step generation each contribute positively to the performance. UNSB also demonstrates stochasticity, producing diverse outputs for a given input.
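
In terms of the hypothetical `translate` helper sketched above, the NFE trade-off is simply the choice of how finely the chain is discretized:

```python
# More steps refine the translation at the cost of extra forward passes.
out_1 = translate(model, q_sample, x0, nfe=1)  # single-step, GAN-like
out_5 = translate(model, q_sample, x0, nfe=5)  # multi-step refinement
```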

In summary, UNSB provides a practical framework for applying Schrödinger Bridges to challenging unpaired image-to-image translation problems in high dimensions by addressing the curse of dimensionality through adversarial learning, regularization, and a multi-step generative process. The implementation involves training a time-conditional generator and a discriminator using a combined objective function. While computationally more intensive than single-step methods, it achieves state-of-the-art performance on various unpaired translation benchmarks. The code is available for reproduction.