- The paper introduces a novel contrastive loss that maps corresponding patches to similar features, bypassing cycle-consistency constraints.
- It employs a patch-based strategy with internal negatives to enforce detailed content preservation across different image domains.
- The method achieves lower FID scores and more efficient training on datasets such as Horse-to-Zebra, Cat-to-Dog, and Cityscapes.
Contrastive Learning for Unpaired Image-to-Image Translation
This paper introduces a novel approach to unpaired image-to-image translation that leverages contrastive learning to maximize mutual information between corresponding patches in the input and output images. The proposed method enforces content preservation across domains without cycle-consistency constraints, which have been the de facto standard in unpaired image translation models.
Methodology
The core idea is to map corresponding patches from input and output images to similar points in a learned feature space while ensuring that non-corresponding patches are mapped to distant points. This is achieved through a contrastive loss based on the InfoNCE framework. The loss function encourages an embedding that brings an output patch closer to its corresponding input patch and further away from other patches within the same image (referred to as negative patches).
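In symbols, for an output-patch embedding $v$, its corresponding input-patch embedding $v^+$, and $N$ negative input-patch embeddings $v^-_n$, the loss takes the standard InfoNCE form, a cross-entropy over scaled similarities with temperature $\tau$:

$$ \ell(v, v^+, v^-) = -\log \frac{\exp(v \cdot v^+ / \tau)}{\exp(v \cdot v^+ / \tau) + \sum_{n=1}^{N} \exp(v \cdot v^-_n / \tau)} $$

Below is a minimal PyTorch sketch of this patchwise loss; the tensor names, shapes, and temperature value are illustrative assumptions rather than the paper's official implementation:

```python
import torch
import torch.nn.functional as F

def patch_nce_loss(query, positive, negatives, temperature=0.07):
    """Patchwise InfoNCE loss (illustrative sketch, not the official CUT code).

    query:     (B, C)    embeddings of output patches
    positive:  (B, C)    embeddings of the corresponding input patches
    negatives: (B, N, C) embeddings of N other patches from the same input
    """
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Similarity with the positive patch: (B, 1)
    l_pos = (query * positive).sum(dim=-1, keepdim=True)
    # Similarities with the internal negatives: (B, N)
    l_neg = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1)

    # Cross-entropy over 1 + N logits, where index 0 is the positive.
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```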
Two critical design choices are noted:
- Patch-based Approach: Unlike previous methods that often operate on entire images, this method employs a multilayer, patch-based strategy. Local patches provide a more granular training signal, ensuring that content is preserved in detail.
- Internal Negatives: Instead of drawing negative samples from the rest of the dataset, which adds complexity without clear benefit, the method samples negatives internally from the same input image. This forces each output patch to better retain and reflect the content of its own input (see the sampling sketch after this list).
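To make the internal-negative sampling concrete, here is a sketch of how patch embeddings might be drawn from an encoder feature map; the function name, shapes, and default patch count are hypothetical assumptions, not taken from the paper's code:

```python
import torch

def sample_patches(feat, num_patches=256, patch_ids=None):
    """Sample patch embeddings from a (B, C, H, W) feature map.

    Illustrative sketch under assumed shapes, not the official implementation.
    """
    B, C, H, W = feat.shape
    feat = feat.permute(0, 2, 3, 1).reshape(B, H * W, C)  # (B, HW, C)
    if patch_ids is None:
        # Random spatial locations, shared between input and output features.
        patch_ids = torch.randperm(H * W, device=feat.device)[:num_patches]
    return feat[:, patch_ids, :], patch_ids  # (B, num_patches, C)
```

Sampling the same patch_ids from the input image's features and the translated output's features makes position i in each tensor a positive pair, while the remaining sampled locations in the same image serve as the internal negatives.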
Results
The paper demonstrates the effectiveness of the proposed method across several datasets, including Horse-to-Zebra, Cat-to-Dog, and Cityscapes. It shows notable improvements in image quality and training efficiency over state-of-the-art models:
- Fréchet Inception Distance (FID): The model attains significantly lower FID scores, indicating that the generated images are closer to real images in a perceptual feature space (the metric is recalled after this list).
- Training Efficiency: The proposed method, particularly its lightweight variant FastCUT, delivers strong results while cutting training time and memory usage, making it a practical choice when computational resources are constrained.
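For context, FID (Heusel et al., 2017) fits Gaussians to Inception-network activations of real and generated images and measures the distance between them; lower is better:

$$ \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right) $$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance of the real and generated image sets, respectively.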
Implications
The implications of this research are twofold:
- Practical Applications: The improved training efficiency and quality make this method suitable for real-world applications where computational resources may be limited.
- Theoretical Insights: Using contrastive learning to maximize mutual information in image translation opens new avenues for understanding and improving content-preservation mechanisms in GANs (the bound below makes the connection precise).
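The connection rests on a standard result (van den Oord et al., 2018): the InfoNCE loss with one positive and $N$ negatives lower-bounds mutual information, so minimizing the loss maximizes that bound:

$$ I(v; v^+) \ge \log(N + 1) - \mathcal{L}_{\mathrm{NCE}} $$

This is a general property of InfoNCE rather than a result derived in this paper.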
Future Directions
This work paves the way for several interesting future developments:
- Extension to Video Translation: Given the success on still images, applying the method to video-to-video translation is a natural next step.
- Hybrid Models: Combining the strengths of cycle-consistency and contrastive learning could yield even better performance.
- Other Domains: The approach could be tested on a broader range of domains, including medical imaging, where content preservation is critically important.
Conclusion
This paper presents a compelling approach to unpaired image-to-image translation built on contrastive learning. By maximizing patchwise mutual information, it improves both translation quality and training efficiency, and the ideas it introduces hold significant promise for diverse applications and further innovation in image translation.