- The paper introduces a content-conditioned style encoder to address the content loss problem in few-shot unsupervised image translation.
- It employs a network with a content encoder, a COCO style encoder with constant style bias, and an image decoder to enhance structure preservation.
- Experiments on diverse datasets show improved mFID, PAcc, and mIoU scores over the FUNIT baseline, along with stronger human preference results, confirming its robustness.
COCO-FUNIT: Few-Shot Unsupervised Image Translation
The COCO-FUNIT model, proposed by Saito et al., addresses a central challenge in few-shot unsupervised image-to-image translation. Existing models in this space often fail to preserve the content structure of the input while adopting the style of an unseen domain, a failure mode the authors term the "content loss" problem. The paper introduces a content-conditioned style encoder to mitigate this issue.
Context and Background
Few-shot unsupervised image-to-image translation aims to map an image from a source domain to a target domain using only a few example images of the target, without paired supervision. Despite recent advances, existing methods struggle to keep the source image's content intact, particularly when the source and example images differ substantially in pose.
Proposed Methodology
The COCO-FUNIT model tackles the content loss problem through a novel network architecture featuring the content-conditioned style encoder (COCO). This encoder computes the style embedding conditioned on the input content image, suppressing the leakage of task-irrelevant appearance information from the style example into the output.
The model consists of three main components: a content encoder, the COCO style encoder, and an image decoder. The COCO encoder incorporates a constant style bias (CSB), a learned, input-independent term that makes the style code more robust to small variations in the example images.
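To make the architecture concrete, the PyTorch-style sketch below shows one way the three components could fit together, with the style code computed from both the content and example images and blended with a learned constant style bias. The layer choices, feature dimensions, and the FiLM-like way the style code modulates the decoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class COCOStyleEncoder(nn.Module):
    """Illustrative content-conditioned style encoder (not the official code).

    The style code is computed from BOTH the style example and the content
    image, and is combined with a learned constant style bias (CSB) so that
    small changes in the example image perturb the code less.
    """
    def __init__(self, feat_dim=256, style_dim=64):
        super().__init__()
        # Hypothetical backbones: any conv encoders producing pooled features.
        self.style_backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.content_backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Learned, input-independent constant style bias.
        self.constant_style_bias = nn.Parameter(torch.zeros(feat_dim))
        # Content-conditioned fusion into the final style code.
        self.fuse = nn.Linear(2 * feat_dim, style_dim)

    def forward(self, content_img, style_img):
        s = self.style_backbone(style_img).flatten(1)      # example features
        c = self.content_backbone(content_img).flatten(1)  # content features
        s = s + self.constant_style_bias                    # add CSB
        return self.fuse(torch.cat([s, c], dim=1))          # style code


class COCOFUNITGenerator(nn.Module):
    """Content encoder + COCO style encoder + decoder (shapes are assumptions)."""
    def __init__(self, style_dim=64):
        super().__init__()
        self.content_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.style_encoder = COCOStyleEncoder(style_dim=style_dim)
        # Decoder modulated by the style code via a simple scale/shift (FiLM-like).
        self.to_scale_shift = nn.Linear(style_dim, 2 * 128)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 7, stride=1, padding=3), nn.Tanh(),
        )

    def forward(self, content_img, style_img):
        feat = self.content_encoder(content_img)
        style = self.style_encoder(content_img, style_img)
        scale, shift = self.to_scale_shift(style).chunk(2, dim=1)
        feat = feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.decoder(feat)
```

The key design point this sketch illustrates is that the style encoder receives the content image as an extra input, so appearance details of the example that are irrelevant to the content have less influence on the final style code.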
Experimental Validation
The authors validate COCO-FUNIT on diverse and challenging datasets (Carnivores, Mammals, Birds, and Motorbikes) characterized by substantial pose and appearance variations. The model demonstrates clear improvements in both style faithfulness and content preservation over the previous FUNIT baseline.
Quantitative metrics such as mFID, PAcc, and mIoU, together with human preference studies, consistently favor COCO-FUNIT over the baseline across the datasets, indicating that it preserves content reliably while faithfully adapting style.
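For reference, the sketch below shows the standard way a class-averaged FID (mFID) is computed: fit Gaussians to real and translated feature sets for each target class, take the Fréchet distance, and average over classes. It assumes feature vectors (e.g., from an Inception network) have already been extracted; the function and variable names are illustrative, and this is not the authors' evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (N x D) feature sets."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2.0 * covmean))

def mean_fid(feats_by_class: dict) -> float:
    """mFID: average per-class FID. Maps class -> (real_feats, fake_feats)."""
    return float(np.mean([fid(real, fake)
                          for real, fake in feats_by_class.values()]))
```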
Implications and Future Directions
COCO-FUNIT's main contribution is its redesigned style encoder. By conditioning the style computation on the content image, the model addresses a key deficiency of previous models and offers a more reliable approach to few-shot image-to-image translation.
The paper also explores style interpolation, suggesting that blending the styles of existing domains can generate novel ones (see the sketch below). Future research might further refine the encoder design or extend the model's applicability to other complex translation tasks.
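A minimal sketch of such style interpolation, building on the illustrative generator above (so the attribute names `style_encoder`, `content_encoder`, `to_scale_shift`, and `decoder` are assumptions from that sketch, not the official API), linearly blends the style codes of two example images before decoding:

```python
import torch

@torch.no_grad()
def interpolate_styles(gen, content_img, style_a, style_b, alpha=0.5):
    """Blend two style codes and decode with the illustrative generator above.

    alpha = 0.0 reproduces style_a's code, alpha = 1.0 reproduces style_b's.
    """
    s_a = gen.style_encoder(content_img, style_a)
    s_b = gen.style_encoder(content_img, style_b)
    style = (1.0 - alpha) * s_a + alpha * s_b
    # Re-use the generator's decoding path with the blended style code.
    feat = gen.content_encoder(content_img)
    scale, shift = gen.to_scale_shift(style).chunk(2, dim=1)
    feat = feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
    return gen.decoder(feat)

# Usage (shapes and variable names are illustrative):
# gen = COCOFUNITGenerator()
# out = interpolate_styles(gen, content, example_a, example_b, alpha=0.3)
```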
Overall, COCO-FUNIT marks a significant advancement in few-shot image translation, presenting a methodologically sound and empirically validated approach to overcoming content preservation challenges in unsupervised settings.