- The paper presents the Visual Concept Translator (VCT), which preserves source content while integrating reference visual concepts through novel inversion techniques.
- VCT combines Pivotal Turning Inversion (PTI) and Content-Concept Fusion (CCF) within a dual-stream denoising process to support a broad range of image synthesis tasks.
- Extensive experiments demonstrate VCT's adaptability across various image types, streamlining content generation without extensive retraining.
Overview of General Image-to-Image Translation with One-Shot Image Guidance
The paper presents a novel framework, the Visual Concept Translator (VCT), for general image-to-image (I2I) translation guided by a single reference image. The authors address a key limitation of existing methods: the difficulty of preserving the content of a source image while faithfully integrating visual concepts from a reference image. The framework builds on recent large-scale diffusion models, such as Latent Diffusion Models (LDMs), and introduces mechanisms such as Pivotal Turning Inversion (PTI) and Content-Concept Fusion (CCF), which reduce the manual input and per-task training that comparable approaches require.
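Because the framework's content preservation rests on inverting a real image back into the diffusion model's noise space, the following minimal sketch illustrates the generic deterministic (DDIM-style) inversion step that inversion-based editing methods typically build on. The tiny `eps_model`, the 50-step linear schedule, and the 4-dimensional "latent" are toy assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of deterministic (DDIM-style) inversion: run the sampling update
# in reverse to recover a noise trajectory that reconstructs the input.
# Toy shapes/schedule; timestep conditioning of the noise predictor is omitted.
import torch

T = 50                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)     # toy linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

eps_model = torch.nn.Sequential(          # placeholder noise predictor
    torch.nn.Linear(4, 64), torch.nn.SiLU(), torch.nn.Linear(64, 4)
)

@torch.no_grad()
def ddim_invert(x0: torch.Tensor) -> list[torch.Tensor]:
    """Step x_t -> x_{t+1} deterministically to recover an invertible noise trajectory."""
    traj, x = [x0], x0
    for t in range(T - 1):
        a_t, a_next = alphas_bar[t], alphas_bar[t + 1]
        eps = eps_model(x)                            # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        traj.append(x)
    return traj                                       # traj[-1] is the "inverted" latent

latents = ddim_invert(torch.randn(1, 4))              # toy 4-d "latent" for brevity
print(len(latents), latents[-1].shape)
```

In an LDM the noise predictor would be a text-conditioned U-Net operating on image latents rather than this toy vector, but the reverse-update pattern is the same.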
Key Insights and Contributions
- Novel Framework for Visual Translation: The Visual Concept Translator (VCT) advances I2I translation by preserving the source image's content while effectively integrating new visual concepts. It does so by inverting both the source and reference images, extracting their essential features, and synthesizing the result within a dual-stream denoising process.
- Content-Concept Inversion (CCI) and Fusion (CCF): The framework comprises two main processes. CCI uses Pivotal Turning Inversion to extract content from the source image and multi-concept inversion to capture visual concepts from the reference image. CCF then employs a dual-stream denoising architecture to fuse the extracted information, ensuring that synthesized images retain the desired characteristics of both inputs (illustrative sketches of both stages follow this list).
- Generalization and Versatility: One key claim of the paper is the VCT's capacity to perform a range of I2I tasks without requiring extensive retraining across new datasets or conditions. The proposed method allows for the synthesis of diverse image types, from artistic creations to virtual reality renderings, demonstrating its adaptability and potential use in numerous applications.
- Extensive Experimental Validation: The authors provide comprehensive experimental results to validate the superiority of their method over existing frameworks. These experiments span various translation tasks, highlighting the framework's robustness in maintaining both content fidelity and visual coherence across different domains.
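To make the multi-concept inversion step in the second bullet concrete, the sketch below shows the generic embedding-optimization pattern it belongs to (in the spirit of textual inversion): only a concept embedding is trained, against a frozen denoiser, until it explains the reference image. The toy denoiser, dimensions, fixed noise level, and variable names are hypothetical simplifications, not the paper's LDM setup.

```python
# Hedged sketch: learn a concept embedding from a single reference image by
# optimizing it against a frozen noise predictor (textual-inversion-style).
# All modules, shapes, and the fixed noise level are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim, emb_dim = 16, 8
ref_latent = torch.randn(1, latent_dim)             # stand-in for the encoded reference image

concept = nn.Parameter(torch.randn(1, emb_dim))     # learnable concept embedding
denoiser = nn.Sequential(nn.Linear(latent_dim + emb_dim, 64), nn.SiLU(),
                         nn.Linear(64, latent_dim)) # toy conditional noise predictor
denoiser.requires_grad_(False)                      # denoiser stays frozen; only the embedding learns

opt = torch.optim.Adam([concept], lr=1e-2)
for step in range(200):
    noise = torch.randn_like(ref_latent)
    noisy = ref_latent + 0.5 * noise                # toy fixed-level noising (no schedule)
    pred = denoiser(torch.cat([noisy, concept], dim=-1))
    loss = nn.functional.mse_loss(pred, noise)      # standard epsilon-prediction objective
    opt.zero_grad(); loss.backward(); opt.step()

print(loss.item())                                  # the tuned `concept` later conditions denoising
```

In an LDM the embedding would live in the text encoder's token space and condition the U-Net through cross-attention; here a simple concatenation stands in for that conditioning.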
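The dual-stream denoising mentioned in the first two bullets can also be sketched generically: a content stream denoises the inverted source latent while the main stream reuses that stream's attention keys and values, so spatial layout carries over into the target. The `TinyAttnBlock`, the shapes, and the choice of exactly which tensors are shared are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch of a dual-stream denoising loop with attention-feature sharing.
# The modules and shapes below are toy stand-ins for an LDM U-Net's attention layers.
import torch
import torch.nn as nn

class TinyAttnBlock(nn.Module):
    """Single-head attention block that can reuse externally supplied keys/values."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x, ext_kv=None):
        k, v = (self.k(x), self.v(x)) if ext_kv is None else ext_kv
        attn = torch.softmax(self.q(x) @ k.transpose(-1, -2) / x.shape[-1] ** 0.5, dim=-1)
        return attn @ v, (k, v)              # return k/v so the other stream can reuse them

content_net, main_net = TinyAttnBlock(), TinyAttnBlock()
z_src = torch.randn(1, 16, 32)               # inverted source latent (toy shape)
z_tgt = z_src.clone()                        # target stream starts from the same latent

with torch.no_grad():
    for _ in range(10):                      # toy denoising loop (noise schedule omitted)
        z_src, kv = content_net(z_src)       # content stream: plain self-attention
        z_tgt, _ = main_net(z_tgt, ext_kv=kv)  # main stream: injects the content stream's keys/values

print(z_tgt.shape)
```

In a full system the main stream would presumably also be conditioned on the learned concept embedding, so the reference's appearance enters while the shared attention features keep the source layout.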
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, by reducing the need for large numbers of samples or extensive retraining, VCT could streamline workflows in creative industries such as game design, film, and art, where custom content generation is valuable. Theoretically, the success of VCT reaffirms the potential of diffusion models in complex generative tasks and challenges the dominance of GAN-based approaches in terms of adaptability and ease of use.
In terms of future developments, advancements could focus on optimizing the computational efficiency of the VCT framework and exploring its applications beyond visual styles, such as dynamic or interactive content synthesis. Additionally, incorporating elements of AI explainability could help users better understand and control the nuanced manipulations that VCT performs, leading to broader acceptance and trust in AI-generated outputs.
This paper represents a meaningful contribution to the field of generative AI, highlighting the possibilities afforded by combining state-of-the-art diffusion mechanisms with innovative embedding and fusion strategies. It sets a precedent for further exploration into more intuitive, real-image-guided synthesis models in AI research.