FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models

Published 18 Oct 2024 in cs.CV, cs.AI, and cs.LG | (2410.14429v1)

Abstract: Modeling and producing lifelike clothed human images has attracted researchers' attention from different areas for decades, with the complexity from highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, while are limited by the accuracy of modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, however still lacking in controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging generative power from diffusion models on the controlled basis of rendering. We introduce a novel framework to translate rendered images into their realistic counterparts, which consists of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained Text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with a Texture-preserving Attention Control (TAC) to preserve fine-grained clothing textures, exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.


Authors (7)

Summary

The paper proposes FashionR2R, a diffusion-based framework using Domain Knowledge Injection and Texture-preserving Attention Control to translate rendered fashion images into realistic versions while preserving textures.
Extensive experiments on SynFashion and Face Synthetics datasets show FashionR2R significantly outperforms state-of-the-art methods in realism and texture fidelity metrics like KID, LPIPS, and SSIM.
This research advances image translation for fashion e-commerce and digital content, showcasing diffusion models' potential while suggesting future work on computational efficiency and broader applicability.


The paper "FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models" presents a novel diffusion-based framework aimed at enhancing the photorealism of rendered fashion images. The framework addresses the inherent challenges of translating rendered images, which are often limited in realism due to imperfections in 3D models and rendering algorithms, into realistic counterparts that maintain fidelity to the original textures and designs.

Overview

The proposed method is structured around two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In the DKI stage, knowledge from both the real and the rendered domains is injected into a pretrained Text-to-Image (T2I) diffusion model. The real (positive) domain is injected by finetuning the model on real fashion images, and the rendered (negative) domain is captured by an embedding optimized on a large set of rendered images, which lets the model be steered away from rendered-domain characteristics.
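The role of a negative embedding can be illustrated with the standard classifier-free guidance combination. The sketch below is only an illustration under stated assumptions: the summary does not say how the negative domain embedding enters the denoising step, so placing it in the "negative" branch of the guidance formula is an assumption, and the function name `guided_noise` and the random arrays standing in for noise predictions are hypothetical.

```python
import numpy as np

def guided_noise(eps_positive, eps_negative, guidance_scale):
    """Classifier-free-guidance-style combination of two noise predictions.

    eps_positive: noise predicted with the positive (text) conditioning.
    eps_negative: noise predicted with the negative conditioning
                  (assumed here to be the rendered-domain embedding).
    The result moves away from eps_negative and towards eps_positive.
    """
    return eps_negative + guidance_scale * (eps_positive - eps_negative)

# Random arrays stand in for the outputs of a denoising network.
rng = np.random.default_rng(0)
eps_pos = rng.normal(size=(4, 8, 8))
eps_neg = rng.normal(size=(4, 8, 8))

eps = guided_noise(eps_pos, eps_neg, guidance_scale=7.5)
```

With a guidance scale of 1 the result equals the positive prediction; larger scales push the prediction further from the negative branch.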

In the RIG phase, the framework employs a Texture-preserving Attention Control (TAC) mechanism, which leverages the self-attention features in the shallow layers of the UNet architecture. This mechanism facilitates the preservation of fine-grained texture details in the clothing during the rendered-to-real image translation process.
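A generic form of self-attention control can be sketched as follows. This is a minimal sketch, not the paper's implementation: the summary does not state which attention features are shared, so the choice here (reusing the queries and keys of the rendered-image branch while keeping the values of the generation branch) is an assumption, and `controlled_attention` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # q, k, v: arrays of shape (tokens, dim).
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax(q @ k.T * scale)
    return weights @ v

def controlled_attention(src_qkv, tgt_qkv, inject):
    """Self-attention for the generation (target) branch.

    When inject is True, the attention weights are computed from the
    source branch's queries and keys (assumption), so the spatial layout
    of the rendered image drives how target values are mixed.
    """
    q_s, k_s, _ = src_qkv
    q_t, k_t, v_t = tgt_qkv
    if inject:
        return self_attention(q_s, k_s, v_t)
    return self_attention(q_t, k_t, v_t)

rng = np.random.default_rng(0)
src = tuple(rng.normal(size=(16, 32)) for _ in range(3))
tgt = tuple(rng.normal(size=(16, 32)) for _ in range(3))

out_shallow = controlled_attention(src, tgt, inject=True)   # e.g. a shallow layer
out_deep = controlled_attention(src, tgt, inject=False)     # e.g. a deeper layer
```

In a scheme like this, injection would be switched on only for the layers whose features carry texture, which the summary identifies as the shallow layers of the UNet.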

Methodology

The methodology is distinguished by the use of pretrained diffusion models, capitalizing on their generative power for domain translation tasks:

Domain Knowledge Injection (DKI): The strategy involves finetuning the base model on real fashion photos to enhance its capacity for generating realistic images. The negative domain embedding, obtained through optimization on rendered images, guides the model away from rendering artifacts during the denoising process, thereby producing more authentic image output.
Realistic Image Generation (RIG): This stage uses DDIM inversion to map the rendered image to a noisy latent, from which the denoising process generates its realistic counterpart. Texture-preserving Attention Control is applied during this generation so that fine-grained textures from the original image are retained, balancing realism against detail preservation.
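The DDIM inversion mentioned above can be sketched with the deterministic DDIM update run in both directions. This is a toy sketch under stated assumptions: `eps_model` is a hypothetical stand-in for the denoising network, the noise schedule `alphas` is made up, and the constant noise predictor is chosen only so that the round trip is exact. With a real network, inversion is an approximation.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha levels."""
    # Estimate the clean sample implied by x and the predicted noise.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise that estimate to the target level with the same noise.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(x0, eps_model, alphas):
    """Map a clean latent to a noisy one; alphas run from clean to noisy."""
    x = x0
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x

def ddim_sample(x_noisy, eps_model, alphas):
    """Run the same update in the opposite direction (noisy to clean)."""
    x = x_noisy
    reversed_alphas = alphas[::-1]
    for a_from, a_to in zip(reversed_alphas[:-1], reversed_alphas[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x

rng = np.random.default_rng(0)
fixed_eps = rng.normal(size=(4, 8, 8))
toy_eps_model = lambda x, alpha: fixed_eps   # hypothetical constant predictor

alphas = np.linspace(0.999, 0.05, 20)        # assumed schedule, clean to noisy
latent = rng.normal(size=(4, 8, 8))
inverted = ddim_invert(latent, toy_eps_model, alphas)
recovered = ddim_sample(inverted, toy_eps_model, alphas)
```

In a rendered-to-real setting, the sampling pass would use the adapted model and guidance instead of the same predictor, so the output departs from the input in appearance while starting from a latent that encodes its structure.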

Experimental Results

The authors conducted extensive experiments on the SynFashion dataset, a newly introduced dataset composed of high-quality rendered fashion images, and the Face Synthetics dataset. The results, evaluated using metrics such as Kernel Inception Distance (KID), Learned Perceptual Image Patch Similarity (LPIPS), and Structural Similarity Index (SSIM), show significant improvements over existing methods like CUT, SANTA, VCT, and UNSB in terms of both realism and texture fidelity.
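For reference, KID is the squared maximum mean discrepancy between two sets of image features under a cubic polynomial kernel. The sketch below computes the unbiased estimate on random arrays that stand in for Inception features. The feature extractor is omitted, the feature dimension and sample sizes are arbitrary assumptions, and published KID values are usually averaged over several subsets, which this sketch does not do.

```python
import numpy as np

def polynomial_kernel(a, b):
    # Cubic polynomial kernel used by KID: k(x, y) = (x . y / d + 1)^3.
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(feats_x, feats_y):
    """Unbiased squared-MMD estimate between two feature sets."""
    m, n = len(feats_x), len(feats_y)
    k_xx = polynomial_kernel(feats_x, feats_x)
    k_yy = polynomial_kernel(feats_y, feats_y)
    k_xy = polynomial_kernel(feats_x, feats_y)
    # Exclude the diagonal terms for the within-set averages.
    mean_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    mean_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return mean_xx + mean_yy - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
real_a = rng.normal(size=(200, 16))            # stand-in "real" features
real_b = rng.normal(size=(200, 16))            # second sample, same distribution
shifted = rng.normal(loc=1.0, size=(200, 16))  # stand-in "rendered" features

kid_same = kid(real_a, real_b)
kid_shifted = kid(real_a, shifted)
```

Lower KID means the two feature distributions are closer; here the shifted set scores much higher than the second sample drawn from the same distribution.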

Moreover, in the user studies, participants preferred the method over competing approaches in terms of perceived realism, image quality, and semantic consistency, and the quantitative results reflect the same advantages for both human faces and digital clothing.

Implications and Future Directions

This research advances the state of the art in image translation by narrowing the gap between computer-generated images and their realistic counterparts, which matters for applications in fashion e-commerce and digital content creation. The framework's reliance on diffusion models suggests it could apply to other domains that require high-fidelity image translation. Future work could explore optimizations that reduce computational cost and make real-time use feasible. Additionally, adopting more advanced base models and exploring inversion-free methods could improve translation speed and quality.

Overall, this paper contributes a significant methodological advancement to image translation tasks, providing a robust solution that harmonizes the generative strengths of diffusion models with domain-specific requirements.
