Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All

Published 24 Jan 2024 in cs.CV | (2401.13795v1)

Abstract: As online shopping is growing, the ability for buyers to virtually visualize products in their settings, a phenomenon we define as "Virtual Try-All", has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of products. In contrast, personalization-driven models such as DreamPaint are good at preserving the item's details but they are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details in a given reference item while ensuring accurate semantic manipulations in the given scene content. Our approach is based on incorporating fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, alongside a perceptual loss to further preserve the reference item's details. We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot diffusion inpainting methods as well as few-shot diffusion personalization algorithms like DreamPaint.

Summary

The paper introduces Diffuse to Choose (DTC), a latent diffusion model that improves image-conditioned inpainting for Virtual Try-All by preserving fine-grained product details.
It adds a secondary U-Net that feeds pixel-level hints from the reference image into the main U-Net's decoder through affine transformation layers.
Empirical evaluations indicate that DTC outperforms prior approaches in fidelity and semantic coherence while keeping inference fast enough for real-time use.

Introduction

As online shopping grows, so does the need to visualize products within a shopper's own environment, a task the paper calls Virtual Try-All (Vit-All). A Vit-All model semantically composes images: it embeds an online catalog item into a user-provided scene while preserving the item's details. The paper sets out three conditions for an effective Vit-All model: it must work in any 'in-the-wild' setting, it must integrate the product seamlessly while maintaining its identity, and it must run fast enough for real-time, large-scale deployment.

Model Development

Existing approaches have largely been task-specific or reliant on onerous 3D modeling, which makes scaling to vast product ranges impractical. Standard image-conditioned diffusion models, meanwhile, often fail to retain the fine details that identify a product. Diffuse to Choose (DTC), a latent diffusion model, targets this gap. DTC adds a secondary U-Net that channels pixel-level hints from the reference image into the main U-Net's decoder, where affine transformation layers apply them to the decoder's feature maps. This integration is meant to keep product-specific details intact while blending the item coherently into the scene. Unlike few-shot personalization methods such as DreamPaint, DTC operates in a zero-shot setting, so it needs no per-item fine-tuning at inference time.
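To make the idea of affine transformation layers concrete, here is a minimal pure-Python sketch of modulating main-branch features with parameters derived from hint features. It is an illustration under stated assumptions, not the paper's implementation: the function names, the per-channel granularity, and the rule that turns hint features into a scale and a shift are all hypothetical stand-ins for layers that a real model would learn.

```python
from typing import List, Tuple

Vector = List[float]        # flattened spatial values of one channel
FeatureMap = List[Vector]   # one Vector per channel


def affine_modulate(main: FeatureMap, scale: Vector, shift: Vector) -> FeatureMap:
    """Per-channel affine transform: out[c][i] = scale[c] * main[c][i] + shift[c]."""
    if not (len(main) == len(scale) == len(shift)):
        raise ValueError("need exactly one scale and one shift per channel")
    return [
        [scale[c] * value + shift[c] for value in channel]
        for c, channel in enumerate(main)
    ]


def hint_to_affine_params(hint: FeatureMap) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for a learned projection of hint features.

    Assumption for illustration only: scale = 1 + channel mean, shift = channel mean.
    """
    means = [sum(channel) / len(channel) for channel in hint]
    scale = [1.0 + m for m in means]
    shift = list(means)
    return scale, shift


def inject_hint(main: FeatureMap, hint: FeatureMap) -> FeatureMap:
    """Modulate main-branch decoder features with parameters derived from the hint."""
    scale, shift = hint_to_affine_params(hint)
    return affine_modulate(main, scale, shift)


if __name__ == "__main__":
    main_features = [[1.0, 2.0], [3.0, 4.0]]
    hint_features = [[0.0, 0.0], [1.0, 1.0]]
    print(inject_hint(main_features, hint_features))
```

In this toy parameterization an all-zero hint yields scale 1 and shift 0, so the main features pass through unchanged; the hint only alters the channels where it carries signal.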

Empirical Evaluation

The evaluations reported in the paper show DTC outperforming earlier image-conditioned inpainting models such as Paint By Example (PBE) as well as few-shot personalization algorithms such as DreamPaint. Several design choices contribute to this result: a perceptual loss that aligns low-level image features between the generated and reference items, a larger image-only encoder from DINOv2 that expands the model's capacity, and the use of all CLIP patches to represent item details more fully. Human evaluation studies further benchmark DTC against competing methods and support its advantage in the Vit-All domain, both in fidelity and in semantically coherent product placement.
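A perceptual loss of the kind mentioned above compares images in the feature space of a fixed network rather than pixel by pixel. The sketch below shows one common way such a term can be added to a denoising objective. It rests on assumptions that the text does not confirm: the weighted-sum form, the `weight` value, and `toy_features` (a hypothetical stand-in for a pretrained feature extractor) are all illustrative.

```python
from typing import Callable, Sequence, List


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_features(image: Sequence[float]) -> List[float]:
    """Hypothetical feature extractor: differences between adjacent pixels.

    A real perceptual loss would use activations of a pretrained network instead.
    """
    return [image[i + 1] - image[i] for i in range(len(image) - 1)]


def combined_loss(
    noise_pred: Sequence[float],
    noise_true: Sequence[float],
    image_pred: Sequence[float],
    image_ref: Sequence[float],
    feature_fn: Callable[[Sequence[float]], List[float]] = toy_features,
    weight: float = 0.1,  # assumed weighting, not taken from the paper
) -> float:
    """Denoising loss plus a weighted feature-space (perceptual) term."""
    denoising = mse(noise_pred, noise_true)
    perceptual = mse(feature_fn(image_pred), feature_fn(image_ref))
    return denoising + weight * perceptual


if __name__ == "__main__":
    loss = combined_loss(
        noise_pred=[0.0, 0.0],
        noise_true=[1.0, 1.0],
        image_pred=[0.0, 1.0, 2.0],
        image_ref=[0.0, 2.0, 4.0],
        weight=0.5,
    )
    print(loss)
```

Passing the feature extractor as an argument keeps the sketch independent of any particular network; swapping in real activations would change only `feature_fn`.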

Conclusion and Future Directions

While DTC represents a substantial step forward for Virtual Try-All, it is not without limitations. The model still has trouble with fine-grained text and with full-body product imagery, which motivates further research into auxiliary inputs such as pose detection. Nevertheless, DTC's combination of real-time performance and high detail retention makes it a promising tool for e-commerce: consumers can see a product placed directly into an image of their own space, going from viewing the product to visualizing it in context without manual image editing.
