Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing
The research paper titled "Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing" introduces a novel framework for fashion image editing, utilizing latent diffusion models (LDMs) conditioned on multimodal inputs. This approach marks a significant advancement in the application of diffusion models within the fashion domain, emphasizing the integration of text, human body poses, and garment sketches into the generative process.
Methodology
The paper introduces the task of multimodal-conditioned fashion image editing, in which the identity and body shape of the depicted person are preserved while the worn garment is replaced according to multimodal prompts (text, body pose, and garment sketch). The centerpiece of this work is the Multimodal Garment Designer (MGD) architecture, which applies latent diffusion models, until now little explored in the fashion domain, and steers the generation process with these multimodal inputs.
Key aspects of the methodology include:
- Human-Centric Inpainting: Garment substitution is cast as an inpainting problem guided by pose maps, which preserve the original human pose and identity during editing. This is crucial for keeping the generated depictions of human figures natural and coherent.
- Sketch Integration: Garment sketches complement the textual input, giving finer control over the garment's spatial characteristics and enabling more accurate, customized fashion designs.
- Conditional Guidance: The denoising network is trained to predict the noise added to the latent representation, conditioned on the multimodal inputs. The result is a guided diffusion process that synthesizes coherent, high-quality fashion images (a minimal sketch of this conditioning scheme follows the list).
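To make the conditioning mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of one training step. It assumes a Stable-Diffusion-style inpainting U-Net whose input channels are extended to accept the pose map and sketch, a frozen VAE encoder, a text encoder, and a noise scheduler exposing `add_noise`; all names (`unet`, `vae_encode`, `text_encode`, `scheduler`) are placeholders rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def training_step(unet, vae_encode, text_encode, scheduler,
                  image, masked_image, mask, pose_map, sketch, caption):
    # Encode the target image and the garment-masked image into latent space.
    latents = vae_encode(image)                    # [B, 4, h, w]
    masked_latents = vae_encode(masked_image)      # [B, 4, h, w]

    # Resize the spatial conditions to the latent resolution.
    size = latents.shape[-2:]
    mask_lr = F.interpolate(mask, size=size)       # [B, 1, h, w] inpainting mask
    pose_lr = F.interpolate(pose_map, size=size)   # [B, K, h, w] keypoint heatmaps (assumed layout)
    sketch_lr = F.interpolate(sketch, size=size)   # [B, 1, h, w] garment sketch

    # Standard forward diffusion: sample a timestep and add Gaussian noise.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Spatial conditions enter by channel-wise concatenation;
    # the text prompt conditions the U-Net through cross-attention.
    unet_input = torch.cat(
        [noisy_latents, mask_lr, masked_latents, pose_lr, sketch_lr], dim=1)
    noise_pred = unet(unet_input, t, encoder_hidden_states=text_encode(caption))

    # Epsilon objective: the network learns to predict the added noise.
    return F.mse_loss(noise_pred, noise)
```

In this layout, the spatial constraints (mask, pose map, sketch) are injected at the latent resolution, while the caption guides the network through cross-attention, mirroring the multimodal guidance described above.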
Dataset and Evaluation
To support their framework, the authors extend two existing datasets, Dress Code and VITON-HD, with multimodal annotations. These extended datasets enable a robust evaluation of the model's ability to generate realistic and coherent fashion images. Experimental results show that MGD outperforms GAN-based baselines and competing diffusion approaches in both paired and unpaired settings.
The evaluation employs FID and KID to quantify realism, together with newly proposed pose-distance and sketch-distance metrics that measure adherence to the input modalities. These results are corroborated by human evaluations, confirming the model's superior ability to generate realistic fashion images aligned with the provided multimodal prompts.
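As an illustration of how adherence to the pose condition can be quantified, the snippet below sketches one plausible pose-distance computation: keypoints are re-estimated on the generated image with an off-the-shelf pose detector and compared against the conditioning keypoints. The exact formulation and normalization in the paper may differ; this is only a schematic.

```python
import numpy as np

def pose_distance(kpts_generated: np.ndarray, kpts_condition: np.ndarray,
                  visible: np.ndarray) -> float:
    """Illustrative pose-distance score (not necessarily the paper's exact formula).

    kpts_generated: [K, 2] keypoints estimated on the generated image.
    kpts_condition: [K, 2] keypoints of the input pose map.
    visible:        [K] boolean mask of keypoints detected in both.
    Lower values mean the generated person adheres more closely to the
    conditioning pose.
    """
    # Euclidean distance per keypoint, averaged over the visible ones.
    diffs = np.linalg.norm(kpts_generated - kpts_condition, axis=-1)
    return float(diffs[visible].mean())
```

A sketch-distance score can be defined analogously, for example by extracting an edge map from the generated garment region and comparing it with the input sketch.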
Implications and Future Directions
This work has several significant implications:
- Theoretical Advancement: By integrating multiple input modalities, the paper extends the applicability of LDMs, showcasing their potential in complex image generation tasks beyond simple textual descriptions.
- Practical Application: The framework can potentially transform the fashion design process, offering designers a powerful tool for generating customized garment visualizations that closely follow prototype sketches and design briefs.
- Broader Impact on AI Research: This paper opens avenues for further exploration of multimodal diffusion models in other creative domains, such as architecture and interior design, where human-centric considerations and multimodal inputs are crucial.
Future research may focus on enhancing the fine-grained control capabilities by incorporating additional modalities such as textile patterns or 3D garment data. Moreover, exploring the efficiency and scalability of the model in large-scale applications could enable widespread use in industry settings.
In conclusion, the "Multimodal Garment Designer" lays a foundational framework for future research and applications in fashion and creative AI domains, offering insights into achieving high levels of realism and coherence in generative tasks.