Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing (2403.14828v2)
Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and to transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach generates human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we extend latent diffusion models to incorporate these modalities and modify the structure of the denoising network so that it takes multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let different cross-attention layers of the denoising network attend to textual and texture information, thereby incorporating conditioning details at different levels of granularity. Given the lack of datasets for this task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of the proposed approach in terms of realism and coherence with the provided multimodal inputs.
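The conditioning scheme sketched in the abstract can be pictured with a small PyTorch example. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (`MultimodalDenoiserSketch`, `CrossAttention`), the channel counts chosen for the pose map and sketch, and the single-block structure are all placeholders for readability. It shows spatial conditions (inpainting mask, masked-image latent, pose map, garment sketch) concatenated channel-wise with the noisy latent, while text tokens and texture pseudo-tokens obtained via textual inversion are injected through separate cross-attention layers.

```python
# Minimal sketch (not the authors' code) of conditioning a latent-diffusion denoiser
# on multimodal fashion inputs. Module names, channel counts, and the toy attention
# block are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Toy cross-attention: flattened latent features attend to prompt tokens."""
    def __init__(self, dim, ctx_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, ctx):
        # x: (B, HW, dim) latent features; ctx: (B, T, ctx_dim) prompt tokens
        out, _ = self.attn(self.norm(x), ctx, ctx)
        return x + out


class MultimodalDenoiserSketch(nn.Module):
    """Spatial conditions are concatenated with the noisy latent; text tokens and
    texture pseudo-tokens (from textual inversion) go through separate attentions."""
    def __init__(self, latent_ch=4, cond_ch=1 + 4 + 3 + 1, dim=128, ctx_dim=768):
        super().__init__()
        self.proj_in = nn.Conv2d(latent_ch + cond_ch, dim, 3, padding=1)
        self.text_attn = CrossAttention(dim, ctx_dim)      # coarse, semantic details
        self.texture_attn = CrossAttention(dim, ctx_dim)   # fine, texture details
        self.proj_out = nn.Conv2d(dim, latent_ch, 3, padding=1)

    def forward(self, z_t, mask, masked_latent, pose_map, sketch, text_tok, texture_tok):
        x = torch.cat([z_t, mask, masked_latent, pose_map, sketch], dim=1)
        h = self.proj_in(x)
        b, c, H, W = h.shape
        seq = h.flatten(2).transpose(1, 2)        # (B, HW, dim)
        seq = self.text_attn(seq, text_tok)       # attend to text prompt
        seq = self.texture_attn(seq, texture_tok) # attend to texture pseudo-words
        h = seq.transpose(1, 2).reshape(b, c, H, W)
        return self.proj_out(h)                   # predicted noise


# Example usage with dummy tensors at a 64x64 latent resolution.
model = MultimodalDenoiserSketch()
z_t = torch.randn(1, 4, 64, 64)
mask = torch.randn(1, 1, 64, 64)
masked_latent = torch.randn(1, 4, 64, 64)
pose_map = torch.randn(1, 3, 64, 64)
sketch = torch.randn(1, 1, 64, 64)
text_tok = torch.randn(1, 77, 768)     # e.g., CLIP text embeddings
texture_tok = torch.randn(1, 8, 768)   # pseudo-word embeddings from textual inversion
noise_pred = model(z_t, mask, masked_latent, pose_map, sketch, text_tok, texture_tok)
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```

In the full model these cross-attention layers would sit inside a U-Net denoiser at multiple resolutions; routing textual and texture information to different layers is what allows conditioning details to be injected at different granularities, as described in the abstract.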
Authors: Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara