Fashion Style Editing with Generative Human Prior (2404.01984v1)
Abstract: Image editing has been a long-standing challenge in the research community, with far-reaching impact on numerous applications. Recently, text-driven methods have begun to deliver promising results in domains like human faces, but their application to more complex domains has been relatively limited. In this work, we explore the task of fashion style editing, where we aim to manipulate the fashion style of human imagery using text descriptions. Specifically, we leverage a generative human prior and achieve fashion style editing by navigating its learned latent space. We first verify that existing text-driven editing methods fall short on our task due to their overly simplified guidance signal, and propose two directions to reinforce the guidance: textual augmentation and visual referencing. Combined with our empirical findings on the latent space structure, our Fashion Style Editing framework (FaSE) successfully projects abstract fashion concepts onto human images and introduces exciting new applications to the field.